Package org.apache.nutch.indexer.basic
Class BasicIndexingFilter
- java.lang.Object
-
- org.apache.nutch.indexer.basic.BasicIndexingFilter
-
- All Implemented Interfaces:
Configurable, IndexingFilter, Pluggable
public class BasicIndexingFilter extends Object implements IndexingFilter
Adds basic searchable fields to a document. The fields added are: domain, host, url, content, title, cache, tstamp.
- domain is included depending on indexer.add.domain in nutch-default.xml.
- title is truncated as per indexer.max.title.length in nutch-default.xml. (As per NUTCH-1004, a zero-length title is not added.)
- content is truncated as per indexer.max.content.length in nutch-default.xml.
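The truncation rules described above can be illustrated with a minimal sketch. This is not the Nutch implementation: TruncationSketch, truncate and buildFields are hypothetical names, a plain Map stands in for NutchDocument, and treating a negative limit as "no limit" is an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Nutch code: mimics the documented title/content rules.
final class TruncationSketch {

    // Cuts value to at most maxLength characters.
    // ASSUMPTION: a negative limit means "do not truncate" in this sketch.
    static String truncate(String value, int maxLength) {
        if (maxLength >= 0 && value.length() > maxLength) {
            return value.substring(0, maxLength);
        }
        return value;
    }

    // Builds a field map following the class description: content is truncated,
    // title is truncated, and a zero-length title is not added (NUTCH-1004).
    static Map<String, String> buildFields(String title, String content,
                                           int maxTitleLength, int maxContentLength) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", truncate(content, maxContentLength));
        String shortTitle = truncate(title, maxTitleLength);
        if (!shortTitle.isEmpty()) {
            fields.put("title", shortTitle);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Title limited to 6 characters, content unlimited.
        System.out.println(buildFields("Apache Nutch", "Some page text", 6, -1));
        // Empty title is skipped; content limited to 4 characters.
        System.out.println(buildFields("", "Some page text", 6, 4));
    }
}
```

In a real deployment the two limits would come from indexer.max.title.length and indexer.max.content.length rather than from method arguments.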
-
Field Summary
-
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
BasicIndexingFilter()
-
Method Summary
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields.
Configuration getConf()
Get the Configuration object.
void setConf(Configuration conf)
Set the Configuration object.
-
Method Detail
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
The BasicIndexingFilter filter object, which supports a few configuration settings for adding basic searchable fields. See indexer.add.domain, indexer.max.title.length and indexer.max.content.length in nutch-default.xml.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - The NutchDocument object
parse - The relevant Parse object passing through the filter
url - URL to be filtered for anchor text
datum - The CrawlDatum entry
inlinks - The Inlinks containing anchor text
- Returns:
filtered NutchDocument
- Throws:
IndexingException - if an error occurs during filtering
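As a rough illustration of how the url, host and optional domain fields relate, the sketch below derives them from a URL string. It is not Nutch code: HostDomainSketch, naiveDomain and urlFields are hypothetical names, a plain Map stands in for NutchDocument, the addDomain flag stands in for indexer.add.domain, and the "last two host labels" rule is a deliberate simplification that gives wrong answers for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Nutch code: shows the host/domain/url fields only.
final class HostDomainSketch {

    // Naive domain guess (ASSUMPTION for this sketch): keep the last two host labels.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Collects url, host and, only when addDomain is true, domain.
    static Map<String, String> urlFields(String url, boolean addDomain) {
        Map<String, String> fields = new LinkedHashMap<>();
        String host = URI.create(url).getHost();   // null when the URL has no host part
        if (host != null) {
            fields.put("host", host);
            if (addDomain) {
                fields.put("domain", naiveDomain(host));
            }
        }
        fields.put("url", url);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(urlFields("https://nutch.apache.org/index.html", true));
        System.out.println(urlFields("https://nutch.apache.org/index.html", false));
    }
}
```

The flag is passed as a parameter here to keep the sketch self-contained; the real filter reads its settings from the Configuration supplied through setConf.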
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
getConf in interface Configurable
-