Package org.apache.nutch.collection
Class Subcollection
- java.lang.Object
-
- org.apache.hadoop.conf.Configured
-
- org.apache.nutch.collection.Subcollection
-
- All Implemented Interfaces:
Configurable, URLFilter, Pluggable
public class Subcollection extends Configured implements URLFilter
Subcollection represents a subset of the index. You define URL patterns that indicate whether a particular page (URL) is part of the Subcollection.
-
-
Field Summary
Fields
Modifier and Type    Field
static String    TAG_BLACKLIST
static String    TAG_COLLECTION
static String    TAG_COLLECTIONS
static String    TAG_ID
static String    TAG_KEY
static String    TAG_NAME
static String    TAG_WHITELIST
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor    Description
Subcollection(String id, String name, String key, Configuration conf)    Public constructor
Subcollection(String id, String name, Configuration conf)    Public constructor
Subcollection(Configuration conf)
-
Method Summary
Modifier and Type    Method    Description
String    filter(String urlString)    Simple "indexOf" filter for matching patterns.
String    getBlackListString()    Returns the blacklist String
String    getId()
String    getKey()
String    getName()
List<String>    getWhiteList()    Returns the whitelist
String    getWhiteListString()    Returns the whitelist String
void    initialize(Element collection)    Initialize Subcollection from a DOM element
protected void    parseList(List<String> list, String text)    Create a list of patterns from a chunk of text; patterns are separated by newlines
void    setBlackList(String list)    Set contents of blacklist from a String
void    setWhiteList(String list)    Set contents of whitelist from a String
void    setWhiteList(ArrayList<String> whiteList)
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
-
-
-
Field Detail
-
TAG_COLLECTIONS
public static final String TAG_COLLECTIONS
- See Also:
- Constant Field Values
-
TAG_COLLECTION
public static final String TAG_COLLECTION
- See Also:
- Constant Field Values
-
TAG_WHITELIST
public static final String TAG_WHITELIST
- See Also:
- Constant Field Values
-
TAG_BLACKLIST
public static final String TAG_BLACKLIST
- See Also:
- Constant Field Values
-
TAG_NAME
public static final String TAG_NAME
- See Also:
- Constant Field Values
-
TAG_KEY
public static final String TAG_KEY
- See Also:
- Constant Field Values
-
TAG_ID
public static final String TAG_ID
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
Subcollection
public Subcollection(String id, String name, Configuration conf)
Public constructor.
- Parameters:
id - Id of the Subcollection
name - Name of the Subcollection
conf - A populated Configuration
-
Subcollection
public Subcollection(String id, String name, String key, Configuration conf)
Public constructor.
- Parameters:
id - Id of the Subcollection
name - Name of the Subcollection
key - Subcollection key
conf - A populated Configuration
-
Subcollection
public Subcollection(Configuration conf)
-
-
Method Detail
-
getName
public String getName()
- Returns:
- Returns the name
-
getKey
public String getKey()
- Returns:
- Returns the key
-
getId
public String getId()
- Returns:
- Returns the id
-
getWhiteListString
public String getWhiteListString()
Returns the whitelist String.
- Returns:
- Whitelist String
-
getBlackListString
public String getBlackListString()
Returns the blacklist String.
- Returns:
- Blacklist String
-
setWhiteList
public void setWhiteList(ArrayList<String> whiteList)
- Parameters:
whiteList - The whiteList to set.
-
filter
public String filter(String urlString)
Simple "indexOf" (substring) filter for matching patterns. The rules are evaluated in the following order:
1. If a pattern in the blacklist matches, the URL is rejected.
2. If a pattern in the whitelist matches, the URL is allowed.
3. Otherwise, the URL is rejected.
- Specified by:
filter in interface URLFilter
- Parameters:
urlString - the URL string the filter is applied on
- Returns:
- the original URL string if the URL is accepted by the filter, or null if the URL is rejected
- See Also:
URLFilter.filter(java.lang.String)
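The evaluation order described for filter can be sketched as a standalone method. This is a minimal illustration, not the Nutch source: the class name SubcollectionFilterSketch, the explicit blackList and whiteList parameters, and the example URLs are assumptions made for the example, and matching is plain substring search, as the "indexOf" wording suggests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not part of org.apache.nutch.collection.
class SubcollectionFilterSketch {

  /** Returns urlString if it is accepted, or null if it is rejected. */
  static String filter(String urlString, List<String> blackList, List<String> whiteList) {
    // Rule 1: any blacklist pattern found in the URL rejects it.
    for (String pattern : blackList) {
      if (urlString.indexOf(pattern) != -1) {
        return null;
      }
    }
    // Rule 2: any whitelist pattern found in the URL accepts it.
    for (String pattern : whiteList) {
      if (urlString.indexOf(pattern) != -1) {
        return urlString;
      }
    }
    // Rule 3: no pattern matched, so the URL is rejected.
    return null;
  }

  public static void main(String[] args) {
    List<String> black = Arrays.asList("/private/");
    List<String> white = Arrays.asList("example.org");

    // Accepted: matches the whitelist and nothing in the blacklist.
    System.out.println(filter("http://example.org/docs/a.html", black, white));
    // Rejected: the blacklist is checked first, so it wins over the whitelist.
    System.out.println(filter("http://example.org/private/a.html", black, white));
    // Rejected: matches neither list.
    System.out.println(filter("http://other.example.com/", black, white));
  }
}
```

Because the blacklist is checked first, a URL that matches both lists is rejected, and a Subcollection with an empty whitelist accepts nothing.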
-
initialize
public void initialize(Element collection)
Initialize Subcollection from a DOM element.
- Parameters:
collection - A DOM Element for use in creating the Subcollection
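Reading values out of a DOM Element might look roughly like the sketch below. It uses only the standard javax.xml and org.w3c.dom APIs and is not the Nutch implementation: the class DomInitSketch, its helper methods, the sample XML, and the element names "name" and "whitelist" are all assumptions for illustration. The actual element names are the values of the TAG_* constants, which are not listed on this page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of org.apache.nutch.collection.
class DomInitSketch {

  // Assumed element names; the real values come from the TAG_* constants.
  static final String ASSUMED_TAG_NAME = "name";
  static final String ASSUMED_TAG_WHITELIST = "whitelist";

  /** Returns the trimmed text of the first descendant named tag, or "" if absent. */
  static String childText(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
  }

  /** Parses an XML string and returns its root element. */
  static Element parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    // Sample document shape is an assumption, not the documented Nutch format.
    Element collection = parse(
        "<collection><name>docs</name>"
            + "<whitelist>http://example.org/docs/</whitelist></collection>");
    System.out.println(childText(collection, ASSUMED_TAG_NAME));
    System.out.println(childText(collection, ASSUMED_TAG_WHITELIST));
  }
}
```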
-
parseList
protected void parseList(List<String> list, String text)
Create a list of patterns from a chunk of text; patterns are separated by newlines.
- Parameters:
list - An initialized List to insert String patterns.
text - A chunk of text (hopefully) containing patterns.
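A newline-separated pattern list could be parsed along these lines. This is a sketch under stated assumptions rather than the Nutch implementation: clearing the list first, trimming each line, and skipping blank lines are choices made for this example, and PatternListSketch is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of org.apache.nutch.collection.
class PatternListSketch {

  /** Fills list with one pattern per non-blank line of text. */
  static void parseList(List<String> list, String text) {
    list.clear(); // assumption: previous contents are replaced, not appended to
    for (String line : text.split("\\r?\\n")) {
      String pattern = line.trim(); // assumption: surrounding whitespace is ignored
      if (!pattern.isEmpty()) {     // assumption: blank lines carry no pattern
        list.add(pattern);
      }
    }
  }

  public static void main(String[] args) {
    List<String> patterns = new ArrayList<>();
    parseList(patterns, "http://example.org/docs/\n  http://example.org/news/  \n\n");
    System.out.println(patterns);
  }
}
```

Skipping blank lines matters with substring matching: an empty pattern is contained in every URL, so a stray empty entry in a blacklist would reject everything.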
-
setBlackList
public void setBlackList(String list)
Set contents of blacklist from a String.
- Parameters:
list - the blacklist contents
-
setWhiteList
public void setWhiteList(String list)
Set contents of whitelist from a String.
- Parameters:
list - the whitelist contents
-
-