Class URLNormalizers
- java.lang.Object
-
- org.apache.nutch.net.URLNormalizers
-
public final class URLNormalizers extends Object
This class uses a "chained filter" pattern to run defined normalizers. Different lists of normalizers may be defined for different "scopes", or contexts where they are used (note however that they need to be activated first through the plugin.include property).
There is one global scope defined by default, which consists of all active normalizers. The order in which these normalizers are executed may be defined in the "urlnormalizer.order" property, which lists space-separated implementation classes (if this property is missing, normalizers will be run in random order). If more normalizers are activated than are explicitly named on this list, the remaining ones will be run in random order after the ones specified on the list are executed.
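The ordering rule described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Nutch implementation: the class NormalizerOrderSketch, the method order, and the example.* class names are hypothetical, and the sketch keeps unlisted normalizers in activation order where the real class makes no such guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not part of Nutch.
public class NormalizerOrderSketch {

    /**
     * Returns the active normalizer class names with those named in the
     * space-separated order property first (in listed order), followed
     * by the remaining active ones.
     */
    static List<String> order(List<String> active, String orderProperty) {
        List<String> result = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            for (String name : orderProperty.trim().split("\\s+")) {
                // Only names that are actually activated take part.
                if (active.contains(name) && !result.contains(name)) {
                    result.add(name);
                }
            }
        }
        // Assumption: remaining normalizers keep activation order here;
        // the documentation only says they run after the listed ones.
        for (String name : active) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList(
                "example.BasicNormalizer",
                "example.RegexNormalizer",
                "example.PassNormalizer");
        // Value as it might appear in the "urlnormalizer.order" property.
        String orderProperty = "example.RegexNormalizer example.BasicNormalizer";
        System.out.println(order(active, orderProperty));
    }
}
```

With the values in main, the two listed classes come first and example.PassNormalizer follows them.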
You can define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (defined in "urlnormalizer.scope.<scope_name>" property) and its own order (defined in "urlnormalizer.order.<scope_name>" property). If any of these properties are missing, default settings are used for the global scope.
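The per-scope fallback can be illustrated with a small lookup over java.util.Properties. This is a sketch only: Nutch reads these settings from its own configuration object, and the class ScopePropertySketch, the method orderFor, the scope names "fetcher" and "inject", and example.MyNormalizer are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical sketch: not part of Nutch.
public class ScopePropertySketch {

    /**
     * Looks up the order for a scope, falling back to the global
     * "urlnormalizer.order" property when no scoped value is set.
     */
    static String orderFor(Properties conf, String scope) {
        String scoped = conf.getProperty("urlnormalizer.order." + scope);
        return scoped != null ? scoped : conf.getProperty("urlnormalizer.order");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Global order, used by every scope without its own setting.
        conf.setProperty("urlnormalizer.order", "example.MyNormalizer");
        // Scope-specific order for an assumed scope named "fetcher".
        conf.setProperty("urlnormalizer.order.fetcher",
                "org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer");

        System.out.println(orderFor(conf, "fetcher")); // scoped value
        System.out.println(orderFor(conf, "inject"));  // falls back to global
    }
}
```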
If no normalizers are required for a given scope, org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer should be used.
Each normalizer may further select among many configurations, depending on the scope in which it is called, because the scope name is passed as a parameter to each normalizer. You can also use the same normalizer for many scopes.
Several scopes have been defined, and various Nutch tools try scope-specific normalizers first (and fall back to the default configuration if the scope-specific configuration is missing).
Normalizers may be run several times, to ensure that modifications introduced by normalizers at the end of the list can be further reduced by normalizers executed at the beginning. By default this loop is executed just once. If you want to ensure that all possible combinations have been applied, you may want to run this loop up to the number of activated normalizers. The loop count can be configured through the urlnormalizer.loop.count property. As soon as the URL is unchanged, the loop stops and returns the result.
- Author:
- Andrzej Bialecki
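The repeat loop described in the class comment can be sketched as follows. This is a minimal model, not the Nutch source: normalizers are represented as UnaryOperator<String> functions, and the class NormalizerLoopSketch, its two toy normalizers, and the loopCount parameter (standing in for urlnormalizer.loop.count) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: not part of Nutch.
public class NormalizerLoopSketch {

    // Toy normalizer: drops a trailing "index.html" after the last slash.
    static final UnaryOperator<String> STRIP_INDEX = u ->
            u.endsWith("/index.html")
                    ? u.substring(0, u.length() - "index.html".length())
                    : u;

    // Toy normalizer: drops a trailing "?" (empty query string).
    static final UnaryOperator<String> STRIP_EMPTY_QUERY = u ->
            u.endsWith("?") ? u.substring(0, u.length() - 1) : u;

    static final List<UnaryOperator<String>> CHAIN =
            Arrays.asList(STRIP_INDEX, STRIP_EMPTY_QUERY);

    /** Runs the whole chain up to loopCount times, stopping early once a pass changes nothing. */
    static String normalize(String url, List<UnaryOperator<String>> chain, int loopCount) {
        String current = url;
        for (int pass = 0; pass < loopCount; pass++) {
            String before = current;
            for (UnaryOperator<String> normalizer : chain) {
                current = normalizer.apply(current);
            }
            if (current.equals(before)) {
                break; // URL unchanged: further passes would be no-ops
            }
        }
        return current;
    }

    public static void main(String[] args) {
        String url = "http://example.com/index.html?";
        // One pass: only the later normalizer has an effect.
        System.out.println(normalize(url, CHAIN, 1)); // http://example.com/index.html
        // Two passes: the earlier normalizer can now reduce the result.
        System.out.println(normalize(url, CHAIN, 2)); // http://example.com/
    }
}
```

The example shows why more than one pass can matter: the second normalizer in the list exposes a suffix that only the first one knows how to remove.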
Field Summary
Fields
Modifier and Type | Field | Description
static String | SCOPE_CRAWLDB | Scope used when updating the CrawlDb with new URLs.
static String | SCOPE_DEFAULT | Default scope.
static String | SCOPE_FETCHER | Scope used by Fetcher when processing redirect URLs.
static String | SCOPE_GENERATE_HOST_COUNT | Scope used by Generator.
static String | SCOPE_INDEXER | Scope used when indexing URLs.
static String | SCOPE_INJECT | Scope used by Injector.
static String | SCOPE_LINKDB | Scope used when updating the LinkDb with new URLs.
static String | SCOPE_OUTLINK | Scope used when constructing new Outlink instances.
static String | SCOPE_PARTITION | Scope used by URLPartitioner.
-
Constructor Summary
Constructors
Constructor | Description
URLNormalizers(Configuration conf, String scope) |
Field Detail
-
SCOPE_DEFAULT
public static final String SCOPE_DEFAULT
Default scope. If no scope properties are defined then the configuration for this scope will be used.
- See Also:
- Constant Field Values
-
SCOPE_PARTITION
public static final String SCOPE_PARTITION
Scope used by URLPartitioner.
- See Also:
- Constant Field Values
-
SCOPE_GENERATE_HOST_COUNT
public static final String SCOPE_GENERATE_HOST_COUNT
Scope used by Generator.
- See Also:
- Constant Field Values
-
SCOPE_FETCHER
public static final String SCOPE_FETCHER
Scope used by Fetcher when processing redirect URLs.
- See Also:
- Constant Field Values
-
SCOPE_CRAWLDB
public static final String SCOPE_CRAWLDB
Scope used when updating the CrawlDb with new URLs.
- See Also:
- Constant Field Values
-
SCOPE_LINKDB
public static final String SCOPE_LINKDB
Scope used when updating the LinkDb with new URLs.
- See Also:
- Constant Field Values
-
SCOPE_INJECT
public static final String SCOPE_INJECT
Scope used by Injector.
- See Also:
- Constant Field Values
-
SCOPE_OUTLINK
public static final String SCOPE_OUTLINK
Scope used when constructing new Outlink instances.
- See Also:
- Constant Field Values
-
SCOPE_INDEXER
public static final String SCOPE_INDEXER
Scope used when indexing URLs.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
URLNormalizers
public URLNormalizers(Configuration conf, String scope)
-
-
Method Detail
-
normalize
public String normalize(String urlString, String scope) throws MalformedURLException
Normalizes the given URL string using the normalizers configured for the given scope.
- Parameters:
urlString - The URL string to normalize.
scope - The given scope.
- Returns:
- A normalized String, using the given scope
- Throws:
MalformedURLException - If the given URL string is malformed.