Package org.apache.nutch.util
Class SitemapProcessor
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.SitemapProcessor
All Implemented Interfaces:
Configurable, Tool
public class SitemapProcessor extends Configured implements Tool
Performs sitemap processing by fetching sitemap links, parsing the content and merging the URLs from sitemaps (with the metadata) into the CrawlDb.
There are two use cases supported in Nutch's sitemap processing:
- Sitemaps are treated as "remote seed lists". Crawl administrators can prepare a list of sitemap links, then inject and fetch only the pages listed in those sitemaps. This is well suited to targeted crawls of specific hosts.
- For an open web crawl, it is not possible to track each host and get the sitemap links manually. Nutch automatically detects the sitemaps for all hosts seen in the crawls and present in the HostDb and injects the URLs from the sitemaps into the CrawlDb.
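The merge step described above can be pictured with a minimal, self-contained sketch. It is not Nutch's implementation: the class `SitemapMergeSketch`, the `merge` method, and the use of a plain map in place of the CrawlDb are all hypothetical, and the overwrite behaviour is an assumption inferred from the name of the `SITEMAP_OVERWRITE_EXISTING` field rather than from documented semantics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical in-memory stand-in, not part of Nutch.
class SitemapMergeSketch {

    /**
     * Merges URLs read from sitemaps into a copy of the existing records.
     * Keys are URLs; values are a last-modified string standing in for metadata.
     * Assumption: existing records are replaced only when overwriteExisting is true.
     */
    static Map<String, String> merge(Map<String, String> crawlDb,
                                     Map<String, String> fromSitemaps,
                                     boolean overwriteExisting) {
        // Copy first so the caller's map is left untouched.
        Map<String, String> merged = new LinkedHashMap<>(crawlDb);
        for (Map.Entry<String, String> entry : fromSitemaps.entrySet()) {
            if (overwriteExisting) {
                merged.put(entry.getKey(), entry.getValue());
            } else {
                // New URLs are added; URLs already present keep their old value.
                merged.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("https://example.org/", "2020-01-01");

        Map<String, String> fromSitemaps = new LinkedHashMap<>();
        fromSitemaps.put("https://example.org/", "2021-06-15");
        fromSitemaps.put("https://example.org/about", "2021-06-15");

        System.out.println(merge(crawlDb, fromSitemaps, false));
        System.out.println(merge(crawlDb, fromSitemaps, true));
    }
}
```

In both calls the URL that appears only in the sitemap is added; the two results differ only in whether the record that already existed keeps its old metadata or takes the sitemap's.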
See Also:
SitemapFeature
Field Summary
Fields
Modifier and Type          Field                                    Description
static String              CURRENT_NAME
static String              LOCK_NAME
static SimpleDateFormat    sdf
static String              SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
static String              SITEMAP_OVERWRITE_EXISTING
static String              SITEMAP_REDIR_MAX
static String              SITEMAP_SIZE_MAX
static String              SITEMAP_STRICT_PARSING
static String              SITEMAP_URL_FILTERING
static String              SITEMAP_URL_NORMALIZING
-
Constructor Summary
Constructors
Constructor           Description
SitemapProcessor()
-
Method Summary
Modifier and Type    Method                  Description
static void          main(String[] args)
int                  run(String[] args)
void                 sitemap(Path crawldb, Path hostdb, Path sitemapUrlDir, boolean strict, boolean filter, boolean normalize, int threads)
static void          usage()
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
sdf
public static final SimpleDateFormat sdf
-
CURRENT_NAME
public static final String CURRENT_NAME
- See Also:
- Constant Field Values
-
LOCK_NAME
public static final String LOCK_NAME
- See Also:
- Constant Field Values
-
SITEMAP_STRICT_PARSING
public static final String SITEMAP_STRICT_PARSING
- See Also:
- Constant Field Values
-
SITEMAP_URL_FILTERING
public static final String SITEMAP_URL_FILTERING
- See Also:
- Constant Field Values
-
SITEMAP_URL_NORMALIZING
public static final String SITEMAP_URL_NORMALIZING
- See Also:
- Constant Field Values
-
SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
public static final String SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
- See Also:
- Constant Field Values
-
SITEMAP_OVERWRITE_EXISTING
public static final String SITEMAP_OVERWRITE_EXISTING
- See Also:
- Constant Field Values
-
SITEMAP_REDIR_MAX
public static final String SITEMAP_REDIR_MAX
- See Also:
- Constant Field Values
-
SITEMAP_SIZE_MAX
public static final String SITEMAP_SIZE_MAX
- See Also:
- Constant Field Values
-
-