Package org.apache.nutch.crawl
Class DeduplicationJob
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.NutchTool
      org.apache.nutch.crawl.DeduplicationJob

All Implemented Interfaces:
Configurable, Tool
public class DeduplicationJob extends NutchTool implements Tool
Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, then the document with the latest timestamp is kept. If the documents have the same timestamp then the one with the shortest URL is kept. The documents marked as duplicate can then be deleted with the command CleaningJob.
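The tie-breaking order described above (highest score, then latest timestamp, then shortest URL) can be sketched as a plain Java comparator. This is a minimal illustration and not the DeduplicationJob implementation: the `Doc` class, the `KEEP_FIRST` comparator and the `selectKept` method are hypothetical names introduced here, and the sketch assumes every document in the list already shares the same digest.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DedupOrderSketch {

    /** Hypothetical stand-in for one crawldb entry; not a Nutch class. */
    static final class Doc {
        final String url;
        final float score;
        final long fetchTime;

        Doc(String url, float score, long fetchTime) {
            this.url = url;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Orders documents so that the one to keep sorts first:
     * highest score, then latest timestamp, then shortest URL.
     */
    static final Comparator<Doc> KEEP_FIRST =
        Comparator.comparingDouble((Doc d) -> d.score).reversed()
            .thenComparing(Comparator.comparingLong((Doc d) -> d.fetchTime).reversed())
            .thenComparingInt(d -> d.url.length());

    /** Returns the document to keep; every other entry would be marked as duplicate. */
    static Doc selectKept(List<Doc> sameDigest) {
        return Collections.min(sameDigest, KEEP_FIRST);
    }

    public static void main(String[] args) {
        // Three made-up documents assumed to share one digest.
        List<Doc> group = Arrays.asList(
            new Doc("http://example.com/a/index.html", 1.0f, 100L),
            new Doc("http://example.com/a/", 1.0f, 200L),
            new Doc("http://example.com/a", 0.5f, 300L));

        Doc kept = selectKept(group);
        for (Doc d : group) {
            System.out.println((d == kept ? "keep      " : "duplicate ") + d.url);
        }
    }
}
```

In this example the first two documents tie on score, so the later timestamp decides and `http://example.com/a/` is kept. The third document has the shortest URL but a lower score, so URL length never comes into play for it.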
Nested Class Summary
Nested Classes
Modifier and Type    Class                                                Description
static class         DeduplicationJob.DBFilter
static class         DeduplicationJob.DedupReducer<K extends Writable>
static class         DeduplicationJob.StatusUpdateReducer                 Combine multiple new entries for a url.

Field Summary
Fields
Modifier and Type          Field
protected static String    DEDUPLICATION_COMPARE_ORDER
protected static String    DEDUPLICATION_GROUP_MODE
protected static Text      urlKey
protected static String    UTF_8

Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status

Constructor Summary
Constructors
Constructor
DeduplicationJob()

Method Summary
Modifier and Type     Method                                           Description
static void           main(String[] args)
int                   run(String[] args)
Map<String,Object>    run(Map<String,Object> args, String crawlId)     Runs the tool, using a map of arguments.

Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

Field Detail
urlKey
protected static final Text urlKey

DEDUPLICATION_GROUP_MODE
protected static final String DEDUPLICATION_GROUP_MODE
See Also:
Constant Field Values

DEDUPLICATION_COMPARE_ORDER
protected static final String DEDUPLICATION_COMPARE_ORDER
See Also:
Constant Field Values

UTF_8
protected static final String UTF_8

Method Detail
run
public int run(String[] args) throws IOException
Specified by:
run in interface Tool
Throws:
IOException

run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.
Specified by:
run in class NutchTool
Parameters:
args - a Map of arguments to be run with the tool
crawlId - a crawl identifier to associate with the tool invocation
Returns:
Map results object if tool executes successfully otherwise null
Throws:
Exception - if there is an error during the tool execution