Package org.apache.nutch.hostdb
Class UpdateHostDbMapper
- java.lang.Object
  - org.apache.hadoop.mapreduce.Mapper<Text,Writable,Text,NutchWritable>
    - org.apache.nutch.hostdb.UpdateHostDbMapper
public class UpdateHostDbMapper extends Mapper<Text,Writable,Text,NutchWritable>
Mapper ingesting HostDB and CrawlDB entries. It can also read host score information from a plain-text key/value file generated by the Webgraph's NodeDumper tool.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
Mapper.Context
-
-
Field Summary
Fields
Modifier and Type    Field    Description
protected String[]    args
protected String    buffer
protected CrawlDatum    crawlDatum
protected boolean    filter
protected URLFilters    filters
protected Text    host
protected HostDatum    hostDatum
protected boolean    normalize
protected URLNormalizers    normalizers
protected boolean    readingCrawlDb
protected String    reprUrl
-
Constructor Summary
Constructors
Constructor    Description
UpdateHostDbMapper()
-
Method Summary
Modifier and Type    Method    Description
protected String    filterNormalize(String hostName)    Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
void    map(Text key, Writable value, Mapper.Context context)    Mapper ingesting records from the HostDB, CrawlDB and plain-text host scores file.
void    setup(Mapper.Context context)
-
-
-
Field Detail
-
host
protected Text host
-
hostDatum
protected HostDatum hostDatum
-
crawlDatum
protected CrawlDatum crawlDatum
-
reprUrl
protected String reprUrl
-
buffer
protected String buffer
-
args
protected String[] args
-
filter
protected boolean filter
-
normalize
protected boolean normalize
-
readingCrawlDb
protected boolean readingCrawlDb
-
filters
protected URLFilters filters
-
normalizers
protected URLNormalizers normalizers
-
-
Method Detail
-
setup
public void setup(Mapper.Context context)
-
filterNormalize
protected String filterNormalize(String hostName)
Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
Parameters:
hostName - the input hostname
Returns:
the normalized hostname, or null if the URL is excluded by the URL filters or fails to be normalized or converted
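The filter-and-normalize flow described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class name `FilterNormalizeSketch` is hypothetical, and the two `UnaryOperator<String>` arguments are stand-ins for the configured `URLNormalizers` and `URLFilters`. The sketch assumes only what the description states, namely that the hostname is wrapped as "http://hostname/", that either step may reject the URL by returning null, and that a hostname is returned on success.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class FilterNormalizeSketch {

  /**
   * Hypothetical stand-in for filterNormalize(String). The normalizer and
   * filter parameters are illustrative substitutes for Nutch's
   * URLNormalizers and URLFilters; either may be null to skip that step.
   */
  public static String filterNormalize(String hostName,
      UnaryOperator<String> normalizer, UnaryOperator<String> filter) {
    // A full URL is needed so that URL-based normalizers and filters apply.
    String url = "http://" + hostName + "/";

    if (normalizer != null) {
      url = normalizer.apply(url);
    }
    if (url != null && filter != null) {
      url = filter.apply(url);
    }
    if (url == null) {
      return null; // rejected by the normalizer or the filter
    }

    try {
      // Convert the surviving URL back to a bare hostname.
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null; // could not be converted back to a hostname
    }
  }

  public static void main(String[] args) {
    UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
    UnaryOperator<String> rejectInternal =
        u -> u.contains(".internal/") ? null : u;

    System.out.println(
        filterNormalize("WWW.Example.COM", lowerCase, rejectInternal));
    System.out.println(
        filterNormalize("db.internal", lowerCase, rejectInternal));
  }
}
```

Running `main` prints `www.example.com` for the first call and `null` for the second, since the illustrative filter rejects hosts ending in `.internal`.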
-
map
public void map(Text key, Writable value, Mapper.Context context) throws IOException, InterruptedException
Mapper ingesting records from the HostDB, CrawlDB and plain-text host scores file. Statistics and scores are passed on.
Overrides:
map in class Mapper<Text,Writable,Text,NutchWritable>
Parameters:
key - record Text key
value - associated Writable object
context - Mapper.Context for writing custom counters and output
Throws:
IOException
InterruptedException
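As an illustration of the plain-text host scores input, the sketch below parses a single key/value line. It rests on assumptions this page does not confirm: that key and value are separated by a tab (the default for Hadoop text output), that the key is a hostname, and that the value is a decimal score. The names `HostScoreLineSketch` and `HostScore` are hypothetical and are not Nutch types.

```java
public class HostScoreLineSketch {

  /** Hypothetical holder for one parsed line; not a Nutch type. */
  public static final class HostScore {
    public final String host;
    public final float score;

    HostScore(String host, float score) {
      this.host = host;
      this.score = score;
    }
  }

  /**
   * Parses one "key<TAB>value" line. Assumption: the key is a hostname and
   * the value is a decimal score. Returns null for lines that do not match.
   */
  public static HostScore parse(String line) {
    if (line == null) {
      return null;
    }
    String[] parts = line.split("\t");
    if (parts.length != 2 || parts[0].isEmpty()) {
      return null; // not exactly one key and one value
    }
    try {
      return new HostScore(parts[0], Float.parseFloat(parts[1].trim()));
    } catch (NumberFormatException e) {
      return null; // value is not a number
    }
  }

  public static void main(String[] args) {
    HostScore ok = parse("example.org\t0.75");
    System.out.println(ok.host + " -> " + ok.score);
    System.out.println(parse("no-tab-here"));
  }
}
```

Returning null for malformed lines lets a caller skip bad input without aborting the whole job; a real mapper might instead increment a counter through its context.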
-
-