Package org.apache.nutch.crawl
Class CrawlDbReader
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.AbstractChecker
      - org.apache.nutch.crawl.CrawlDbReader
- All Implemented Interfaces:
Closeable, AutoCloseable, Configurable, Tool
public class CrawlDbReader extends AbstractChecker implements Closeable
Read utility for the CrawlDB.
- Author:
Andrzej Bialecki
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
static class    CrawlDbReader.CrawlDatumCsvOutputFormat
static class    CrawlDbReader.CrawlDatumJsonOutputFormat
static class    CrawlDbReader.CrawlDbDumpMapper
static class    CrawlDbReader.CrawlDbStatMapper
static class    CrawlDbReader.CrawlDbStatReducer
static class    CrawlDbReader.CrawlDbTopNMapper
static class    CrawlDbReader.CrawlDbTopNReducer
static class    CrawlDbReader.JsonIndenter
-
Field Summary
Fields
Modifier and Type    Field    Description
protected String    crawlDb
Fields inherited from class org.apache.nutch.util.AbstractChecker
keepClientCnxOpen, stdin, tcpPort, usage
-
-
Constructor Summary
Constructors
Constructor    Description
CrawlDbReader()
-
Method Summary
All Methods
Modifier and Type    Method    Description
void    close()
CrawlDatum    get(String crawlDb, String url, Configuration config)
static void    main(String[] args)
protected int    process(String line, StringBuilder output)
void    processDumpJob(String crawlDb, String output, Configuration config, String format, String regex, String status, Integer retry, String expr, Float sample)
void    processStatJob(String crawlDb, Configuration config, boolean sort)
void    processTopNJob(String crawlDb, long topN, float min, String output, Configuration config)
Object    query(Map<String,String> args, Configuration conf, String type, String crawlId)
void    readUrl(String crawlDb, String url, Configuration config, StringBuilder output)
int    run(String[] args)
Methods inherited from class org.apache.nutch.util.AbstractChecker
getProtocolOutput, parseArgs, processSingle, processStdin, processTCP, run
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
crawlDb
protected String crawlDb
-
-
Method Detail
-
close
public void close()
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
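Because the class implements Closeable and AutoCloseable, an instance can be managed by a try-with-resources statement, which calls close() automatically when the block exits. The sketch below illustrates only that general Java pattern. StandInReader, its lookup method, and TryWithResourcesSketch are hypothetical stand-ins invented for this example, not part of Nutch, so that the code compiles without Nutch or Hadoop on the classpath.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for a Closeable reader; it is NOT CrawlDbReader.
class StandInReader implements Closeable {
    private boolean closed = false;

    // Hypothetical lookup method that fails once the reader is closed.
    String lookup(String url) throws IOException {
        if (closed) {
            throw new IOException("reader already closed");
        }
        return "record for " + url;
    }

    // Declared without a throws clause, like close() in this class.
    @Override
    public void close() {
        closed = true; // a real reader would release its resources here
    }

    boolean isClosed() {
        return closed;
    }
}

class TryWithResourcesSketch {
    // Uses the stand-in once inside try-with-resources, then returns it
    // so the caller can check that close() was invoked.
    static StandInReader useOnce(String url) throws IOException {
        StandInReader reader = new StandInReader();
        try (StandInReader r = reader) {
            r.lookup(url);
        } // close() is called here, even if lookup() throws
        return reader;
    }

    public static void main(String[] args) throws IOException {
        StandInReader reader = useOnce("http://example.com/");
        System.out.println("closed after block: " + reader.isClosed());
    }
}
```

The same try-with-resources shape should apply to any Closeable, so a caller would not need an explicit finally block to release the reader.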
-
processStatJob
public void processStatJob(String crawlDb, Configuration config, boolean sort) throws IOException, InterruptedException, ClassNotFoundException
-
get
public CrawlDatum get(String crawlDb, String url, Configuration config) throws IOException
- Throws:
IOException
-
process
protected int process(String line, StringBuilder output) throws Exception
- Specified by:
process in class AbstractChecker
- Throws:
Exception
-
readUrl
public void readUrl(String crawlDb, String url, Configuration config, StringBuilder output) throws IOException
- Throws:
IOException
-
processDumpJob
public void processDumpJob(String crawlDb, String output, Configuration config, String format, String regex, String status, Integer retry, String expr, Float sample) throws IOException, ClassNotFoundException, InterruptedException
-
processTopNJob
public void processTopNJob(String crawlDb, long topN, float min, String output, Configuration config) throws IOException, ClassNotFoundException, InterruptedException
-
run
public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException, Exception
- Specified by:
run in interface Tool
- Throws:
IOException
InterruptedException
ClassNotFoundException
Exception
-
-