Package org.apache.nutch.parse
Class ParserChecker
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.AbstractChecker
      - org.apache.nutch.parse.ParserChecker
All Implemented Interfaces:
Configurable, Tool
public class ParserChecker extends AbstractChecker
Parser checker, useful for testing parsers. It also accurately reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data from any URL:
- contentType: The content type of the URL.
- signature: A digest used to identify pages (like a unique ID) and to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
- Version: From ParseData.
- Status: From ParseData.
- Title: of the URL.
- Outlinks: associated with the URL.
- Content Metadata: such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
- Parse Metadata: such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
- ParseText: The page parse text, which varies in length depending on the content.length configuration.
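To illustrate how a digest-based signature can act as a page ID for deduplication, here is a minimal, self-contained sketch. It is not the code of Nutch's MD5Signature or TextProfileSignature: the class SignatureSketch and its methods md5Hex and isDuplicate are hypothetical names introduced only for this example, and hashing the raw content bytes is an assumption about the input.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    // Hypothetical helper: hex-encoded MD5 digest of the given content bytes.
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: two pages count as duplicates when their digests match.
    public static boolean isDuplicate(byte[] a, byte[] b) {
        return md5Hex(a).equals(md5Hex(b));
    }

    public static void main(String[] args) {
        byte[] pageA = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] pageB = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] pageC = "hello!".getBytes(StandardCharsets.UTF_8);

        System.out.println(md5Hex(pageA));             // 5d41402abc4b2a76b9719d911017c592
        System.out.println(isDuplicate(pageA, pageB)); // true
        System.out.println(isDuplicate(pageA, pageC)); // false
    }
}
```

A digest over the exact bytes only catches byte-identical pages. A profile-style signature is presumably meant to tolerate small differences between near-identical pages, which this sketch does not attempt.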
-
Field Summary
Fields (Modifier and Type, Field):
- protected boolean checkRobotsTxt
- protected boolean dumpText
- protected boolean followRedirects
- protected String forceAsContentType
- protected HashMap<String,String> metadata
- protected URLNormalizers normalizers
-
Fields inherited from class org.apache.nutch.util.AbstractChecker
keepClientCnxOpen, stdin, tcpPort, usage
-
-
Constructor Summary
Constructors:
- ParserChecker()
-
Method Summary
Methods (Modifier and Type, Method):
- static void main(String[] args)
- protected int process(String url, StringBuilder output)
- int run(String[] args)
-
Methods inherited from class org.apache.nutch.util.AbstractChecker
getProtocolOutput, parseArgs, processSingle, processStdin, processTCP, run
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
normalizers
protected URLNormalizers normalizers
-
dumpText
protected boolean dumpText
-
followRedirects
protected boolean followRedirects
-
checkRobotsTxt
protected boolean checkRobotsTxt
-
forceAsContentType
protected String forceAsContentType
-
-
Method Detail
-
process
protected int process(String url, StringBuilder output) throws Exception
Specified by:
process in class AbstractChecker
Throws:
Exception
-
-