Package org.apache.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
    org.apache.nutch.analysis.lang.HTMLLanguageParser
All Implemented Interfaces:
Configurable, HtmlParseFilter, Pluggable
public class HTMLLanguageParser extends Object implements HtmlParseFilter
Field Summary
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
Constructor Summary
Constructor             Description
HTMLLanguageParser()
-
Method Summary
Modifier and Type   Method
ParseResult         filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
                    Scan the HTML document looking at possible indications of content language
Configuration       getConf()
void                setConf(Configuration conf)
-
Method Detail
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Scan the HTML document for possible indications of the content language:
1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
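The three indications listed above can be illustrated with a small standalone sketch. It is not the Nutch implementation: the class LangHintSketch and its methods detect and parse are hypothetical names, it uses only the JDK's DOM parser on well-formed XHTML rather than Nutch's Content, ParseResult, or HTMLMetaTags types, and it assumes the numbered list is a priority order, which this page does not state.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.analysis.lang.
class LangHintSketch {

  /** Returns the first language hint found, or null if the document has none. */
  public static String detect(Document doc) {
    // 1. lang attribute on the root html element
    Element root = doc.getDocumentElement();
    if (root != null && "html".equalsIgnoreCase(root.getTagName())) {
      String lang = root.getAttribute("lang").trim(); // "" when absent
      if (!lang.isEmpty()) {
        return lang;
      }
    }
    // 2. and 3. collect both meta hints in one pass over the meta elements
    String dcLanguage = null;
    String contentLanguage = null;
    NodeList metas = doc.getElementsByTagName("meta");
    for (int i = 0; i < metas.getLength(); i++) {
      Element meta = (Element) metas.item(i);
      String content = meta.getAttribute("content").trim();
      if (content.isEmpty()) {
        continue;
      }
      if (dcLanguage == null
          && "dc.language".equalsIgnoreCase(meta.getAttribute("name"))) {
        dcLanguage = content;
      }
      if (contentLanguage == null
          && "content-language".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
        contentLanguage = content;
      }
    }
    // Assumed priority: dc.language before http-equiv content-language
    return dcLanguage != null ? dcLanguage : contentLanguage;
  }

  /** Parses a well-formed XHTML string into a DOM document. */
  public static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html lang=\"en\"><head>"
        + "<meta name=\"dc.language\" content=\"fr\"/>"
        + "</head><body/></html>";
    System.out.println(detect(parse(page))); // prints "en"
  }
}
```

In this sketch the html lang attribute wins when present, so the dc.language value "fr" in the usage example is ignored. A real filter would more likely record the hint in the parse metadata than return it directly.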
Specified by:
filter in interface HtmlParseFilter
Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed in the filtering process
Returns:
a filtered ParseResult
See Also:
Parser.getParse(Content)
-
setConf
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Specified by:
getConf in interface Configurable