Uses of Class
org.apache.nutch.protocol.Content
Packages that use Content
Package    Description
org.apache.nutch.analysis.lang    Text document language identifier.
org.apache.nutch.crawl    Crawl control code and tools to run the crawler.
org.apache.nutch.microformats.reltag    A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.parse    The Parse interface and related classes.
org.apache.nutch.parse.ext    Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed    Parse RSS feeds.
org.apache.nutch.parse.headings    Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.html    An HTML document parsing plugin.
org.apache.nutch.parse.js    Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags    Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.tika    Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip    Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.parsefilter.debug    Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
org.apache.nutch.parsefilter.naivebayes    HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the parseText's relevancy (using a training file in which you can give positive and negative example texts; see the description of parsefilter.naivebayes.trainfile). A link found irrelevant gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex    RegexParseFilter.
org.apache.nutch.protocol    Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file    Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp    Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http.api    Common API used by HTTP plugins (http, httpclient, etc.)
org.apache.nutch.scoring    The ScoringFilter interface.
org.apache.nutch.scoring.depth    Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link    Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.metadata    Metadata Scoring Plugin
org.apache.nutch.scoring.opic    Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.similarity
org.apache.nutch.scoring.similarity.cosine    Implements the cosine similarity metric for scoring relevant documents
org.apache.nutch.scoring.urlmeta    URL Meta Tag Scoring Plugin
org.apache.nutch.segment    A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.tools    Miscellaneous tools.
org.apache.nutch.util    Miscellaneous utility classes.
org.creativecommons.nutch    Sample plugins that parse and index Creative Commons metadata.
Uses of Content in org.apache.nutch.analysis.lang
Methods in org.apache.nutch.analysis.lang with parameters of type Content
Modifier and Type    Method    Description
ParseResult    HTMLLanguageParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Scan the HTML document looking at possible indications of content language
Uses of Content in org.apache.nutch.crawl
Methods in org.apache.nutch.crawl with parameters of type Content
Modifier and Type    Method    Description
byte[]    MD5Signature.calculate(Content content, Parse parse)
abstract byte[]    Signature.calculate(Content content, Parse parse)
byte[]    TextMD5Signature.calculate(Content content, Parse parse)
byte[]    TextProfileSignature.calculate(Content content, Parse parse)
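As an illustration of what a calculate(...) method returning byte[] might do, the following minimal sketch computes an MD5 digest over a page's raw bytes. It does not use the Nutch classes: FetchedPage is a hypothetical stand-in for Content, and hashing only the raw bytes is an assumption about the general technique, not a statement of how MD5Signature is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignatureSketch {

    /** Hypothetical stand-in for a fetched page: a URL plus its raw bytes. */
    static final class FetchedPage {
        final String url;
        final byte[] rawBytes;

        FetchedPage(String url, byte[] rawBytes) {
            this.url = url;
            this.rawBytes = rawBytes;
        }
    }

    /** Returns the 16-byte MD5 digest of the page's raw bytes. */
    static byte[] calculate(FetchedPage page) {
        try {
            return MessageDigest.getInstance("MD5").digest(page.rawBytes);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Renders a digest as lowercase hex for display or comparison. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FetchedPage page = new FetchedPage(
                "http://example.com/", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(toHex(calculate(page)));
    }
}
```

Two pages with byte-identical content yield the same signature under this scheme, which is what makes such a digest usable for duplicate detection.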
Uses of Content in org.apache.nutch.microformats.reltag
Methods in org.apache.nutch.microformats.reltag with parameters of type Content
Modifier and Type    Method    Description
ParseResult    RelTagParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Scan the HTML document looking at possible rel-tags
Uses of Content in org.apache.nutch.parse
Methods in org.apache.nutch.parse with parameters of type Content
Modifier and Type    Method    Description
ParseResult    HtmlParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
ParseResult    HtmlParseFilters.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Run all defined filters.
ParseResult    Parser.getParse(Content c)    This method parses the given content and returns a map of <key, parse> pairs.
static boolean    ParseSegment.isTruncated(Content content)    Checks if the page's content is truncated.
void    ParseSegment.ParseSegmentMapper.map(WritableComparable<?> key, Content content, Mapper.Context context)
ParseResult    ParseUtil.parse(Content content)
ParseResult    ParseUtil.parseByExtensionId(String extId, Content content)
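A truncation check such as isTruncated(...) can be pictured as comparing the length a server declared with the number of bytes actually stored. The sketch below is one plausible form of that idea under stated assumptions: the declared length arrives as a header string, and a missing or unparsable value is treated as "not truncated". The class and parameter names are hypothetical and are not the Nutch API.

```java
public class TruncationCheckSketch {

    /**
     * Returns true when fewer bytes were stored than the declared length.
     * A missing or malformed declared length is treated as "not truncated",
     * because there is nothing to compare against.
     */
    static boolean isTruncated(String declaredLength, byte[] body) {
        if (declaredLength == null || body == null) {
            return false;
        }
        long declared;
        try {
            declared = Long.parseLong(declaredLength.trim());
        } catch (NumberFormatException e) {
            return false;
        }
        return body.length < declared;
    }

    public static void main(String[] args) {
        byte[] body = new byte[100];
        System.out.println(isTruncated("250", body)); // fewer bytes than declared
        System.out.println(isTruncated("100", body)); // lengths match
    }
}
```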
Uses of Content in org.apache.nutch.parse.ext
Methods in org.apache.nutch.parse.ext with parameters of type Content
Modifier and Type    Method    Description
ParseResult    ExtParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.feed
Methods in org.apache.nutch.parse.feed with parameters of type Content
Modifier and Type    Method    Description
ParseResult    FeedParser.getParse(Content content)    Parses the given feed and extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
Uses of Content in org.apache.nutch.parse.headings
Methods in org.apache.nutch.parse.headings with parameters of type Content
Modifier and Type    Method    Description
ParseResult    HeadingsParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Uses of Content in org.apache.nutch.parse.html
Methods in org.apache.nutch.parse.html with parameters of type Content
Modifier and Type    Method    Description
ParseResult    HtmlParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.js
Methods in org.apache.nutch.parse.js with parameters of type Content
Modifier and Type    Method    Description
ParseResult    JSParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Scan the JavaScript fragments of an HTML page looking for possible Outlinks
ParseResult    JSParseFilter.getParse(Content c)    Parse a JavaScript file and extract outlinks
Uses of Content in org.apache.nutch.parse.metatags
Methods in org.apache.nutch.parse.metatags with parameters of type Content
Modifier and Type    Method    Description
ParseResult    MetaTagsParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Uses of Content in org.apache.nutch.parse.tika
Methods in org.apache.nutch.parse.tika with parameters of type Content
Modifier and Type    Method    Description
ParseResult    TikaParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.zip
Methods in org.apache.nutch.parse.zip with parameters of type Content
Modifier and Type    Method    Description
ParseResult    ZipParser.getParse(Content content)
Uses of Content in org.apache.nutch.parsefilter.debug
Methods in org.apache.nutch.parsefilter.debug with parameters of type Content
Modifier and Type    Method    Description
ParseResult    DebugParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Uses of Content in org.apache.nutch.parsefilter.naivebayes
Methods in org.apache.nutch.parsefilter.naivebayes with parameters of type Content
Modifier and Type    Method    Description
ParseResult    NaiveBayesParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Uses of Content in org.apache.nutch.parsefilter.regex
Methods in org.apache.nutch.parsefilter.regex with parameters of type Content
Modifier and Type    Method    Description
ParseResult    RegexParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Uses of Content in org.apache.nutch.protocol
Methods in org.apache.nutch.protocol that return Content
Modifier and Type    Method    Description
Content    ProtocolOutput.getContent()
static Content    Content.read(DataInput in)

Methods in org.apache.nutch.protocol with parameters of type Content
Modifier and Type    Method    Description
void    ProtocolOutput.setContent(Content content)

Method parameters in org.apache.nutch.protocol with type arguments of type Content
Modifier and Type    Method    Description
crawlercommons.robots.BaseRobotRules    Protocol.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)    Retrieve robot rules applicable for this URL.
abstract crawlercommons.robots.BaseRobotRules    RobotRulesParser.getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)    Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).
crawlercommons.robots.BaseRobotRules    RobotRulesParser.getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)    Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).

Constructors in org.apache.nutch.protocol with parameters of type Content
Constructor    Description
ProtocolOutput(Content content)
ProtocolOutput(Content content, ProtocolStatus status)
Uses of Content in org.apache.nutch.protocol.file
Methods in org.apache.nutch.protocol.file that return Content
Modifier and Type    Method    Description
Content    FileResponse.toContent()

Method parameters in org.apache.nutch.protocol.file with type arguments of type Content
Modifier and Type    Method    Description
crawlercommons.robots.BaseRobotRules    File.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)    No robots parsing is done for file protocol.
Uses of Content in org.apache.nutch.protocol.ftp
Methods in org.apache.nutch.protocol.ftp that return Content
Modifier and Type    Method    Description
Content    FtpResponse.toContent()

Method parameters in org.apache.nutch.protocol.ftp with type arguments of type Content
Modifier and Type    Method    Description
crawlercommons.robots.BaseRobotRules    Ftp.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)    Get the robots rules for a given url
crawlercommons.robots.BaseRobotRules    FtpRobotRulesParser.getRobotRulesSet(Protocol ftp, URL url, List<Content> robotsTxtContent)    For hosts whose robots rules are not yet cached, it sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules and caches the rules object to avoid re-work in future.
Uses of Content in org.apache.nutch.protocol.http.api
Method parameters in org.apache.nutch.protocol.http.api with type arguments of type Content
Modifier and Type    Method    Description
protected void    HttpRobotRulesParser.addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)    Append Content of robots.txt to robotsTxtContent
crawlercommons.robots.BaseRobotRules    HttpBase.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
crawlercommons.robots.BaseRobotRules    HttpRobotRulesParser.getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)    Get the rules from robots.txt which apply to the given url.
Uses of Content in org.apache.nutch.scoring
Methods in org.apache.nutch.scoring with parameters of type Content
Modifier and Type    Method    Description
void    AbstractScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)
void    ScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)    Currently a part of score distribution is performed using only data coming from the parsing process.
void    ScoringFilters.passScoreAfterParsing(Text url, Content content, Parse parse)
void    AbstractScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
void    ScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)    This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
void    ScoringFilters.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Uses of Content in org.apache.nutch.scoring.depth
Methods in org.apache.nutch.scoring.depth with parameters of type Content
Modifier and Type    Method    Description
void    DepthScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)
void    DepthScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Uses of Content in org.apache.nutch.scoring.link
Methods in org.apache.nutch.scoring.link with parameters of type Content
Modifier and Type    Method    Description
void    LinkAnalysisScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)
void    LinkAnalysisScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Uses of Content in org.apache.nutch.scoring.metadata
Methods in org.apache.nutch.scoring.metadata with parameters of type Content
Modifier and Type    Method    Description
void    MetadataScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)    Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
void    MetadataScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)    Takes the metadata, specified in your "scoring.db.md" property, from the datum object and injects it into the content.
Uses of Content in org.apache.nutch.scoring.opic
Methods in org.apache.nutch.scoring.opic with parameters of type Content
Modifier and Type    Method    Description
void    OPICScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)    Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
void    OPICScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)    Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
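The two descriptions above amount to a hand-off: a score is written into the content's metadata before parsing and copied into the parse metadata afterwards. The sketch below models that flow with plain maps. The maps stand in for the Nutch metadata objects, and the key string "example.score" is a hypothetical placeholder, not the value of Fetcher.SCORE_KEY.

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreHandoffSketch {

    // Hypothetical metadata key, used only for this illustration.
    static final String SCORE_KEY = "example.score";

    /** Before parsing: store the datum's score as a string in the content metadata. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** After parsing: copy the stored score, if any, into the parse metadata. */
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta.get(SCORE_KEY));
    }
}
```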
Uses of Content in org.apache.nutch.scoring.similarity
Methods in org.apache.nutch.scoring.similarity with parameters of type Content
Modifier and Type    Method    Description
void    SimilarityScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)
float    SimilarityModel.setURLScoreAfterParsing(Text url, Content content, Parse parse)
Uses of Content in org.apache.nutch.scoring.similarity.cosine
Methods in org.apache.nutch.scoring.similarity.cosine with parameters of type Content
Modifier and Type    Method    Description
float    CosineSimilarity.setURLScoreAfterParsing(Text url, Content content, Parse parse)
Uses of Content in org.apache.nutch.scoring.urlmeta
Methods in org.apache.nutch.scoring.urlmeta with parameters of type Content
Modifier and Type    Method    Description
void    URLMetaScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse)    Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
void    URLMetaScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)    Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content.
Uses of Content in org.apache.nutch.segment
Methods in org.apache.nutch.segment with parameters of type Content
Modifier and Type    Method    Description
boolean    SegmentMergeFilter.filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)    The filtering method which gets all information being merged for a given key (URL).
boolean    SegmentMergeFilters.filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)    Iterates over all SegmentMergeFilter extensions and if any of them returns false, it will return false as well.
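The "any filter returns false, so the chain returns false" behaviour described for SegmentMergeFilters can be sketched as a short-circuiting loop. The example below is a simplified illustration: each filter is modelled as a Predicate over a URL key only, whereas the real method takes the full set of merged records. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class MergeFilterChainSketch {

    /**
     * Runs every filter against the key and stops at the first rejection.
     * An empty filter list accepts everything.
     */
    static boolean filter(List<Predicate<String>> filters, String key) {
        for (Predicate<String> f : filters) {
            if (!f.test(key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Predicate<String> httpsOnly = key -> key.startsWith("https://");
        Predicate<String> noQuery = key -> !key.contains("?");
        List<Predicate<String>> chain = Arrays.asList(httpsOnly, noQuery);

        System.out.println(filter(chain, "https://example.com/page"));
        System.out.println(filter(chain, "https://example.com/page?id=1"));
    }
}
```

With these two filters, the first key passes both checks and is kept, while the second is rejected by the query-string filter.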
Uses of Content in org.apache.nutch.tools
Fields in org.apache.nutch.tools declared as Content
Modifier and Type    Field    Description
protected Content    AbstractCommonCrawlFormat.content

Methods in org.apache.nutch.tools with parameters of type Content
Modifier and Type    Method    Description
static CommonCrawlFormat    CommonCrawlFormatFactory.getCommonCrawlFormat(String formatType, String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)    Deprecated.
String    AbstractCommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata)
String    AbstractCommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata, ParseData parseData)
String    CommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata)    Returns a string representation of the JSON structure of the URL content.
String    CommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata, ParseData parseData)    Returns a string representation of the JSON structure of the URL content.
String    CommonCrawlFormatWARC.getJsonData(String url, Content content, Metadata metadata, ParseData parseData)

Constructors in org.apache.nutch.tools with parameters of type Content
Constructor    Description
AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
CommonCrawlFormatJackson(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
CommonCrawlFormatJettinson(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
CommonCrawlFormatSimple(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData)
Uses of Content in org.apache.nutch.util
Methods in org.apache.nutch.util with parameters of type Content
Modifier and Type    Method    Description
void    EncodingDetector.autoDetectClues(Content content, boolean filter)
String    EncodingDetector.guessEncoding(Content content, String defaultValue)    Guess the encoding with the previously specified list of clues.
Uses of Content in org.creativecommons.nutch
Methods in org.creativecommons.nutch with parameters of type Content
Modifier and Type    Method    Description
ParseResult    CCParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.