Package org.apache.nutch.protocol
Class RobotRulesParser
- java.lang.Object
-
- org.apache.nutch.protocol.RobotRulesParser
-
- All Implemented Interfaces:
Configurable, Tool
- Direct Known Subclasses:
FtpRobotRulesParser, HttpRobotRulesParser
public abstract class RobotRulesParser extends Object implements Tool
This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions as described in SimpleRobotRulesParser. Protocol-specific implementations have to implement the abstract method getRobotRulesSet(org.apache.nutch.protocol.Protocol, java.net.URL, java.util.List<org.apache.nutch.protocol.Content>).
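A minimal sketch of such a protocol-specific subclass is shown below. The class name and the inlined robots.txt payload are hypothetical; a real implementation (such as HttpRobotRulesParser) fetches robots.txt through the given Protocol plugin before delegating to parseRules(...).

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;

import crawlercommons.robots.BaseRobotRules;

public class ExampleRobotRulesParser extends RobotRulesParser {

  public ExampleRobotRulesParser(Configuration conf) {
    super(conf);
  }

  @Override
  public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
      List<Content> robotsTxtContent) {
    // Cache one rule set per protocol and host so robots.txt is parsed only once.
    String cacheKey = url.getProtocol() + ":" + url.getHost();
    BaseRobotRules rules = CACHE.get(cacheKey);
    if (rules == null) {
      // Placeholder payload: a real implementation fetches
      // <scheme>://<host>/robots.txt via the Protocol plugin and passes the
      // response bytes and content type here.
      byte[] robotsTxt = "User-agent: *\nDisallow: /private/"
          .getBytes(StandardCharsets.UTF_8);
      String robotsUrl = url.getProtocol() + "://" + url.getHost() + "/robots.txt";
      rules = parseRules(robotsUrl, robotsTxt, "text/plain", agentNames);
      CACHE.put(cacheKey, rules);
    }
    return rules;
  }
}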
-
-
Field Summary
Fields
- protected Set<String> agentNames
- protected Set<String> allowList - set of host names or IPs to be explicitly excluded from robots.txt checking
- protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
- protected Configuration conf
- static crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES - A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code.
- static crawlercommons.robots.BaseRobotRules EMPTY_RULES - A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
- static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES - A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
- protected int maxNumRedirects
-
Constructor Summary
Constructors
- RobotRulesParser()
- RobotRulesParser(Configuration conf)
-
Method Summary
Methods
- Configuration getConf() - Get the Configuration object
- abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent) - Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent) - Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- boolean isAllowListed(URL url) - Check whether a URL belongs to an allowlisted host.
- static void main(String[] args)
- crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName) - Deprecated.
- crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, Collection<String> robotNames) - Parses the robots content using the SimpleRobotRulesParser from crawler-commons
- int run(String[] args)
- void setConf(Configuration conf) - Set the Configuration object
-
-
-
Field Detail
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
-
DEFER_VISIT_RULES
public static final crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code. The crawler should suspend crawling for a certain (but not too long) time, see property http.robots.503.defer.visits.
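The three canned rule sets can be distinguished by the caller through the crawler-commons API. The sketch below is illustrative only (class and method names are made up) and assumes a crawler-commons version that exposes BaseRobotRules.isDeferVisits() and isAllowed(String).

import crawlercommons.robots.BaseRobotRules;

class RobotRulesDispatch {
  /** Illustrative helper: react to the rule set returned for a URL. */
  static String decide(BaseRobotRules rules, String url) {
    if (rules.isDeferVisits()) {
      // 5xx while fetching robots.txt (DEFER_VISIT_RULES): suspend crawling
      // the host for a while, see property http.robots.503.defer.visits
      return "defer";
    }
    if (!rules.isAllowed(url)) {
      // disallowed; FORBID_ALL_RULES (403 on robots.txt) behaves like this
      // for every URL of the host
      return "skip";
    }
    // allowed; EMPTY_RULES (missing or empty robots.txt) allows everything
    return "fetch";
  }
}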
-
conf
protected Configuration conf
-
maxNumRedirects
protected int maxNumRedirects
-
-
Constructor Detail
-
RobotRulesParser
public RobotRulesParser()
-
RobotRulesParser
public RobotRulesParser(Configuration conf)
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
Set the Configuration object
Specified by:
- setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object
Specified by:
- getConf in interface Configurable
-
isAllowListed
public boolean isAllowListed(URL url)
Check whether a URL belongs to an allowlisted host.
Parameters:
- url - a URL to check against the rules
Returns:
- true if always allowed (robots.txt rules are ignored), false otherwise
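A short usage sketch (the helper name is hypothetical): hosts on the configured allow list bypass robots.txt checking entirely, so a caller may skip the rules lookup for them.

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.nutch.protocol.RobotRulesParser;

class AllowListCheck {
  /** Hypothetical helper: true if robots.txt rules can be ignored for this URL. */
  static boolean skipRobotsCheck(RobotRulesParser parser, String pageUrl)
      throws MalformedURLException {
    return parser.isAllowListed(new URL(pageUrl));
  }
}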
-
parseRules
@Deprecated public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Deprecated.
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
Parameters:
- url - The robots.txt URL
- content - Contents of the robots file in a byte array
- contentType - The content type of the robots file
- robotName - A string containing all the robot agent names used by the parser for matching
Returns:
- BaseRobotRules object
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, Collection<String> robotNames)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
Parameters:
- url - The robots.txt URL
- content - Contents of the robots file in a byte array
- contentType - The content type of the robots file
- robotNames - A collection containing all the robot agent names used by the parser for matching
Returns:
- BaseRobotRules object
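For illustration, this overload can be exercised directly on an in-memory robots.txt payload, e.g. in a unit test. The sketch below uses an anonymous stub subclass because RobotRulesParser is abstract; the agent names, URL, and robots.txt content are made up.

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class ParseRulesExample {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "mybot"); // an agent name is expected to be configured

    // Anonymous stub subclass: only the concrete parseRules(...) is used here.
    RobotRulesParser parser = new RobotRulesParser(conf) {
      @Override
      public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
          List<Content> robotsTxtContent) {
        return EMPTY_RULES;
      }
    };

    byte[] robotsTxt = String.join("\n",
        "User-agent: mybot",
        "Disallow: /private/",
        "Crawl-delay: 5").getBytes(StandardCharsets.UTF_8);

    BaseRobotRules rules = parser.parseRules("https://example.com/robots.txt",
        robotsTxt, "text/plain", Arrays.asList("mybot", "nutch"));

    System.out.println(rules.isAllowed("https://example.com/private/a.html")); // false
    System.out.println(rules.isAllowed("https://example.com/public/a.html"));  // true
    System.out.println(rules.getCrawlDelay()); // Crawl-delay as parsed by crawler-commons
  }
}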
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
Parameters:
- protocol - Protocol
- url - URL to check
- robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
Returns:
- robot rules (specific for this URL or default), never null
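A hedged usage sketch of this overload (the helper name is made up; parser and protocol are assumed to be a concrete implementation and its matching Protocol plugin): the robotsTxtContent list collects the raw response(s) seen while fetching robots.txt, e.g. redirects or an error page, which can be kept for debugging or archiving.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;

import crawlercommons.robots.BaseRobotRules;

class RobotsAudit {
  /** Illustrative helper: check a URL and keep the raw robots.txt responses. */
  static boolean isFetchAllowed(RobotRulesParser parser, Protocol protocol,
      String pageUrl) {
    List<Content> robotsResponses = new ArrayList<>();
    BaseRobotRules rules = parser.getRobotRulesSet(protocol, new Text(pageUrl),
        robotsResponses);

    // robotsResponses now holds whatever was fetched for robots.txt
    // (possibly redirects or an error page).
    for (Content c : robotsResponses) {
      System.out.println("robots.txt response: " + c.getUrl()
          + " (" + c.getContentType() + ")");
    }
    return rules.isAllowed(pageUrl);
  }
}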
-
getRobotRulesSet
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
Parameters:
- protocol - Protocol
- url - URL to check
- robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
Returns:
- robot rules (specific for this URL or default), never null
-
-