Class HttpBase
- java.lang.Object
  - org.apache.nutch.protocol.http.api.HttpBase
Field Summary
Fields (Modifier and Type, Field, Description):
- protected String accept: The "Accept" request header value.
- protected String acceptCharset: The "Accept-Charset" request header value.
- protected String acceptLanguage: The "Accept-Language" request header value.
- static int BUFFER_SIZE
- static Text COOKIE
- protected boolean enableCookieHeader: Controls whether or not to set Cookie HTTP header based on CrawlDatum metadata
- protected boolean enableIfModifiedsinceHeader: Configuration directive for If-Modified-Since HTTP header
- protected int maxContent: The length limit for downloaded content, in bytes.
- protected long maxCrawlDelay: Skip page if Crawl-Delay longer than this value.
- protected int maxDuration: The time limit to download the entire content, in seconds.
- protected boolean partialAsTruncated: Whether to save partial fetches as truncated content.
- protected HashMap<String,String> proxyException: The proxy exception list.
- protected String proxyHost: The proxy hostname.
- protected int proxyPort: The proxy port.
- protected Proxy.Type proxyType: The proxy type.
- static Text RESPONSE_TIME
- protected boolean responseTime: Record response time in CrawlDatum's metadata, see property http.store.responsetime.
- protected boolean storeHttpHeaders: Record the HTTP response header in the metadata, see property store.http.headers.
- protected boolean storeHttpRequest: Record the HTTP request in the metadata, see property store.http.request.
- protected boolean storeIPAddress: Record the IP address of the responding server, see property store.ip.address.
- protected int timeout: The network timeout in milliseconds.
- protected boolean tlsCheckCertificate: Whether to check TLS/SSL certificates
- protected Set<String> tlsPreferredCipherSuites: Which TLS/SSL cipher suites to support
- protected Set<String> tlsPreferredProtocols: Which TLS/SSL protocols to support
- protected boolean useHttp11: Do we use HTTP/1.1?
- protected boolean useHttp2: Whether to use HTTP/2
- protected boolean useProxy: Indicates if a proxy is used
- protected String userAgent: The Nutch 'User-Agent' request header
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
-
Method Summary
Methods (Modifier and Type, Method, Description):
- String getAccept()
- String getAcceptCharset()
- String getAcceptLanguage(): Value of "Accept-Language" request header sent by Nutch.
- Configuration getConf()
- String getCookie(URL url): If per-host cookies are configured, this method looks up the cookie for the given url.
- int getMaxContent()
- int getMaxDuration(): The time limit to download the entire content, in seconds.
- ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum): Get the ProtocolOutput for a given url and crawldatum
- String getProxyHost()
- int getProxyPort()
- protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects)
- crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent): Retrieve robot rules applicable for this URL.
- int getTimeout()
- Set<String> getTlsPreferredCipherSuites()
- Set<String> getTlsPreferredProtocols()
- boolean getUseHttp11()
- String getUserAgent()
- boolean isCookieEnabled()
- boolean isIfModifiedSinceEnabled()
- boolean isStoreHttpHeaders()
- boolean isStoreHttpRequest()
- boolean isStoreIPAddress()
- boolean isStorePartialAsTruncated(): Whether to save partial fetches as truncated content.
- boolean isTlsCheckCertificates()
- protected void logConf()
- protected static void main(HttpBase http, String[] args)
- byte[] processDeflateEncoded(byte[] compressed, URL url)
- byte[] processGzipEncoded(byte[] compressed, URL url)
- void setConf(Configuration conf)
- boolean useProxy(String host)
- boolean useProxy(URI uri)
- boolean useProxy(URL url)
-
-
-
Field Detail
-
RESPONSE_TIME
public static final Text RESPONSE_TIME
-
COOKIE
public static final Text COOKIE
-
BUFFER_SIZE
public static final int BUFFER_SIZE
- See Also:
- Constant Field Values
-
proxyHost
protected String proxyHost
The proxy hostname.
-
proxyPort
protected int proxyPort
The proxy port.
-
proxyType
protected Proxy.Type proxyType
The proxy type.
-
useProxy
protected boolean useProxy
Indicates if a proxy is used
-
timeout
protected int timeout
The network timeout in milliseconds.
-
maxContent
protected int maxContent
The length limit for downloaded content, in bytes.
-
maxDuration
protected int maxDuration
The time limit to download the entire content, in seconds.
-
partialAsTruncated
protected boolean partialAsTruncated
Whether to save partial fetches as truncated content.
-
userAgent
protected String userAgent
The Nutch 'User-Agent' request header
-
acceptLanguage
protected String acceptLanguage
The "Accept-Language" request header value.
-
acceptCharset
protected String acceptCharset
The "Accept-Charset" request header value.
-
accept
protected String accept
The "Accept" request header value.
-
useHttp11
protected boolean useHttp11
Do we use HTTP/1.1?
-
useHttp2
protected boolean useHttp2
Whether to use HTTP/2
-
responseTime
protected boolean responseTime
Record response time in CrawlDatum's metadata, see property http.store.responsetime.
-
storeIPAddress
protected boolean storeIPAddress
Record the IP address of the responding server, see property store.ip.address.
-
storeHttpRequest
protected boolean storeHttpRequest
Record the HTTP request in the metadata, see property store.http.request.
-
storeHttpHeaders
protected boolean storeHttpHeaders
Record the HTTP response header in the metadata, see property store.http.headers.
-
maxCrawlDelay
protected long maxCrawlDelay
Skip page if Crawl-Delay longer than this value.
-
tlsCheckCertificate
protected boolean tlsCheckCertificate
Whether to check TLS/SSL certificates
-
tlsPreferredProtocols
protected Set<String> tlsPreferredProtocols
Which TLS/SSL protocols to support
-
tlsPreferredCipherSuites
protected Set<String> tlsPreferredCipherSuites
Which TLS/SSL cipher suites to support
-
enableIfModifiedsinceHeader
protected boolean enableIfModifiedsinceHeader
Configuration directive for If-Modified-Since HTTP header
-
enableCookieHeader
protected boolean enableCookieHeader
Controls whether or not to set Cookie HTTP header based on CrawlDatum metadata
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Description copied from interface: Protocol
Get the ProtocolOutput for a given url and crawldatum
- Specified by:
getProtocolOutput in interface Protocol
- Parameters:
url - canonical url
datum - associated CrawlDatum
- Returns:
- the ProtocolOutput
-
getProxyHost
public String getProxyHost()
-
getProxyPort
public int getProxyPort()
-
useProxy
public boolean useProxy(URL url)
-
useProxy
public boolean useProxy(URI uri)
-
useProxy
public boolean useProxy(String host)
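The three useProxy overloads carry no description on this page. The sketch below shows one plausible reading, under the assumption that a host listed in proxyException bypasses the proxy even when useProxy is enabled. ProxyDecisionSketch is a hypothetical stand-in written for illustration; it is not the Nutch class and only mirrors the field names documented above.

```java
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in (not the Nutch class) mirroring the useProxy and proxyException fields. */
class ProxyDecisionSketch {
    private final boolean useProxy;
    private final Map<String, String> proxyException = new HashMap<>();

    ProxyDecisionSketch(boolean useProxy, String... exceptionHosts) {
        this.useProxy = useProxy;
        for (String host : exceptionHosts) {
            proxyException.put(host, host);
        }
    }

    /** Assumption: a host on the exception list is contacted directly. */
    boolean useProxy(String host) {
        if (useProxy && proxyException.containsKey(host)) {
            return false;
        }
        return useProxy;
    }

    /** The URL and URI variants reduce to the host-based check. */
    boolean useProxy(URL url) {
        return useProxy(url.getHost());
    }

    boolean useProxy(URI uri) {
        return useProxy(uri.getHost());
    }

    public static void main(String[] args) {
        ProxyDecisionSketch sketch = new ProxyDecisionSketch(true, "localhost");
        System.out.println(sketch.useProxy("localhost"));
        System.out.println(sketch.useProxy(URI.create("https://example.org/page")));
    }
}
```

Routing all overloads through the host-based check keeps the exception lookup in one place.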
-
getTimeout
public int getTimeout()
-
isIfModifiedSinceEnabled
public boolean isIfModifiedSinceEnabled()
-
isCookieEnabled
public boolean isCookieEnabled()
-
isStoreIPAddress
public boolean isStoreIPAddress()
-
isStoreHttpRequest
public boolean isStoreHttpRequest()
-
isStoreHttpHeaders
public boolean isStoreHttpHeaders()
-
getMaxContent
public int getMaxContent()
-
getMaxDuration
public int getMaxDuration()
The time limit to download the entire content, in seconds. See the property http.time.limit.
- Returns:
- the maximum duration
-
isStorePartialAsTruncated
public boolean isStorePartialAsTruncated()
Whether to save partial fetches as truncated content, cf. the property http.partial.truncated.
- Returns:
- true if partially fetched truncated content is stored
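How the content limit, the time limit, and the truncation flag could interact is sketched below. This is an illustration only, assuming both limits are positive; LimitedReadSketch and its Result type are hypothetical names, and the code is not taken from Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Nutch): applies a byte limit and a time limit while reading. */
class LimitedReadSketch {

    /** The bytes that were kept, and whether a limit cut the read short. */
    static final class Result {
        final byte[] content;
        final boolean truncated;

        Result(byte[] content, boolean truncated) {
            this.content = content;
            this.truncated = truncated;
        }
    }

    static Result read(InputStream in, int maxContent, int maxDurationSeconds) throws IOException {
        long deadline = System.nanoTime() + maxDurationSeconds * 1_000_000_000L;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (true) {
            int room = maxContent - out.size();
            if (room <= 0) {
                // Byte limit reached: the content is truncated only if more data remains.
                return new Result(out.toByteArray(), in.read() != -1);
            }
            if (System.nanoTime() - deadline >= 0) {
                // Time limit reached. The check runs between reads, so a single
                // blocking read can still overrun the deadline.
                return new Result(out.toByteArray(), true);
            }
            int n = in.read(buffer, 0, Math.min(buffer.length, room));
            if (n == -1) {
                return new Result(out.toByteArray(), false);
            }
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        Result result = read(new ByteArrayInputStream(new byte[100]), 40, 60);
        System.out.println(result.content.length + " bytes, truncated=" + result.truncated);
    }
}
```

A caller could then decide, in the spirit of partialAsTruncated, whether to store a truncated result or discard it.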
-
getUserAgent
public String getUserAgent()
-
getCookie
public String getCookie(URL url)
If per-host cookies are configured, this method looks up the cookie for the given url.
- Parameters:
url - the url to look up a cookie for
- Returns:
- the cookie or null
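A per-host lookup of this kind can be pictured as a map keyed by host name. The sketch below assumes exactly that; HostCookieSketch, its map, and the lower-casing of host names are illustrative choices, not details confirmed by this page.

```java
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in (not the Nutch class): cookies configured per host name. */
class HostCookieSketch {
    private final Map<String, String> hostCookies = new HashMap<>();

    void put(String host, String cookie) {
        hostCookies.put(host.toLowerCase(Locale.ROOT), cookie);
    }

    /** Returns the cookie configured for the URL's host, or null if there is none. */
    String getCookie(URL url) {
        return hostCookies.get(url.getHost().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws Exception {
        HostCookieSketch cookies = new HostCookieSketch();
        cookies.put("example.org", "session=abc");
        System.out.println(cookies.getCookie(URI.create("https://example.org/page").toURL()));
    }
}
```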
-
getAcceptLanguage
public String getAcceptLanguage()
Value of "Accept-Language" request header sent by Nutch.
- Returns:
- The value of the "Accept-Language" header.
-
getAcceptCharset
public String getAcceptCharset()
-
getAccept
public String getAccept()
-
getUseHttp11
public boolean getUseHttp11()
-
isTlsCheckCertificates
public boolean isTlsCheckCertificates()
-
logConf
protected void logConf()
-
processGzipEncoded
public byte[] processGzipEncoded(byte[] compressed, URL url) throws IOException
- Throws:
IOException
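The page does not describe how the gzip-encoded body is decoded. As a rough sketch using only java.util.zip, a decoder might stop once a content limit is reached. GzipDecodeSketch, gunzip, and the maxContent parameter are hypothetical names; whether processGzipEncoded applies such a limit is an assumption here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not part of Nutch): gzip decoding capped at a content limit. */
class GzipDecodeSketch {

    /** Decodes at most maxContent bytes; a corrupt stream surfaces as an IOException. */
    static byte[] gunzip(byte[] compressed, int maxContent) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (out.size() < maxContent) {
                int n = in.read(buffer, 0, Math.min(buffer.length, maxContent - out.size()));
                if (n == -1) {
                    break; // end of the compressed stream
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Compression helper, included only to produce input for the demo. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("hello hello hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(gunzip(packed, 1024), StandardCharsets.UTF_8));
    }
}
```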
-
processDeflateEncoded
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException
- Throws:
IOException
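Bodies labelled "deflate" are in practice sent either zlib-wrapped or as a raw deflate stream. The sketch below tries the zlib form first and falls back to raw deflate. DeflateDecodeSketch is a hypothetical helper built on java.util.zip; it is an assumption that processDeflateEncoded needs to handle both forms, and this is not its implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper (not part of Nutch): inflates zlib-wrapped or raw deflate data. */
class DeflateDecodeSketch {

    static byte[] inflate(byte[] compressed) throws IOException {
        try {
            return inflate(compressed, false); // zlib-wrapped stream
        } catch (DataFormatException zlibFailure) {
            try {
                return inflate(compressed, true); // raw deflate stream
            } catch (DataFormatException rawFailure) {
                throw new IOException("not valid deflate data", rawFailure);
            }
        }
    }

    private static byte[] inflate(byte[] compressed, boolean nowrap) throws DataFormatException {
        Inflater inflater = new Inflater(nowrap);
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // input exhausted before the end marker: keep what was decoded
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    /** Compression helper, included only to produce input for the demo. */
    static byte[] deflate(byte[] raw, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "deflate me, deflate me".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(inflate(deflate(raw, true)), StandardCharsets.UTF_8));
    }
}
```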
-
getResponse
protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects) throws ProtocolException, IOException
- Throws:
ProtocolException
IOException
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Description copied from interface: Protocol
Retrieve robot rules applicable for this URL.
- Specified by:
getRobotRules in interface Protocol
- Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
- Returns:
- robot rules (specific for this URL or default), never null
-
-