Package org.apache.nutch.tools
Class AbstractCommonCrawlFormat
- java.lang.Object
  - org.apache.nutch.tools.AbstractCommonCrawlFormat
- All Implemented Interfaces:
Closeable, AutoCloseable, CommonCrawlFormat
- Direct Known Subclasses:
CommonCrawlFormatJackson, CommonCrawlFormatJettinson, CommonCrawlFormatSimple, CommonCrawlFormatWARC
public abstract class AbstractCommonCrawlFormat extends Object implements CommonCrawlFormat
Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
-
-
Field Summary
Fields (modifier and type, then field name)
protected Configuration conf
protected Content content
protected List<String> inLinks
protected boolean jsonArray
protected String keyPrefix
protected static org.slf4j.Logger LOG
protected Metadata metadata
protected boolean reverseKey
protected String reverseKeyValue
protected boolean simpleDateFormat
protected String url
-
Constructor Summary
Constructors
AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
-
Method Summary
Methods (modifier and type, then method; description indented below where one exists)
void close()
    Optional method that could be implemented if the actual format needs some close procedure.
protected abstract void closeArray(String key, boolean nested, boolean newline)
protected abstract void closeObject(String key)
protected abstract String generateJson()
protected String getImported()
List<String> getInLinks()
    gets set of inlinks
String getJsonData()
    Get a string representation of the JSON structure of the URL content.
String getJsonData(String url, Content content, Metadata metadata)
    Returns a string representation of the JSON structure of the URL content.
String getJsonData(String url, Content content, Metadata metadata, ParseData parseData)
    Returns a string representation of the JSON structure of the URL content.
protected String getKey()
protected String getMethod()
protected String getRequestAccept()
protected String getRequestAcceptEncoding()
protected String getRequestAcceptLanguage()
protected String getRequestContactEmail()
protected String getRequestContactName()
protected String getRequestHostAddress()
protected String getRequestHostName()
protected String getRequestRobots()
protected String getRequestSoftware()
protected String getRequestUserAgent()
protected String getResponseAddress()
protected String getResponseContent()
protected String getResponseContentEncoding()
protected String getResponseContentType()
protected String getResponseDate()
protected String getResponseHostName()
protected String getResponseServer()
protected String getResponseStatus()
protected String getTimestamp()
protected String getUrl()
void setInLinks(List<String> inLinks)
    sets inlinks of this document
protected abstract void startArray(String key, boolean nested, boolean newline)
protected abstract void startObject(String key)
protected abstract void writeArrayValue(String value)
protected abstract void writeKeyNull(String key)
protected abstract void writeKeyValue(String key, String value)
-
-
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
url
protected String url
-
content
protected Content content
-
metadata
protected Metadata metadata
-
conf
protected Configuration conf
-
keyPrefix
protected String keyPrefix
-
simpleDateFormat
protected boolean simpleDateFormat
-
jsonArray
protected boolean jsonArray
-
reverseKey
protected boolean reverseKey
-
reverseKeyValue
protected String reverseKeyValue
-
-
Constructor Detail
-
AbstractCommonCrawlFormat
public AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) throws IOException
- Throws:
IOException
-
-
Method Detail
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration both the Content and Metadata.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata, ParseData parseData) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration the Content, Metadata and ParseData.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
parseData - url ParseData
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData() throws IOException
Description copied from interface: CommonCrawlFormat
Get a string representation of the JSON structure of the URL content.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
writeKeyValue
protected abstract void writeKeyValue(String key, String value) throws IOException
- Throws:
IOException
-
writeKeyNull
protected abstract void writeKeyNull(String key) throws IOException
- Throws:
IOException
-
startArray
protected abstract void startArray(String key, boolean nested, boolean newline) throws IOException
- Throws:
IOException
-
closeArray
protected abstract void closeArray(String key, boolean nested, boolean newline) throws IOException
- Throws:
IOException
-
writeArrayValue
protected abstract void writeArrayValue(String value) throws IOException
- Throws:
IOException
-
startObject
protected abstract void startObject(String key) throws IOException
- Throws:
IOException
-
closeObject
protected abstract void closeObject(String key) throws IOException
- Throws:
IOException
-
generateJson
protected abstract String generateJson() throws IOException
- Throws:
IOException
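
The abstract methods above (writeKeyValue, writeKeyNull, startArray, closeArray, writeArrayValue, startObject, closeObject, generateJson) read like hooks that a concrete format fills in while the base class decides what gets written. The following is a minimal sketch of that arrangement, not the Nutch implementation: SketchFormat, StringBuilderFormat and SketchFormatDemo are hypothetical stand-ins, the key names "url", "imported" and "inlinks" are chosen only for illustration, and the nested and newline flags are ignored because this page does not describe them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the hook methods listed above; this is NOT the Nutch class.
abstract class SketchFormat {
  protected final String url;
  protected final List<String> inLinks;

  SketchFormat(String url, List<String> inLinks) {
    this.url = url;
    this.inLinks = inLinks;
  }

  // Template method: fixes the order of the output and delegates the writing.
  public String getJsonData() throws IOException {
    startObject(null);
    writeKeyValue("url", url);          // assumed key name
    writeKeyNull("imported");           // assumed key name
    startArray("inlinks", false, false);
    for (String link : inLinks) {
      writeArrayValue(link);
    }
    closeArray("inlinks", false, false);
    closeObject(null);
    return generateJson();
  }

  protected abstract void writeKeyValue(String key, String value) throws IOException;
  protected abstract void writeKeyNull(String key) throws IOException;
  protected abstract void startArray(String key, boolean nested, boolean newline) throws IOException;
  protected abstract void closeArray(String key, boolean nested, boolean newline) throws IOException;
  protected abstract void writeArrayValue(String value) throws IOException;
  protected abstract void startObject(String key) throws IOException;
  protected abstract void closeObject(String key) throws IOException;
  protected abstract String generateJson() throws IOException;
}

// Assumed concrete format that appends compact JSON to a StringBuilder.
class StringBuilderFormat extends SketchFormat {
  private final StringBuilder out = new StringBuilder();
  private boolean needComma = false; // true once the current container has an element

  StringBuilderFormat(String url, List<String> inLinks) {
    super(url, inLinks);
  }

  private void separator() {
    if (needComma) out.append(',');
  }

  // Minimal escaping: backslash and double quote only.
  private static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  private void open(String key, char bracket) {
    separator();
    if (key != null) out.append(quote(key)).append(':');
    out.append(bracket);
    needComma = false; // the new container starts empty
  }

  private void shut(char bracket) {
    out.append(bracket);
    needComma = true; // the closed container counts as an element of its parent
  }

  @Override protected void writeKeyValue(String key, String value) {
    separator();
    out.append(quote(key)).append(':').append(quote(value));
    needComma = true;
  }

  @Override protected void writeKeyNull(String key) {
    separator();
    out.append(quote(key)).append(":null");
    needComma = true;
  }

  @Override protected void writeArrayValue(String value) {
    separator();
    out.append(quote(value));
    needComma = true;
  }

  // The nested and newline flags are accepted but unused in this sketch.
  @Override protected void startArray(String key, boolean nested, boolean newline) { open(key, '['); }
  @Override protected void closeArray(String key, boolean nested, boolean newline) { shut(']'); }
  @Override protected void startObject(String key) { open(key, '{'); }
  @Override protected void closeObject(String key) { shut('}'); }
  @Override protected String generateJson() { return out.toString(); }
}

class SketchFormatDemo {
  // Convenience wrapper that converts the checked exception for callers.
  static String render(String url, List<String> inLinks) {
    try {
      return new StringBuilderFormat(url, inLinks).getJsonData();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(render("http://example.org/", Arrays.asList("http://example.org/a")));
  }
}
```

Keeping the document layout in the base class and the serialization in the subclass is one plausible reason the page lists several direct subclasses (Jackson, Jettinson, Simple, WARC) that share the same getter methods.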
-
getUrl
protected String getUrl()
-
getTimestamp
protected String getTimestamp()
-
getMethod
protected String getMethod()
-
getRequestHostName
protected String getRequestHostName()
-
getRequestHostAddress
protected String getRequestHostAddress()
-
getRequestSoftware
protected String getRequestSoftware()
-
getRequestRobots
protected String getRequestRobots()
-
getRequestContactName
protected String getRequestContactName()
-
getRequestContactEmail
protected String getRequestContactEmail()
-
getRequestAccept
protected String getRequestAccept()
-
getRequestAcceptEncoding
protected String getRequestAcceptEncoding()
-
getRequestAcceptLanguage
protected String getRequestAcceptLanguage()
-
getRequestUserAgent
protected String getRequestUserAgent()
-
getResponseStatus
protected String getResponseStatus()
-
getResponseHostName
protected String getResponseHostName()
-
getResponseAddress
protected String getResponseAddress()
-
getResponseContentEncoding
protected String getResponseContentEncoding()
-
getResponseContentType
protected String getResponseContentType()
-
getInLinks
public List<String> getInLinks()
Description copied from interface: CommonCrawlFormat
Gets the list of inlinks.
- Specified by:
getInLinks in interface CommonCrawlFormat
- Returns:
the inlinks of this document
-
setInLinks
public void setInLinks(List<String> inLinks)
Description copied from interface: CommonCrawlFormat
sets inlinks of this document
- Specified by:
setInLinks in interface CommonCrawlFormat
- Parameters:
inLinks - list of inlinks
-
getResponseDate
protected String getResponseDate()
-
getResponseServer
protected String getResponseServer()
-
getResponseContent
protected String getResponseContent()
-
getKey
protected String getKey()
-
getImported
protected String getImported()
-
close
public void close()
Description copied from interface: CommonCrawlFormat
Optional method that could be implemented if the actual format needs some close procedure.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface CommonCrawlFormat
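
Because the class implements AutoCloseable and close() is declared without a throws clause, an instance can be managed with try-with-resources and no catch block is needed for the close call itself. The sketch below illustrates this with hypothetical stand-ins (SketchCloseableFormat, BufferedSketchFormat, CloseDemo); it does not use the Nutch types, and the buffer it releases is only an assumed example of a "close procedure".

```java
import java.io.Closeable;

// Hypothetical stand-in interface, not org.apache.nutch.tools.CommonCrawlFormat.
interface SketchCloseableFormat extends Closeable {
  String getJsonData();

  // Redeclared without a checked exception, mirroring the close() signature above.
  @Override
  void close();
}

// Assumed format that holds a buffer and releases it on close.
class BufferedSketchFormat implements SketchCloseableFormat {
  private StringBuilder buffer = new StringBuilder("{}");
  private boolean closed = false;

  @Override
  public String getJsonData() {
    if (closed) {
      throw new IllegalStateException("format already closed");
    }
    return buffer.toString();
  }

  @Override
  public void close() {
    buffer = null; // the assumed "close procedure": drop the buffer
    closed = true;
  }

  boolean isClosed() {
    return closed;
  }
}

class CloseDemo {
  // The return value is computed first; close() then runs when the try block exits.
  static String render(BufferedSketchFormat format) {
    try (BufferedSketchFormat f = format) {
      return f.getJsonData();
    }
  }

  public static void main(String[] args) {
    BufferedSketchFormat format = new BufferedSketchFormat();
    System.out.println(render(format));
    System.out.println(format.isClosed());
  }
}
```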
-
-