Package org.apache.nutch.tools.arc
Class ArcRecordReader
- java.lang.Object
-
- org.apache.hadoop.mapreduce.RecordReader<Text,BytesWritable>
-
- org.apache.nutch.tools.arc.ArcRecordReader
-
- All Implemented Interfaces:
Closeable, AutoCloseable
public class ArcRecordReader extends RecordReader<Text,BytesWritable>
The ArcRecordReader class provides a record reader which reads records from arc files. Arc files are essentially tars of gzips: each record in an arc file is individually gzip-compressed, and multiple records are concatenated together to form a complete arc. For more information on the arc file format, see ArcFileFormat. Arc files are used by the Internet Archive and grub projects.
See Also:
- archive.org, grub.org
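The "concatenated gzip records" layout described above can be illustrated with the JDK alone. The following is a minimal sketch, not code from Nutch: the class name `ConcatenatedGzipSketch` and its helper methods are hypothetical, and the records are plain strings rather than real arc records. It compresses two records into separate gzip members, joins them back to back, and reads them again with `java.util.zip.GZIPInputStream`, which continues across member boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class ConcatenatedGzipSketch {

  /** Compresses one record into its own, complete gzip member. */
  static byte[] gzip(String record) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
      gz.write(record.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Appends the given gzip members back to back. */
  static byte[] concat(byte[]... members) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] member : members) {
      out.write(member, 0, member.length);
    }
    return out.toByteArray();
  }

  /** Decompresses all members in sequence and returns the joined text. */
  static String gunzipAll(byte[] data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] chunk = new byte[256];
      int n;
      while ((n = gz.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] arcLike = concat(gzip("record one\n"), gzip("record two\n"));
    System.out.print(gunzipAll(arcLike));
  }
}
```

Because every member is a complete gzip stream, a reader that needs individual records (rather than the joined text shown here) has to locate the start of each member itself, which is where a gzip magic-number check becomes useful.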
-
-
Field Summary
Fields
Modifier and Type              Field
protected Configuration        conf
protected long                 fileLen
protected FSDataInputStream    in
protected long                 pos
protected long                 splitEnd
protected long                 splitLen
protected long                 splitStart
-
Constructor Summary
Constructors
Constructor                                             Description
ArcRecordReader(Configuration conf, FileSplit split)    Constructor that sets the configuration and file split.
-
Method Summary
Methods
Modifier and Type    Method                                                   Description
void                 close()                                                  Closes the record reader resources.
Text                 createKey()                                              Creates a new instance of the Text object for the key.
BytesWritable        createValue()                                            Creates a new instance of the BytesWritable object for the value.
Text                 getCurrentKey()
BytesWritable        getCurrentValue()
long                 getPos()                                                 Returns the current position in the file.
float                getProgress()                                            Returns the percentage of progress in processing the file.
void                 initialize(InputSplit split, TaskAttemptContext context)
static boolean       isMagic(byte[] input)                                    Returns true if the byte array passed matches the gzip header magic number.
boolean              next(Text key, BytesWritable value)                      Returns true if the next record in the split is read into the key and value pair.
boolean              nextKeyValue()
-
-
-
Field Detail
-
conf
protected Configuration conf
-
splitStart
protected long splitStart
-
pos
protected long pos
-
splitEnd
protected long splitEnd
-
splitLen
protected long splitLen
-
fileLen
protected long fileLen
-
in
protected FSDataInputStream in
-
-
Constructor Detail
-
ArcRecordReader
public ArcRecordReader(Configuration conf, FileSplit split) throws IOException
Constructor that sets the configuration and file split.
Parameters:
conf - The job configuration.
split - The file split to read from.
Throws:
IOException - If an IO error occurs while initializing file split.
-
-
Method Detail
-
isMagic
public static boolean isMagic(byte[] input)
Returns true if the byte array passed matches the gzip header magic number.
Parameters:
input - The byte array to check.
Returns:
True if the byte array matches the gzip header magic number.
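For orientation, a gzip magic-number check can be sketched as follows. This is an illustration, not the Nutch implementation: the class `GzipMagicSketch` and the method `startsWithGzipMagic` are hypothetical names, and the sketch assumes a prefix match on the two identification bytes defined by the gzip format (0x1f, 0x8b). Whether `isMagic` itself requires an exact-length array or accepts a longer one is not stated on this page.

```java
// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class GzipMagicSketch {

  // The two identification bytes that open every gzip member (ID1, ID2).
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

  /** Returns true if the array starts with the gzip identification bytes. */
  static boolean startsWithGzipMagic(byte[] input) {
    // Assumption: null or too-short input is simply "not gzip".
    if (input == null || input.length < GZIP_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < GZIP_MAGIC.length; i++) {
      if (input[i] != GZIP_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] gzipLike = {(byte) 0x1f, (byte) 0x8b, 0x08};
    byte[] plainText = {'h', 'i'};
    System.out.println(startsWithGzipMagic(gzipLike));
    System.out.println(startsWithGzipMagic(plainText));
  }
}
```

The casts to `byte` matter in Java: `0x8b` is an `int` literal outside the signed byte range, so comparing an uncast `0x8b` against a `byte` element would never match.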
-
close
public void close() throws IOException
Closes the record reader resources.
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Specified by:
close in class RecordReader<Text,BytesWritable>
Throws:
IOException
-
createKey
public Text createKey()
Creates a new instance of the Text object for the key.
Returns:
Text
-
createValue
public BytesWritable createValue()
Creates a new instance of the BytesWritable object for the value.
Returns:
BytesWritable
-
getPos
public long getPos() throws IOException
Returns the current position in the file.
Returns:
The current position in the file, as a long.
Throws:
IOException - If there is a fatal I/O error reading the position within the FSDataInputStream.
-
getProgress
public float getProgress() throws IOException
Returns the percentage of progress in processing the file. This will be represented as a float from 0 to 1 with 1 being 100% completed.
Specified by:
getProgress in class RecordReader<Text,BytesWritable>
Returns:
The percentage of progress as a float from 0 to 1.
Throws:
IOException
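One plausible way to derive a 0-to-1 progress value from a split is sketched below. The formula is an assumption, since this page does not show how ArcRecordReader computes it; the parameter names are borrowed from the `splitStart`, `splitLen`, and `pos` fields listed above, and the class `SplitProgressSketch` is hypothetical.

```java
// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class SplitProgressSketch {

  /**
   * Returns how far pos has advanced through the split as a value in [0, 1].
   * Assumption: an empty or negative-length split counts as fully processed.
   */
  static float progress(long splitStart, long splitLen, long pos) {
    if (splitLen <= 0) {
      return 1.0f;
    }
    float fraction = (pos - splitStart) / (float) splitLen;
    // Clamp, because a reader may finish a record that runs past the split end.
    return Math.max(0.0f, Math.min(1.0f, fraction));
  }

  public static void main(String[] args) {
    // A split covering bytes 100..300 with the reader positioned at byte 200.
    System.out.println(progress(100L, 200L, 200L));
  }
}
```

Clamping keeps the result inside the documented 0-to-1 range even when the position lies before the split start or beyond its end.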
-
getCurrentValue
public BytesWritable getCurrentValue()
Specified by:
getCurrentValue in class RecordReader<Text,BytesWritable>
-
getCurrentKey
public Text getCurrentKey()
Specified by:
getCurrentKey in class RecordReader<Text,BytesWritable>
-
nextKeyValue
public boolean nextKeyValue()
Specified by:
nextKeyValue in class RecordReader<Text,BytesWritable>
-
initialize
public void initialize(InputSplit split, TaskAttemptContext context)
Specified by:
initialize in class RecordReader<Text,BytesWritable>
-
next
public boolean next(Text key, BytesWritable value) throws IOException
Returns true if the next record in the split is read into the key and value pair. The key will be the arc record header and the value will be the raw content bytes of the arc record.
Parameters:
key - The record key.
value - The record value.
Returns:
True if the next record is read.
Throws:
IOException - If an error occurs while reading the record value.
-
-