CompressedSplitLineReader (hadoop-mapreduce-client-core 2.7.4 API)

java.lang.Object
- org.apache.hadoop.util.LineReader
  - org.apache.hadoop.mapreduce.lib.input.SplitLineReader
    - org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader

All Implemented Interfaces:

Closeable, AutoCloseable
```
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class CompressedSplitLineReader
extends SplitLineReader
```
Line reader for compressed splits.

Reading records from a compressed split is tricky, as the LineRecordReader uses the reported compressed input stream position directly to determine when a split has ended. In addition, the compressed input stream usually fakes the actual byte position, often updating it only after the first compressed block after the split is accessed. Depending upon where the last compressed block of the split ends relative to the record delimiters, it can be easy to accidentally drop the last record or duplicate the last record between this split and the next.

Split end scenarios:

1) Last block of split ends in the middle of a record

Nothing special needs to be done here, since the compressed input stream will report a position after the split end once the record is fully read. The consumer of the next split will discard the partial record at the start of the split normally, and no data is lost or duplicated between the splits.

2) Last block of split ends in the middle of a delimiter

The line reader will continue to consume bytes into the next block to locate the end of the delimiter. If a custom delimiter is being used then the next record must be read by this split or it will be dropped. The consumer of the next split will not recognize the partial delimiter at the beginning of its split and will discard it along with the next record.

However, for the default delimiter processing there is a special case because CR, LF, and CRLF are all valid record delimiters. If the block ends with a CR then the reader must peek at the next byte to see if it is an LF and therefore part of the same record delimiter. Peeking at the next byte is an access to the next block and triggers the stream to report the end of the split. There are two cases based on the next byte:

A) The next byte is LF

The split needs to end after the current record is returned. The consumer of the next split will discard the first record, which is degenerate since LF is itself a delimiter, and start consuming records after that byte. If the current split tries to read another record then the record will be duplicated between splits.

B) The next byte is not LF

The current record will be returned but the stream will report the split has ended due to the peek into the next block. If the next record is not read then it will be lost, as the consumer of the next split will discard it before processing subsequent records. Therefore the next record beyond the reported split end must be consumed by this split to avoid data loss.

3) Last block of split ends at the beginning of a delimiter

This is equivalent to case 1, as the reader will consume bytes into the next block and trigger the end of the split. No further records should be read, as the consumer of the next split will discard the (degenerate) record at the beginning of its split.

4) Last block of split ends at the end of a delimiter

Nothing special needs to be done here. The reader will not start examining the bytes in the next block until the next record is read, so the stream will not report the end of the split just yet. Once the next record is read, the next block will be accessed and the stream will indicate the end of the split. The consumer of the next split will correctly discard the first record of its split, and no data is lost or duplicated.

If the default delimiter is used and the block ends at a CR then this is treated as case 2, since the reader does not yet know, without looking at subsequent bytes, whether the delimiter has ended.

NOTE: It is assumed that compressed input streams *never* return bytes from multiple compressed blocks from a single read. A stream that does not honor this assumption will violate the buffering performed by this class, as the reader will access bytes in the next block after the split before returning all of the records from the previous block.
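The scenarios above reduce to one decision, taken when the reader first pulls bytes from the block after the split: does this split owe one more record? The following is a minimal sketch of that decision in plain Java, written from the description above rather than copied from the Hadoop source. The class `SplitEndDecisionSketch`, the method `needsAdditionalRecord`, and all of its parameters are hypothetical names introduced for illustration only.

```java
final class SplitEndDecisionSketch {

    /**
     * Hypothetical helper (not part of the Hadoop API): decides whether the
     * current split has to read one record past its reported end.
     *
     * @param inDelimiter           true if the previous block ended part-way through a delimiter
     * @param usingDefaultDelimiter true for CR / LF / CRLF handling, false for a custom delimiter
     * @param bytesRead             number of bytes obtained from the next block
     * @param firstByteOfNextBlock  first byte of the next block (ignored if bytesRead <= 0)
     */
    static boolean needsAdditionalRecord(boolean inDelimiter,
                                         boolean usingDefaultDelimiter,
                                         int bytesRead,
                                         byte firstByteOfNextBlock) {
        // Cases 1, 3 and 4: the block did not end inside a delimiter,
        // or there is no further data to read.
        if (!inDelimiter || bytesRead <= 0) {
            return false;
        }
        if (usingDefaultDelimiter) {
            // Case 2A: CR followed by LF. The next split treats the LF as a
            // delimiter of its own, so no extra record is owed.
            // Case 2B: CR followed by anything else. The next split will
            // discard its first record, so this split must read it.
            return firstByteOfNextBlock != '\n';
        }
        // Custom delimiter: the next split cannot recognize a partial
        // delimiter, so this split must read the following record.
        return true;
    }

    public static void main(String[] args) {
        System.out.println("2A (CR then LF):      " + needsAdditionalRecord(true, true, 1, (byte) '\n'));
        System.out.println("2B (CR then other):   " + needsAdditionalRecord(true, true, 1, (byte) 'x'));
        System.out.println("2  (custom delimiter): " + needsAdditionalRecord(true, false, 1, (byte) 'x'));
        System.out.println("1/3/4 (not in delim): " + needsAdditionalRecord(false, true, 1, (byte) 'x'));
    }
}
```

The sketch is stateless and covers only the single buffer refill that crosses the block boundary; a real reader would also have to remember the result until the extra record has been consumed.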

Constructor Summary

Constructors
Constructor and Description
`CompressedSplitLineReader(org.apache.hadoop.io.compress.SplitCompressionInputStream in, org.apache.hadoop.conf.Configuration conf, byte[] recordDelimiterBytes)`

Method Summary

Modifier and Type	Method and Description
`protected int`	`fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)`
`boolean`	`needAdditionalRecordAfterSplit()`
`int`	`readLine(org.apache.hadoop.io.Text str, int maxLineLength, int maxBytesToConsume)`
`protected void`	`unsetNeedAdditionalRecordAfterSplit()`

Methods inherited from class org.apache.hadoop.util.LineReader
close, getBufferPosn, getBufferSize, readLine, readLine

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

CompressedSplitLineReader

public CompressedSplitLineReader(org.apache.hadoop.io.compress.SplitCompressionInputStream in,
                                 org.apache.hadoop.conf.Configuration conf,
                                 byte[] recordDelimiterBytes)
                          throws IOException

Throws: IOException

Method Detail

fillBuffer

protected int fillBuffer(InputStream in,
                         byte[] buffer,
                         boolean inDelimiter)
                  throws IOException

Overrides: fillBuffer in class org.apache.hadoop.util.LineReader
Throws: IOException

readLine

public int readLine(org.apache.hadoop.io.Text str,
                    int maxLineLength,
                    int maxBytesToConsume)
             throws IOException

Overrides: readLine in class org.apache.hadoop.util.LineReader
Throws: IOException

needAdditionalRecordAfterSplit
```
public boolean needAdditionalRecordAfterSplit()
```
Overrides:

needAdditionalRecordAfterSplit in class SplitLineReader
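
A caller that drives the reader would typically combine the reported stream position with this flag, so that a record owed past the split end is still read exactly once. The sketch below models that loop in plain Java without any Hadoop dependency. `RecordSource`, `ScriptedSource`, and `readSplit` are hypothetical names introduced for illustration, and the rule that at most one record is allowed after the position passes the split end is an assumption drawn from the class description, not a statement of the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitConsumerSketch {

    /** Hypothetical stand-in for the reader operations the loop relies on. */
    interface RecordSource {
        long position();                          // position reported by the stream
        boolean needAdditionalRecordAfterSplit(); // true if one more record is owed
        String nextRecord();                      // null once the input is exhausted
    }

    /** Scripted source: positions[i] is the position reported before record i is read. */
    static final class ScriptedSource implements RecordSource {
        private final String[] records;
        private final long[] positions;
        private final long splitEnd;
        private final boolean needExtra;
        private int index = 0;
        private boolean finished = false;

        ScriptedSource(String[] records, long[] positions, long splitEnd, boolean needExtra) {
            this.records = records;
            this.positions = positions;
            this.splitEnd = splitEnd;
            this.needExtra = needExtra;
        }

        @Override
        public long position() {
            return positions[Math.min(index, positions.length - 1)];
        }

        @Override
        public boolean needAdditionalRecordAfterSplit() {
            return !finished && needExtra;
        }

        @Override
        public String nextRecord() {
            if (index >= records.length) {
                return null;
            }
            // Assumption: once the stream reports a position past the split
            // end, allow at most this one further record.
            if (position() > splitEnd) {
                finished = true;
            }
            return records[index++];
        }
    }

    /** Reads while the position is within the split, or while an extra record is owed. */
    static List<String> readSplit(RecordSource in, long splitEnd) {
        List<String> out = new ArrayList<>();
        while (in.position() <= splitEnd || in.needAdditionalRecordAfterSplit()) {
            String record = in.nextRecord();
            if (record == null) {
                break;
            }
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] records = {"r1", "r2", "r3", "r4"};
        long[] positions = {0, 10, 25, 25}; // reading r2 peeks into the next block
        System.out.println(readSplit(new ScriptedSource(records, positions, 20, true), 20));
        System.out.println(readSplit(new ScriptedSource(records, positions, 20, false), 20));
    }
}
```

With the flag set, the scripted split reads one record beyond the reported end and then stops. With the flag clear, it stops as soon as the position passes the end, leaving the following records to the next split.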

unsetNeedAdditionalRecordAfterSplit
```
protected void unsetNeedAdditionalRecordAfterSplit()
```
Overrides:

unsetNeedAdditionalRecordAfterSplit in class org.apache.hadoop.util.LineReader
