Close the file after reading it.
Get the current Key.
Get the current Value.
(Seq[Row]) Value is a list of heterogeneous lists. It will be converted to List[Row] later.
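For illustration only (not the library's actual code), the conversion mentioned above could be as simple as:

```scala
import org.apache.spark.sql.Row

// Illustrative only: the value served to the Mapper is a Seq[Row]
// (a list of heterogeneous lists), later materialised as a List[Row].
val recordValue: Seq[Row] = Seq(Row(1, 3.14), Row(2, 2.71))
val converted: List[Row] = recordValue.toList
```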
Fancy way of getting a progress bar. Useful to know whether you have time for a coffee and a cigarette before the next run.
(Float) progression inside a block.
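As a hedged sketch, this progression can be computed from the split boundaries (the names splitStart, splitEnd, and currentPosition are illustrative, not necessarily the library's fields):

```scala
// Illustrative only: the fraction of the split already consumed, given the
// start/end byte indices of the split and the current position of the cursor.
def progression(splitStart: Long, splitEnd: Long, currentPosition: Long): Float =
  if (splitEnd == splitStart) 0.0f
  else math.min(1.0f, (currentPosition - splitStart).toFloat / (splitEnd - splitStart))
```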
Here an executor will come and ask for a block of data by calling initialize(). Hadoop will split the data into records, and those records will be sent. One then needs to know: the data file, the starting index of a split (byte index), the size of one record of data (bytes), and the ending index of a split (byte index). A sketch of this bookkeeping follows below.
Typically, a record must not be bigger than 1 MB for the process to be efficient. Otherwise you will trigger a lot of Garbage Collector calls!
(InputSplit) Represents the data to be processed by an individual Mapper.
(TaskAttemptContext) Currently active context to access contextual information about running tasks.
(Long) the current position of the cursor in the file.
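As a hedged illustration of what initialize() must gather, assuming the standard Hadoop FileSplit API (the function and variable names here are hypothetical):

```scala
import org.apache.hadoop.mapreduce.{InputSplit, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Illustrative only: extract the data file and the byte boundaries of the
// split. The record size itself would come from the FITS header (not shown).
def describeSplit(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
  val fileSplit = inputSplit.asInstanceOf[FileSplit]
  val file = fileSplit.getPath                     // the data file
  val splitStart = fileSplit.getStart              // starting index of the split (byte)
  val splitEnd = splitStart + fileSplit.getLength  // ending index of the split (byte)
  // One would then open an input stream on `file` and seek(splitStart),
  // which sets the cursor position described above.
}
```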
Here you describe how the records are made and how the split data is sent.
(Boolean) true if the Mapper has not reached the end of the split; false otherwise.
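A minimal sketch of the record-by-record reading logic behind nextKeyValue(); the names nextRecord, recordSize, splitEnd, and currentPosition are assumptions made for illustration, not the library's actual fields:

```scala
import java.io.DataInputStream

// Illustrative only: read one record worth of bytes and advance the cursor;
// return None once the Mapper has reached the end of the split.
def nextRecord(
    stream: DataInputStream,
    recordSize: Int,
    splitEnd: Long,
    currentPosition: Long): Option[(Array[Byte], Long)] = {
  if (currentPosition >= splitEnd) {
    None // end of the split: nextKeyValue() would return false here
  } else {
    val buffer = new Array[Byte](recordSize)
    stream.readFully(buffer) // one record of binary data
    Some((buffer, currentPosition + recordSize))
  }
}
```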
Class to handle the relationship between executors and HDFS when reading a FITS file: File -> InputSplit -> RecordReader (this class) -> Mapper (executors). It extends the abstract class RecordReader from Hadoop. The idea is to describe how the FITS file is split into blocks and records in HDFS. First, the file is split into physical blocks in HDFS, whose size is given by the Hadoop configuration (typically 128 MB). Then, inside a block, the data is sent to the executors record by record (logical split), each record being smaller than 128 MB. The purpose of this class is to describe this second step, that is, the splitting of blocks into records.
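To make that chain concrete, here is a hedged skeleton showing how a custom RecordReader plugs into a Hadoop FileInputFormat. The class names are made up for illustration, the method bodies are stubs, and the key type (LongWritable) is an assumption; only the value type (Seq[Row]) follows from the description above:

```scala
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.sql.Row

// Illustrative skeleton only: the InputFormat hands each InputSplit to a
// fresh RecordReader, which then serves records to the Mapper.
class SketchFitsInputFormat extends FileInputFormat[LongWritable, Seq[Row]] {
  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[LongWritable, Seq[Row]] =
    new SketchFitsRecordReader
}

class SketchFitsRecordReader extends RecordReader[LongWritable, Seq[Row]] {
  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = ()
  override def nextKeyValue(): Boolean = false           // stub: no records served
  override def getCurrentKey(): LongWritable = new LongWritable(0L)
  override def getCurrentValue(): Seq[Row] = Seq.empty
  override def getProgress(): Float = 0.0f
  override def close(): Unit = ()                        // close the file here
}
```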
The data is first read as chunks of raw bytes, then converted to the correct type element by element, and finally grouped into rows.
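A minimal sketch of that three-step decoding, assuming big-endian binary data (as in FITS) and a made-up two-column record layout (one Int followed by one Double):

```scala
import java.nio.ByteBuffer
import org.apache.spark.sql.Row

// Illustrative only: decode a chunk of raw bytes into typed elements, then
// group them into Rows. The schema here is hypothetical.
def chunkToRows(chunk: Array[Byte]): Seq[Row] = {
  val recordSize = 4 + 8           // bytes per row: Int + Double
  val buf = ByteBuffer.wrap(chunk) // ByteBuffer is big-endian by default
  (0 until chunk.length / recordSize).map { _ =>
    val i = buf.getInt    // element-by-element conversion...
    val d = buf.getDouble
    Row(i, d)             // ...then grouped into a Row
  }
}
```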