package parquet
Type Members
- final class ParquetDictionary extends Dictionary
- class ParquetFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable
- class ParquetFilters extends AnyRef
Utility functions to convert Spark data source filters to Parquet filters.
- class ParquetOptions extends Serializable
Options for the Parquet data source.
- class ParquetOutputWriter extends OutputWriter
- class ParquetReadSupport extends ReadSupport[InternalRow] with Logging
A Parquet ReadSupport implementation for reading Parquet records as Catalyst InternalRows.
The ReadSupport API is somewhat overcomplicated for historical reasons. In older versions of parquet-mr (1.6.0rc3 and earlier), ReadSupport needed to be instantiated and initialized twice, on both the driver side and the executor side. The init() method is for driver-side initialization, while prepareForRead() is for the executor side. Starting from parquet-mr 1.6.0, however, this is no longer the case, and ReadSupport is only instantiated and initialized on the executor side. So, in theory, it is now fine to combine these two methods into a single initialization method. The only reason (I could think of) to still have them here is parquet-mr API backwards-compatibility.
For this reason, we no longer rely on ReadContext to pass the requested schema from init() to prepareForRead(), but use a private var for simplicity.
- class ParquetToSparkSchemaConverter extends AnyRef
This converter class is used to convert Parquet MessageType to Spark SQL StructType.
Parquet format backwards-compatibility rules are respected when converting Parquet MessageType schemas.
- See also
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
- class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging
A Parquet WriteSupport implementation that writes Catalyst InternalRows as Parquet messages. This class can write Parquet data in two modes:
- Standard mode: Parquet data are written in the standard format defined in the parquet-format spec.
- Legacy mode: Parquet data are written in a legacy format compatible with Spark 1.4 and prior.
This behavior can be controlled by the SQL option spark.sql.parquet.writeLegacyFormat. The value of this option is propagated to this class by the init() method and its Hadoop configuration argument.
- class SparkToParquetSchemaConverter extends AnyRef
This converter class is used to convert Spark SQL StructType to Parquet MessageType.
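As a rough illustration of the spark.sql.parquet.writeLegacyFormat option described under ParquetWriteSupport, the following sketch sets the option when building a session and then writes a small Parquet dataset. It assumes Spark SQL is on the classpath; the object name, application name, and temporary output directory are hypothetical, and the sketch does not inspect how the written files differ between the two modes.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; not part of the parquet package.
object WriteLegacyFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-legacy-format-sketch") // hypothetical name
      .master("local[*]")
      // Request legacy mode for Parquet writes made through this session.
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()

    try {
      // Write into a fresh subdirectory of a temporary directory (assumed location).
      val out = Files.createTempDirectory("parquet-legacy-").resolve("data").toString
      spark.range(10).toDF("id").write.parquet(out)
    } finally {
      spark.stop()
    }
  }
}
```

The option can also be changed on an existing session with spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false") before a write.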
- abstract class SpecificParquetRecordReaderBase[T] extends RecordReader[Void, T]
Base class for custom RecordReaders for Parquet that directly materialize to T. This class handles computing row groups, filtering on them, setting up the column readers, etc. This is heavily based on parquet-mr's RecordReader. TODO: move this to the parquet-mr project. There are performance benefits of doing it this way, albeit at a higher cost to implement. This base class is reusable.
- class VectorizedColumnReader extends AnyRef
Decoder to return values from a single column.
- class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase[AnyRef]
A specialized RecordReader that reads into InternalRows or ColumnarBatches directly using the Parquet column APIs. This is somewhat based on parquet-mr's ColumnReader.
TODO: handle complex types, decimals requiring more than 8 bytes, INT96, and schema mismatch. All of these can be handled efficiently and easily with codegen.
This class can return either InternalRows or ColumnarBatches. With whole-stage codegen enabled, this class returns ColumnarBatches, which offer significant performance gains. TODO: make this always return ColumnarBatches.
- class VectorizedPlainValuesReader extends ValuesReader with VectorizedValuesReader
An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
- final class VectorizedRleValuesReader extends ValuesReader with VectorizedValuesReader
A values reader for Parquet's run-length encoded data. This is based on the version in parquet-mr, with these changes:
- Supports the vectorized interface.
- Works on byte arrays (byte[]) instead of creating byte streams.
This encoding is used in multiple places:
- Definition/repetition levels
- Dictionary IDs
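To make the idea of decoding run-length encoded data straight from a byte array more concrete, here is a simplified sketch of Parquet's RLE / bit-packing hybrid layout as documented in the parquet-format encodings spec. It is not the code of VectorizedRleValuesReader: the class name RleSketchDecoder and its readInts method are hypothetical, it assumes well-formed input with complete bit-packed groups, and it omits the length prefix and bounds handling a real reader needs.

```scala
// Simplified, hypothetical decoder for Parquet's RLE / bit-packing hybrid encoding.
// It reads directly from a byte array using an offset, without any stream wrapper.
final class RleSketchDecoder(data: Array[Byte], bitWidth: Int) {
  require(bitWidth >= 0 && bitWidth <= 32, "bitWidth must be in [0, 32]")

  private var offset = 0                          // current read position in `data`
  private val bytesPerValue = (bitWidth + 7) / 8  // width of an RLE run's value

  // Reads an unsigned LEB128 varint, which is used as the run header.
  private def readUnsignedVarInt(): Int = {
    var result = 0
    var shift = 0
    var more = true
    while (more) {
      val b = data(offset) & 0xff
      offset += 1
      result |= (b & 0x7f) << shift
      shift += 7
      more = (b & 0x80) != 0
    }
    result
  }

  // Reads the repeated value of an RLE run (little-endian, bytesPerValue bytes).
  private def readRleValue(): Int = {
    var v = 0
    var i = 0
    while (i < bytesPerValue) {
      v |= (data(offset + i) & 0xff) << (8 * i)
      i += 1
    }
    offset += bytesPerValue
    v
  }

  // Extracts the index-th bitWidth-bit value of a bit-packed run (LSB first).
  private def unpack(runStart: Int, index: Int): Int = {
    var v = 0
    var bit = 0
    while (bit < bitWidth) {
      val pos = index * bitWidth + bit
      val b = data(runStart + pos / 8) & 0xff
      v |= ((b >>> (pos % 8)) & 1) << bit
      bit += 1
    }
    v
  }

  // Batched decode: returns the next `total` values in one call.
  def readInts(total: Int): Array[Int] = {
    val out = new Array[Int](total)
    var n = 0
    while (n < total) {
      val header = readUnsignedVarInt()
      if ((header & 1) == 0) {
        // RLE run: a single value repeated (header >>> 1) times.
        val value = readRleValue()
        val end = math.min(total, n + (header >>> 1))
        while (n < end) { out(n) = value; n += 1 }
      } else {
        // Bit-packed run: (header >>> 1) groups of 8 values each.
        val numValues = (header >>> 1) * 8
        val runStart = offset
        var i = 0
        while (i < numValues && n < total) {
          out(n) = unpack(runStart, i)
          n += 1
          i += 1
        }
        offset = runStart + (numValues * bitWidth) / 8
      }
    }
    out
  }
}

object RleSketchDemo {
  // An RLE run (value 5, four times) followed by one bit-packed group holding 0..7.
  val encoded: Array[Byte] = Array(0x08, 0x05, 0x03, 0x88, 0xC6, 0xFA).map(_.toByte)
  val decoded: Array[Int] = new RleSketchDecoder(encoded, bitWidth = 3).readInts(12)

  def main(args: Array[String]): Unit =
    println(decoded.mkString(", ")) // prints "5, 5, 5, 5, 0, 1, 2, 3, 4, 5, 6, 7"
}
```

Returning a whole array per call mirrors the batched style of the vectorized interface, where the cost of dispatching on the run header is paid once per run rather than once per value.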
- trait VectorizedValuesReader extends AnyRef
Interface for value decoding that supports vectorized (aka batched) decoding. TODO: merge this into parquet-mr.
Value Members
- object ParquetFileFormat extends Logging with Serializable
- object ParquetOptions extends Serializable
- object ParquetReadSupport
- object ParquetUtils
- object ParquetWriteSupport