Spark Project SQL 3.1.1-hadoop-2.7 API - org.apache.spark.sql.execution.datasources.parquet

final class ParquetDictionary extends Dictionary

class ParquetFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable

class ParquetFilters extends AnyRef

Utility functions for converting Spark data source filters to Parquet filters.

class ParquetOptions extends Serializable

Options for the Parquet data source.

class ParquetOutputWriter extends OutputWriter

class ParquetReadSupport extends ReadSupport[InternalRow] with Logging

A Parquet ReadSupport implementation for reading Parquet records as Catalyst InternalRows.

The ReadSupport API is somewhat overcomplicated for historical reasons. In older versions of parquet-mr (1.6.0rc3 and earlier), ReadSupport needed to be instantiated and initialized twice, once on the driver side and once on the executor side. The init() method handled driver-side initialization, while prepareForRead() handled the executor side. Starting from parquet-mr 1.6.0 this is no longer the case: ReadSupport is instantiated and initialized only on the executor side. In principle, the two methods could now be combined into a single initialization method; the only apparent reason to keep both is backwards compatibility with the parquet-mr API.

For this reason, we no longer rely on ReadContext to pass the requested schema from init() to prepareForRead(), and use a private var instead for simplicity.
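
The hand-off described above can be sketched in a few lines. This is only an illustration of the pattern (store the schema in a private field during init(), read it back in prepareForRead()); the class name, the String-typed schema, and the configuration key are all hypothetical and do not mirror the parquet-mr or Spark signatures.

```java
import java.util.Map;

// Hypothetical stand-in for a ReadSupport-style class. Names and types are
// illustrative only and are NOT the parquet-mr or Spark API.
final class SketchReadSupport {
    // Set by init(), consumed by prepareForRead(). This private field replaces
    // passing the schema through a context object between the two calls.
    private String requestedSchema;

    // First executor-side step: resolve the requested schema from configuration.
    void init(Map<String, String> conf) {
        requestedSchema = conf.get("sketch.requested.schema"); // hypothetical key
    }

    // Second executor-side step: use the stored schema to describe a materializer.
    String prepareForRead() {
        if (requestedSchema == null) {
            throw new IllegalStateException("init() must be called before prepareForRead()");
        }
        return "materializer for " + requestedSchema;
    }
}
```

Because both calls now happen on the same instance on the executor, a plain field is enough to carry state between them.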

class ParquetToSparkSchemaConverter extends AnyRef

This converter class is used to convert Parquet MessageType to Spark SQL StructType.

Parquet format backwards-compatibility rules are respected when converting Parquet MessageType schemas.

See also: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging

A Parquet WriteSupport implementation that writes Catalyst InternalRows as Parquet messages. This class can write Parquet data in two modes:

Standard mode: Parquet data are written in the standard format defined in the parquet-format spec.
Legacy mode: Parquet data are written in a legacy format compatible with Spark 1.4 and prior.

This behavior is controlled by the SQL option spark.sql.parquet.writeLegacyFormat. The value of this option is propagated to this class through the Hadoop configuration argument of the init() method.
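
As a rough illustration of how such a flag might change the written layout, the sketch below reads spark.sql.parquet.writeLegacyFormat from a configuration and picks a physical type for a decimal column. The assumptions are: a plain Map stands in for the Hadoop configuration, DecimalLayout is a hypothetical helper rather than a Spark class, and the precision thresholds (9 and 18 digits) reflect the commonly described behavior, where legacy mode always uses a fixed-length byte array and standard mode uses int-based storage for small precisions.

```java
import java.util.Map;

// Illustrative sketch only. A Map stands in for the Hadoop configuration and
// DecimalLayout is a hypothetical helper, not part of Spark.
final class DecimalLayout {
    static final String LEGACY_KEY = "spark.sql.parquet.writeLegacyFormat";

    // Reads the flag roughly the way an init() method might; absent means standard mode.
    static boolean isLegacy(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(LEGACY_KEY, "false"));
    }

    // Picks a Parquet physical type name for a decimal of the given precision.
    // Thresholds are an assumption: 9 digits fit in INT32, 18 digits fit in INT64.
    static String physicalType(int precision, boolean legacy) {
        if (legacy) {
            return "FIXED_LEN_BYTE_ARRAY";
        }
        if (precision <= 9) {
            return "INT32";
        }
        if (precision <= 18) {
            return "INT64";
        }
        return "FIXED_LEN_BYTE_ARRAY";
    }
}
```

For example, DecimalLayout.physicalType(12, DecimalLayout.isLegacy(conf)) selects between the int-based and byte-array layouts depending on whether the option is set in conf.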

class SparkToParquetSchemaConverter extends AnyRef

This converter class is used to convert Spark SQL StructType to Parquet MessageType.

abstract class SpecificParquetRecordReaderBase[T] extends RecordReader[Void, T]

Base class for custom RecordReaders for Parquet that directly materialize to T. This class handles computing row groups, filtering on them, setting up the column readers, etc. It is heavily based on parquet-mr's RecordReader. Implementing the reader this way has performance benefits, albeit at a higher implementation cost. This base class is reusable. TODO: move this to the parquet-mr project.

class VectorizedColumnReader extends AnyRef

Decoder to return values from a single column.

class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase[AnyRef]

A specialized RecordReader that reads into InternalRows or ColumnarBatches directly using the Parquet column APIs. This is somewhat based on parquet-mr's ColumnReader.

TODO: handle complex types, decimals requiring more than 8 bytes, INT96, and schema mismatches. All of these can be handled efficiently and easily with codegen.

This class can return either InternalRows or ColumnarBatches. With whole-stage codegen enabled, it returns ColumnarBatches, which offer significant performance gains. TODO: make this always return ColumnarBatches.

class VectorizedPlainValuesReader extends ValuesReader with VectorizedValuesReader

An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
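
To make the idea of a vectorized PLAIN decoder concrete, here is a minimal sketch for INT32 values, which PLAIN stores as consecutive 4-byte little-endian words. PlainIntBatchReader and its method names are hypothetical and are not the signatures of VectorizedPlainValuesReader; the sketch only contrasts a one-value-at-a-time read with a batched read into a caller-supplied array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not Spark's VectorizedPlainValuesReader.
final class PlainIntBatchReader {
    private final ByteBuffer buffer;

    PlainIntBatchReader(byte[] page) {
        // PLAIN stores INT32 values as consecutive 4-byte little-endian words.
        this.buffer = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Single-value path: decodes one INT32 and advances by 4 bytes.
    int readInteger() {
        return buffer.getInt();
    }

    // Batched path: decodes `total` values into `out`, starting at index `offset`.
    void readIntegers(int total, int[] out, int offset) {
        // The int view starts at the current position and inherits the byte order.
        buffer.asIntBuffer().get(out, offset, total);
        // The view has its own position, so advance the underlying buffer explicitly.
        buffer.position(buffer.position() + total * 4);
    }
}
```

The batched method amortizes per-call overhead across many values, which is the main point of a vectorized interface.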

final class VectorizedRleValuesReader extends ValuesReader with VectorizedValuesReader

A values reader for Parquet's run-length encoded data. This is based on the version in parquet-mr, with these changes:

Supports the vectorized interface.
Works on byte arrays (byte[]) instead of creating byte streams.

This encoding is used in multiple places:

Definition/repetition levels.
Dictionary ids.
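
For readers unfamiliar with the encoding, the following is a minimal sketch of decoding Parquet's RLE / bit-packing hybrid format directly from a byte array. RleHybridDecoder is a hypothetical class, not VectorizedRleValuesReader, and it makes simplifying assumptions: the bit width is supplied by the caller, all values are decoded in a single readInts call, and any unused tail of the last run is discarded.

```java
// Hypothetical sketch of RLE / bit-packing hybrid decoding over a byte[].
// Not Spark's VectorizedRleValuesReader; intended for one readInts call per buffer.
final class RleHybridDecoder {
    private final byte[] in;
    private final int bitWidth;
    private int pos = 0;

    RleHybridDecoder(byte[] in, int bitWidth) {
        this.in = in;
        this.bitWidth = bitWidth;
    }

    // Decodes `count` values into `out`, reading as many runs as needed.
    void readInts(int count, int[] out) {
        int filled = 0;
        while (filled < count) {
            int header = readUnsignedVarInt();
            if ((header & 1) == 0) {
                // RLE run: a single value repeated (header >>> 1) times.
                int runLength = header >>> 1;
                int value = readLittleEndian((bitWidth + 7) / 8);
                for (int i = 0; i < runLength && filled < count; i++) {
                    out[filled++] = value;
                }
            } else {
                // Bit-packed run: (header >>> 1) groups of 8 values, packed LSB first.
                int numValues = (header >>> 1) * 8;
                int start = pos;
                for (int i = 0; i < numValues && filled < count; i++) {
                    out[filled++] = readBits(start, i * bitWidth);
                }
                // Skip the whole run, including any padding values not consumed.
                pos = start + numValues * bitWidth / 8;
            }
        }
    }

    // Reads an unsigned LEB128 varint (7 payload bits per byte, low bits first).
    private int readUnsignedVarInt() {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in[pos++] & 0xFF;
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Reads a little-endian integer stored in `numBytes` bytes.
    private int readLittleEndian(int numBytes) {
        int value = 0;
        for (int i = 0; i < numBytes; i++) {
            value |= (in[pos++] & 0xFF) << (8 * i);
        }
        return value;
    }

    // Extracts `bitWidth` bits starting `bitOffset` bits after `startByte`.
    private int readBits(int startByte, int bitOffset) {
        int value = 0;
        for (int b = 0; b < bitWidth; b++) {
            int bit = bitOffset + b;
            int byteVal = in[startByte + (bit >>> 3)] & 0xFF;
            value |= ((byteVal >>> (bit & 7)) & 1) << b;
        }
        return value;
    }
}
```

With a bit width of 3, the bytes 0x0A 0x04 describe an RLE run of five 4s, and 0x03 0x88 0xC6 0xFA describe one bit-packed group holding the values 0 through 7. A production reader would additionally keep the state of a partially consumed run between calls.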

trait VectorizedValuesReader extends AnyRef

Interface for value decoding that supports vectorized (aka batched) decoding. TODO: merge this into parquet-mr.
