Class ParquetWriter.Builder<T,​SELF extends ParquetWriter.Builder<T,​SELF>>

  • Type Parameters:
    T - The type of objects written by the constructed ParquetWriter.
    SELF - The type of this builder that is returned by builder methods
    Direct Known Subclasses:
    ExampleParquetWriter.Builder
    Enclosing class:
    ParquetWriter<T>

    public abstract static class ParquetWriter.Builder<T,​SELF extends ParquetWriter.Builder<T,​SELF>>
    extends Object
    An abstract builder class for ParquetWriter instances. Object models should extend this builder to provide writer configuration options.
    • Constructor Detail

      • Builder

        protected Builder​(org.apache.hadoop.fs.Path path)
      • Builder

        protected Builder​(org.apache.parquet.io.OutputFile path)
    • Method Detail

      • self

        protected abstract SELF self()
        Returns:
        this as the correct subclass of ParquetWriter.Builder.
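The SELF type parameter and the self() method together let inherited builder methods return the subclass type, so subclass-specific options stay reachable in a method chain. The following is a minimal sketch of that pattern only. BaseBuilder, RecordBuilder, withSchemaName and describe are hypothetical names, not part of Parquet.

```java
// Hypothetical classes illustrating the self-typed builder pattern; not Parquet code.
abstract class BaseBuilder<T, SELF extends BaseBuilder<T, SELF>> {
    protected int pageSize = 1024 * 1024; // illustrative default, an assumption
    protected boolean validation = false;

    // Each concrete subclass returns `this`, typed as itself.
    protected abstract SELF self();

    public SELF withPageSize(int pageSize) {
        this.pageSize = pageSize;
        return self();
    }

    public SELF withValidation(boolean validation) {
        this.validation = validation;
        return self();
    }
}

final class RecordBuilder extends BaseBuilder<String, RecordBuilder> {
    private String schemaName = "";

    @Override
    protected RecordBuilder self() {
        return this;
    }

    // An option that only this subclass knows about.
    public RecordBuilder withSchemaName(String schemaName) {
        this.schemaName = schemaName;
        return this;
    }

    public String describe() {
        return schemaName + ":" + pageSize + ":" + validation;
    }
}

class SelfTypedBuilderDemo {
    static String demo() {
        // withPageSize returns RecordBuilder, so withSchemaName can still follow it.
        return new RecordBuilder()
                .withPageSize(4096)
                .withSchemaName("events")
                .withValidation(true)
                .describe();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the SELF parameter, withPageSize would return the base type and the call to withSchemaName after it would not compile.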
      • getWriteSupport

        protected abstract WriteSupport<T> getWriteSupport​(org.apache.hadoop.conf.Configuration conf)
        Parameters:
        conf - a configuration
        Returns:
        an appropriate WriteSupport for the object model.
      • withConf

        public SELF withConf​(org.apache.hadoop.conf.Configuration conf)
        Set the Configuration used by the constructed writer.
        Parameters:
        conf - a Configuration
        Returns:
        this builder for method chaining.
      • withWriteMode

        public SELF withWriteMode​(ParquetFileWriter.Mode mode)
        Set the write mode used when creating the backing file for this writer.
        Parameters:
        mode - a ParquetFileWriter.Mode
        Returns:
        this builder for method chaining.
      • withCompressionCodec

        public SELF withCompressionCodec​(org.apache.parquet.hadoop.metadata.CompressionCodecName codecName)
        Set the compression codec used by the constructed writer.
        Parameters:
        codecName - a CompressionCodecName
        Returns:
        this builder for method chaining.
      • withRowGroupSize

        @Deprecated
        public SELF withRowGroupSize​(int rowGroupSize)
        Deprecated. Use withRowGroupSize(long) instead.
        Set the Parquet format row group size used by the constructed writer.
        Parameters:
        rowGroupSize - an integer size in bytes
        Returns:
        this builder for method chaining.
      • withRowGroupSize

        public SELF withRowGroupSize​(long rowGroupSize)
        Set the Parquet format row group size used by the constructed writer.
        Parameters:
        rowGroupSize - a size in bytes, as a long
        Returns:
        this builder for method chaining.
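One practical difference between the int and long overloads is the range of sizes they can express. The sketch below shows how a byte count computed with int arithmetic wraps around at 2 GiB, while the same computation with long arithmetic does not. RowGroupSizeDemo and its methods are hypothetical helpers, not part of Parquet.

```java
// Hypothetical helper showing int overflow when computing sizes in bytes.
class RowGroupSizeDemo {
    // Mebibytes to bytes using long arithmetic; safe for large sizes.
    static long mebibytes(long n) {
        return n * 1024L * 1024L;
    }

    // The same computation in int arithmetic; wraps silently on overflow.
    static int mebibytesAsInt(int n) {
        return n * 1024 * 1024;
    }

    static boolean fitsInInt(long bytes) {
        return bytes <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(mebibytes(128));            // fits in an int
        System.out.println(fitsInInt(mebibytes(2048))); // 2 GiB does not fit
        System.out.println(mebibytesAsInt(2048));       // wrapped, negative value
    }
}
```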
      • withPageSize

        public SELF withPageSize​(int pageSize)
        Set the Parquet format page size used by the constructed writer.
        Parameters:
        pageSize - an integer size in bytes
        Returns:
        this builder for method chaining.
      • withPageRowCountLimit

        public SELF withPageRowCountLimit​(int rowCount)
        Sets the Parquet format page row count limit used by the constructed writer.
        Parameters:
        rowCount - limit for the number of rows stored in a page
        Returns:
        this builder for method chaining
      • withDictionaryPageSize

        public SELF withDictionaryPageSize​(int dictionaryPageSize)
        Set the Parquet format dictionary page size used by the constructed writer.
        Parameters:
        dictionaryPageSize - an integer size in bytes
        Returns:
        this builder for method chaining.
      • withMaxPaddingSize

        public SELF withMaxPaddingSize​(int maxPaddingSize)
        Set the maximum amount of padding, in bytes, that will be used to align row groups with blocks in the underlying filesystem. If the underlying filesystem is not a block filesystem like HDFS, this has no effect.
        Parameters:
        maxPaddingSize - an integer size in bytes
        Returns:
        this builder for method chaining.
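To make the role of the maximum padding size concrete, here is a rough sketch of one way such an alignment decision could work: pad up to the next block boundary only when the gap is no larger than the configured maximum. This is an assumption for illustration, not a description of the library's internal logic. PaddingDemo and paddingBefore are hypothetical names.

```java
// Hypothetical sketch of block alignment with a padding cap; not Parquet's implementation.
class PaddingDemo {
    // Returns how many padding bytes to write before starting the next row group.
    static long paddingBefore(long position, long blockSize, long maxPaddingSize) {
        long remaining = blockSize - (position % blockSize);
        if (remaining == blockSize) {
            return 0; // already on a block boundary
        }
        // Pad only when the gap to the boundary is within the allowed maximum.
        return remaining <= maxPaddingSize ? remaining : 0;
    }

    public static void main(String[] args) {
        System.out.println(paddingBefore(120, 128, 8)); // gap of 8, within the cap
        System.out.println(paddingBefore(100, 128, 8)); // gap of 28, too large
    }
}
```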
      • enableDictionaryEncoding

        public SELF enableDictionaryEncoding()
        Enables dictionary encoding for the constructed writer.
        Returns:
        this builder for method chaining.
      • withDictionaryEncoding

        public SELF withDictionaryEncoding​(boolean enableDictionary)
        Enable or disable dictionary encoding for the constructed writer.
        Parameters:
        enableDictionary - whether dictionary encoding should be enabled
        Returns:
        this builder for method chaining.
      • withByteStreamSplitEncoding

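        public SELF withByteStreamSplitEncoding​(boolean enableByteStreamSplit)
        Enable or disable byte stream split encoding for the constructed writer.
        Parameters:
        enableByteStreamSplit - whether byte stream split encoding should be enabled
        Returns:
        this builder for method chaining.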
      • withDictionaryEncoding

        public SELF withDictionaryEncoding​(String columnPath,
                                           boolean enableDictionary)
        Enable or disable dictionary encoding of the specified column for the constructed writer.
        Parameters:
        columnPath - the path of the column (dot-string)
        enableDictionary - whether dictionary encoding should be enabled
        Returns:
        this builder for method chaining.
      • enableValidation

        public SELF enableValidation()
        Enables validation for the constructed writer.
        Returns:
        this builder for method chaining.
      • withValidation

        public SELF withValidation​(boolean enableValidation)
        Enable or disable validation for the constructed writer.
        Parameters:
        enableValidation - whether validation should be enabled
        Returns:
        this builder for method chaining.
      • withWriterVersion

        public SELF withWriterVersion​(org.apache.parquet.column.ParquetProperties.WriterVersion version)
        Set the format version used by the constructed writer.
        Parameters:
        version - a WriterVersion
        Returns:
        this builder for method chaining.
      • enablePageWriteChecksum

        public SELF enablePageWriteChecksum()
        Enables writing page-level checksums for the constructed writer.
        Returns:
        this builder for method chaining.
      • withPageWriteChecksumEnabled

        public SELF withPageWriteChecksumEnabled​(boolean enablePageWriteChecksum)
        Enable or disable writing page-level checksums for the constructed writer.
        Parameters:
        enablePageWriteChecksum - whether page checksums should be written out
        Returns:
        this builder for method chaining.
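As an illustration of what a page-level checksum provides, the sketch below computes a CRC-32 over a byte array with the JDK's java.util.zip.CRC32 and uses it to detect a changed byte. It assumes CRC-32 as the checksum and treats a plain byte array as the page. PageChecksumDemo and its methods are hypothetical names, not part of the builder API.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: checksum a "page" of bytes and verify it later.
class PageChecksumDemo {
    static long crc32(byte[] pageBytes) {
        CRC32 crc = new CRC32();
        crc.update(pageBytes, 0, pageBytes.length);
        return crc.getValue();
    }

    // True when the bytes still match the checksum recorded at write time.
    static boolean verify(byte[] pageBytes, long expected) {
        return crc32(pageBytes) == expected;
    }

    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4};
        long written = crc32(page);
        page[0] = 9; // simulate corruption
        System.out.println(verify(page, written));
    }
}
```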
      • withBloomFilterNDV

        public SELF withBloomFilterNDV​(String columnPath,
                                       long ndv)
        Sets the bloom filter NDV (expected number of distinct values) for the specified column.
        Parameters:
        columnPath - the path of the column (dot-string)
        ndv - the expected number of distinct values in the column
        Returns:
        this builder for method chaining.
      • withBloomFilterFPP

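        public SELF withBloomFilterFPP​(String columnPath,
                                       double fpp)
        Sets the bloom filter FPP (false positive probability) for the specified column.
        Parameters:
        columnPath - the path of the column (dot-string)
        fpp - the desired false positive probability
        Returns:
        this builder for method chaining.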
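To give a feel for how a number of distinct values (NDV) and a false positive probability (FPP) trade off against filter size, the sketch below applies the textbook sizing formula for a classic bloom filter, m = -n ln(p) / (ln 2)^2 bits. This is an assumption for illustration: the writer may size its filters differently. BloomSizingDemo and optimalBits are hypothetical names.

```java
// Hypothetical sketch using the classic bloom filter sizing formula; not Parquet's sizing code.
class BloomSizingDemo {
    // Bits needed for `ndv` distinct values at false positive probability `fpp`.
    static long optimalBits(long ndv, double fpp) {
        if (ndv <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("ndv must be > 0 and fpp in (0, 1)");
        }
        double ln2Squared = Math.log(2) * Math.log(2);
        return (long) Math.ceil(-ndv * Math.log(fpp) / ln2Squared);
    }

    public static void main(String[] args) {
        // A lower FPP or a higher NDV both require more bits.
        System.out.println(optimalBits(1_000_000L, 0.01));
        System.out.println(optimalBits(1_000_000L, 0.001));
    }
}
```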
      • withBloomFilterEnabled

        public SELF withBloomFilterEnabled​(boolean enabled)
        Enables or disables writing bloom filters by default, for columns that have no column-specific setting.
        Parameters:
        enabled - whether to write bloom filters
        Returns:
        this builder for method chaining
      • withBloomFilterEnabled

        public SELF withBloomFilterEnabled​(String columnPath,
                                           boolean enabled)
        Enables or disables the bloom filter for the specified column. If no value is set for the column specifically, the default enabled/disabled state applies. See withBloomFilterEnabled(boolean).
        Parameters:
        columnPath - the path of the column (dot-string)
        enabled - whether to write bloom filter for the column
        Returns:
        this builder for method chaining
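The interplay between a default setting and a per-column override, keyed by a dot-string column path, can be sketched as a map lookup with a fallback. ColumnFlagDemo and its methods are hypothetical names, and the initial default of false is an assumption made for this sketch only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-column flag that falls back to a default; not Parquet code.
class ColumnFlagDemo {
    private boolean defaultEnabled = false; // assumed default for the sketch
    private final Map<String, Boolean> perColumn = new HashMap<>();

    ColumnFlagDemo withEnabled(boolean enabled) {
        this.defaultEnabled = enabled;
        return this;
    }

    ColumnFlagDemo withEnabled(String columnPath, boolean enabled) {
        perColumn.put(columnPath, enabled);
        return this;
    }

    // The column-specific value wins; otherwise the default applies.
    boolean isEnabled(String columnPath) {
        return perColumn.getOrDefault(columnPath, defaultEnabled);
    }

    public static void main(String[] args) {
        // A dot-string joins nested field names with '.'.
        String path = String.join(".", "user", "id");
        ColumnFlagDemo flags = new ColumnFlagDemo().withEnabled(path, true);
        System.out.println(flags.isEnabled("user.id"));
        System.out.println(flags.isEnabled("user.name"));
    }
}
```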
      • withMinRowCountForPageSizeCheck

        public SELF withMinRowCountForPageSizeCheck​(int min)
        Sets the minimum number of rows to write before a page size check is done.
        Parameters:
        min - writes at least `min` rows before invoking a page size check
        Returns:
        this builder for method chaining
      • withMaxRowCountForPageSizeCheck

        public SELF withMaxRowCountForPageSizeCheck​(int max)
        Sets the maximum number of rows to write before a page size check is done.
        Parameters:
        max - makes a page size check after `max` rows have been written
        Returns:
        this builder for method chaining
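One way to picture how a minimum and a maximum row count bound the page size check is to clamp an estimate of "rows until the page is full" between the two. The sketch below is an assumption for illustration, not the writer's actual scheduling logic. PageSizeCheckDemo, nextCheckRowCount and the estimatedRowsToFill parameter are hypothetical.

```java
// Hypothetical sketch: bound the next page size check between min and max rows.
class PageSizeCheckDemo {
    // estimatedRowsToFill is an assumed estimate of rows needed to reach the page size.
    static int nextCheckRowCount(int estimatedRowsToFill, int min, int max) {
        // Never check before `min` rows, never wait longer than `max` rows.
        return Math.max(min, Math.min(estimatedRowsToFill, max));
    }

    public static void main(String[] args) {
        System.out.println(nextCheckRowCount(10, 100, 10_000));     // raised to min
        System.out.println(nextCheckRowCount(500, 100, 10_000));    // used as-is
        System.out.println(nextCheckRowCount(50_000, 100, 10_000)); // capped at max
    }
}
```

A low minimum catches pages of very large rows early, while the maximum keeps a page of very small rows from growing unchecked.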
      • withColumnIndexTruncateLength

        public SELF withColumnIndexTruncateLength​(int length)
        Sets the length to be used for truncating binary values in a binary column index.
        Parameters:
        length - the length to truncate to
        Returns:
        this builder for method chaining
      • withStatisticsTruncateLength

        public SELF withStatisticsTruncateLength​(int length)
        Sets the length which the min/max binary values in row groups are truncated to.
        Parameters:
        length - the length to truncate to
        Returns:
        this builder for method chaining
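Truncating min/max values has to preserve their meaning as bounds: a truncated minimum must not exceed the original, and a truncated maximum must not fall below it. The sketch below shows one common way to do this for byte strings compared as unsigned bytes. It is an illustration under that assumption, not the library's implementation. TruncateDemo and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch of bound-preserving truncation; not Parquet's implementation.
class TruncateDemo {
    // A prefix sorts at or before the original, so it is still a valid lower bound.
    static byte[] truncateMin(byte[] value, int length) {
        return value.length <= length ? value.clone() : Arrays.copyOf(value, length);
    }

    // A plain prefix could sort before the original, so bump it to stay an upper bound.
    static byte[] truncateMax(byte[] value, int length) {
        if (value.length <= length) {
            return value.clone();
        }
        byte[] prefix = Arrays.copyOf(value, length);
        // Increment the last byte that is not 0xFF and drop everything after it.
        for (int i = length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                prefix[i]++;
                return Arrays.copyOf(prefix, i + 1);
            }
        }
        // Every prefix byte is 0xFF: no shorter upper bound exists, keep the original.
        return value.clone();
    }

    public static void main(String[] args) {
        byte[] v = {'p', 'a', 'r', 'q', 'u', 'e', 't'};
        System.out.println(new String(truncateMin(v, 3))); // lower bound
        System.out.println(new String(truncateMax(v, 3))); // upper bound
    }
}
```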
      • config

        public SELF config​(String property,
                           String value)
        Set a property that will be available to the write path. For writers that use a Hadoop configuration, this is the recommended way to add configuration values.
        Parameters:
        property - a String property name
        value - a String property value
        Returns:
        this builder for method chaining.