Class ParquetFileWriter

  • public class ParquetFileWriter
    extends Object
    Internal implementation of the Parquet file writer as a block container
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  ParquetFileWriter.Mode  
    • Constructor Summary

      Constructor Description
      ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file)
      Deprecated. will be removed in 2.0.0
      ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode)
      Deprecated. will be removed in 2.0.0
      ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
      Deprecated. will be removed in 2.0.0
      ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
      Deprecated. will be removed in 2.0.0
      ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled)
      ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, FileEncryptionProperties encryptionProperties)
      ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, InternalFileEncryptor encryptor)
    • Method Summary

      Modifier and Type Method Description
      void appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor, org.apache.parquet.io.SeekableInputStream from, ColumnChunkMetaData chunk, org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter, org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex, org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex)
      void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file)
      Deprecated. will be removed in 2.0.0; use appendFile(InputFile) instead
      void appendFile(org.apache.parquet.io.InputFile file)
      void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, BlockMetaData rowGroup, boolean dropColumns)
      Deprecated. will be removed in 2.0.0; use appendRowGroup(SeekableInputStream,BlockMetaData,boolean) instead
      void appendRowGroup(org.apache.parquet.io.SeekableInputStream from, BlockMetaData rowGroup, boolean dropColumns)
      void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns)
      Deprecated. will be removed in 2.0.0; use appendRowGroups(SeekableInputStream,List,boolean) instead
      void appendRowGroups(org.apache.parquet.io.SeekableInputStream file, List<BlockMetaData> rowGroups, boolean dropColumns)
      void end​(Map<String,​String> extraMetaData)
      ends a file once all blocks have been written.
      void endBlock()
      ends a block once all column chunks have been written
      void endColumn()
      ends a column once all repetition levels, definition levels and data have been written
      InternalFileEncryptor getEncryptor()  
      ParquetMetadata getFooter()  
      long getNextRowGroupSize()  
      long getPos()  
      static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf)
      Deprecated. metadata files are not recommended and will be removed in 2.0.0
      static ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy)
      Deprecated. metadata files are not recommended and will be removed in 2.0.0
      void start()
      start the file
      void startBlock​(long recordCount)
      start a block
      void startColumn​(org.apache.parquet.column.ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName)
      start a column inside a block
      void writeDataPage​(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
      void writeDataPage​(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
      Writes a single page
      void writeDataPage​(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding, org.apache.parquet.format.BlockCipher.Encryptor metadataBlockEncryptor, byte[] pageHeaderAAD)
      Writes a single page
      void writeDataPage​(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
      Deprecated. this method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead
      void writeDataPage​(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding, org.apache.parquet.format.BlockCipher.Encryptor metadataBlockEncryptor, byte[] pageHeaderAAD)
      writes a single page
      void writeDataPageV2​(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, org.apache.parquet.column.statistics.Statistics<?> statistics)
      Writes a single v2 data page
      void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage)
      writes a dictionary page
      void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage, org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD)
      static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf)
      Deprecated. metadata files are not recommended and will be removed in 2.0.0
      static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers)
      Deprecated. metadata files are not recommended and will be removed in 2.0.0
      static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<Footer> footers, ParquetOutputFormat.JobSummaryLevel level)
      Deprecated. metadata files are not recommended and will be removed in 2.0.0
    • Constructor Detail

      • ParquetFileWriter

        public ParquetFileWriter​(org.apache.hadoop.conf.Configuration configuration,
                                 org.apache.parquet.schema.MessageType schema,
                                 org.apache.hadoop.fs.Path file)
                          throws IOException
        Deprecated. will be removed in 2.0.0
        configuration - Hadoop configuration
        schema - the schema of the data
        file - the file to write to
        IOException - if the file can not be created
      • ParquetFileWriter

        public ParquetFileWriter​(org.apache.hadoop.conf.Configuration configuration,
                                 org.apache.parquet.schema.MessageType schema,
                                 org.apache.hadoop.fs.Path file,
                                 ParquetFileWriter.Mode mode)
                          throws IOException
        Deprecated. will be removed in 2.0.0
        configuration - Hadoop configuration
        schema - the schema of the data
        file - the file to write to
        mode - file creation mode
        IOException - if the file can not be created
      • ParquetFileWriter

        public ParquetFileWriter​(org.apache.hadoop.conf.Configuration configuration,
                                 org.apache.parquet.schema.MessageType schema,
                                 org.apache.hadoop.fs.Path file,
                                 ParquetFileWriter.Mode mode,
                                 long rowGroupSize,
                                 int maxPaddingSize)
                          throws IOException
        Deprecated. will be removed in 2.0.0
        configuration - Hadoop configuration
        schema - the schema of the data
        file - the file to write to
        mode - file creation mode
        rowGroupSize - the row group size
        maxPaddingSize - the maximum padding
        IOException - if the file can not be created
      • ParquetFileWriter

        public ParquetFileWriter(org.apache.parquet.io.OutputFile file,
                                 org.apache.parquet.schema.MessageType schema,
                                 ParquetFileWriter.Mode mode,
                                 long rowGroupSize,
                                 int maxPaddingSize)
                          throws IOException
        Deprecated. will be removed in 2.0.0
        file - OutputFile to create or overwrite
        schema - the schema of the data
        mode - file creation mode
        rowGroupSize - the row group size
        maxPaddingSize - the maximum padding
        IOException - if the file can not be created
      • ParquetFileWriter

        public ParquetFileWriter(org.apache.parquet.io.OutputFile file,
                                 org.apache.parquet.schema.MessageType schema,
                                 ParquetFileWriter.Mode mode,
                                 long rowGroupSize,
                                 int maxPaddingSize,
                                 int columnIndexTruncateLength,
                                 int statisticsTruncateLength,
                                 boolean pageWriteChecksumEnabled)
                          throws IOException
        file - OutputFile to create or overwrite
        schema - the schema of the data
        mode - file creation mode
        rowGroupSize - the row group size
        maxPaddingSize - the maximum padding
        columnIndexTruncateLength - the length to which min/max values in column indexes are truncated, where possible
        statisticsTruncateLength - the length to which min/max values in row group statistics are truncated, where possible
        pageWriteChecksumEnabled - whether to write out page level checksums
        IOException - if the file can not be created
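        The two truncate-length parameters describe a best-effort shortening of min/max bounds. The sketch below shows one way such truncation can keep the bounds valid, assuming unsigned byte-wise ordering. The class and method names are hypothetical and are not part of the Parquet API; the writer's actual logic may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not a Parquet class): best-effort truncation of min/max bounds. */
final class BoundTruncationSketch {

    /** A prefix never sorts after the full value, so it remains a valid lower bound. */
    static byte[] truncateMin(byte[] min, int length) {
        if (min.length <= length) {
            return min; // already short enough
        }
        return Arrays.copyOf(min, length);
    }

    /**
     * A bare prefix of the max could sort before the real max, so the last
     * non-0xFF byte inside the prefix is incremented and the rest is dropped.
     * If every prefix byte is 0xFF, no shorter upper bound exists and the
     * value is returned unchanged (this is the "where possible" case).
     */
    static byte[] truncateMax(byte[] max, int length) {
        if (max.length <= length) {
            return max;
        }
        for (int i = length - 1; i >= 0; i--) {
            if ((max[i] & 0xFF) != 0xFF) {
                byte[] out = Arrays.copyOf(max, i + 1);
                out[i]++; // wraps correctly when read as an unsigned byte
                return out;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] value = "parquet".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(truncateMin(value, 3), StandardCharsets.US_ASCII)); // par
        System.out.println(new String(truncateMax(value, 3), StandardCharsets.US_ASCII)); // pas
    }
}
```

        With a length of 3, the value "parquet" yields the lower bound "par" and the upper bound "pas", both of which still bracket the original value.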
      • ParquetFileWriter

        public ParquetFileWriter(org.apache.parquet.io.OutputFile file,
                                 org.apache.parquet.schema.MessageType schema,
                                 ParquetFileWriter.Mode mode,
                                 long rowGroupSize,
                                 int maxPaddingSize,
                                 int columnIndexTruncateLength,
                                 int statisticsTruncateLength,
                                 boolean pageWriteChecksumEnabled,
                                 FileEncryptionProperties encryptionProperties)
                          throws IOException
      • ParquetFileWriter

        public ParquetFileWriter(org.apache.parquet.io.OutputFile file,
                                 org.apache.parquet.schema.MessageType schema,
                                 ParquetFileWriter.Mode mode,
                                 long rowGroupSize,
                                 int maxPaddingSize,
                                 int columnIndexTruncateLength,
                                 int statisticsTruncateLength,
                                 boolean pageWriteChecksumEnabled,
                                 InternalFileEncryptor encryptor)
                          throws IOException
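        Several constructors above take pageWriteChecksumEnabled, which controls whether page-level checksums are written. As a rough illustration of what a page checksum involves, the sketch below computes and verifies a CRC-32 over a page's bytes with the JDK's java.util.zip.CRC32. The class name and the choice of which bytes are covered are assumptions made for illustration, not a description of the writer's internals.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper (not a Parquet class): CRC-32 over the bytes of one page. */
final class PageChecksumSketch {

    /** Computes the CRC-32 of the page bytes and narrows it to a 32-bit int. */
    static int checksum(byte[] pageBytes) {
        CRC32 crc = new CRC32();
        crc.update(pageBytes, 0, pageBytes.length);
        return (int) crc.getValue();
    }

    /** A reader-side check: recompute the checksum and compare with the stored one. */
    static boolean verify(byte[] pageBytes, int storedChecksum) {
        return checksum(pageBytes) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] page = "123456789".getBytes(StandardCharsets.US_ASCII);
        int stored = checksum(page);
        System.out.println(Integer.toHexString(stored)); // cbf43926
        page[0] ^= 1; // simulate a corrupted byte
        System.out.println(verify(page, stored)); // false
    }
}
```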
    • Method Detail

      • start

        public void start()
                   throws IOException
        start the file
        IOException - if there is an error while writing
      • startBlock

        public void startBlock​(long recordCount)
                        throws IOException
        start a block
        recordCount - the record count in this block
        IOException - if there is an error while writing
      • startColumn

        public void startColumn​(org.apache.parquet.column.ColumnDescriptor descriptor,
                                long valueCount,
                                org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName)
                         throws IOException
        start a column inside a block
        descriptor - the column descriptor
        valueCount - the value count in this column
        compressionCodecName - a compression codec name
        IOException - if there is an error while writing
      • writeDictionaryPage

        public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage)
                                 throws IOException
        writes a dictionary page
        dictionaryPage - the dictionary page
        IOException - if there is an error while writing
      • writeDictionaryPage

        public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage,
                                        org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor,
                                        byte[] AAD)
                                 throws IOException
      • writeDataPage

        public void writeDataPage​(int valueCount,
                                  int uncompressedPageSize,
                                  org.apache.parquet.bytes.BytesInput bytes,
                                  org.apache.parquet.column.Encoding rlEncoding,
                                  org.apache.parquet.column.Encoding dlEncoding,
                                  org.apache.parquet.column.Encoding valuesEncoding)
                           throws IOException
        writes a single page
        valueCount - count of values
        uncompressedPageSize - the size of the data once uncompressed
        bytes - the compressed data for the page without header
        rlEncoding - encoding of the repetition level
        dlEncoding - encoding of the definition level
        valuesEncoding - encoding of values
        IOException - if there is an error while writing
      • writeDataPage

        public void writeDataPage​(int valueCount,
                                  int uncompressedPageSize,
                                  org.apache.parquet.bytes.BytesInput bytes,
                                  org.apache.parquet.column.statistics.Statistics statistics,
                                  org.apache.parquet.column.Encoding rlEncoding,
                                  org.apache.parquet.column.Encoding dlEncoding,
                                  org.apache.parquet.column.Encoding valuesEncoding)
                           throws IOException
        Deprecated. this method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead
        writes a single page
        valueCount - count of values
        uncompressedPageSize - the size of the data once uncompressed
        bytes - the compressed data for the page without header
        statistics - statistics for the page
        rlEncoding - encoding of the repetition level
        dlEncoding - encoding of the definition level
        valuesEncoding - encoding of values
        IOException - if there is an error while writing
      • writeDataPage

        public void writeDataPage​(int valueCount,
                                  int uncompressedPageSize,
                                  org.apache.parquet.bytes.BytesInput bytes,
                                  org.apache.parquet.column.statistics.Statistics statistics,
                                  long rowCount,
                                  org.apache.parquet.column.Encoding rlEncoding,
                                  org.apache.parquet.column.Encoding dlEncoding,
                                  org.apache.parquet.column.Encoding valuesEncoding)
                           throws IOException
        Writes a single page
        valueCount - count of values
        uncompressedPageSize - the size of the data once uncompressed
        bytes - the compressed data for the page without header
        statistics - the statistics of the page
        rowCount - the number of rows in the page
        rlEncoding - encoding of the repetition level
        dlEncoding - encoding of the definition level
        valuesEncoding - encoding of values
        IOException - if any I/O error occurs during writing the file
      • writeDataPage

        public void writeDataPage​(int valueCount,
                                  int uncompressedPageSize,
                                  org.apache.parquet.bytes.BytesInput bytes,
                                  org.apache.parquet.column.statistics.Statistics statistics,
                                  long rowCount,
                                  org.apache.parquet.column.Encoding rlEncoding,
                                  org.apache.parquet.column.Encoding dlEncoding,
                                  org.apache.parquet.column.Encoding valuesEncoding,
                                  org.apache.parquet.format.BlockCipher.Encryptor metadataBlockEncryptor,
                                  byte[] pageHeaderAAD)
                           throws IOException
        Writes a single page
        valueCount - count of values
        uncompressedPageSize - the size of the data once uncompressed
        bytes - the compressed data for the page without header
        statistics - the statistics of the page
        rowCount - the number of rows in the page
        rlEncoding - encoding of the repetition level
        dlEncoding - encoding of the definition level
        valuesEncoding - encoding of values
        metadataBlockEncryptor - encryptor for block data
        pageHeaderAAD - pageHeader AAD
        IOException - if any I/O error occurs during writing the file
      • writeDataPage

        public void writeDataPage​(int valueCount,
                                  int uncompressedPageSize,
                                  org.apache.parquet.bytes.BytesInput bytes,
                                  org.apache.parquet.column.statistics.Statistics statistics,
                                  org.apache.parquet.column.Encoding rlEncoding,
                                  org.apache.parquet.column.Encoding dlEncoding,
                                  org.apache.parquet.column.Encoding valuesEncoding,
                                  org.apache.parquet.format.BlockCipher.Encryptor metadataBlockEncryptor,
                                  byte[] pageHeaderAAD)
                           throws IOException
        writes a single page
        valueCount - count of values
        uncompressedPageSize - the size of the data once uncompressed
        bytes - the compressed data for the page without header
        statistics - statistics for the page
        rlEncoding - encoding of the repetition level
        dlEncoding - encoding of the definition level
        valuesEncoding - encoding of values
        metadataBlockEncryptor - encryptor for block data
        pageHeaderAAD - pageHeader AAD
        IOException - if there is an error while writing
      • writeDataPageV2

        public void writeDataPageV2​(int rowCount,
                                    int nullCount,
                                    int valueCount,
                                    org.apache.parquet.bytes.BytesInput repetitionLevels,
                                    org.apache.parquet.bytes.BytesInput definitionLevels,
                                    org.apache.parquet.column.Encoding dataEncoding,
                                    org.apache.parquet.bytes.BytesInput compressedData,
                                    int uncompressedDataSize,
                                    org.apache.parquet.column.statistics.Statistics<?> statistics)
                             throws IOException
        Writes a single v2 data page
        rowCount - count of rows
        nullCount - count of nulls
        valueCount - count of values
        repetitionLevels - repetition level bytes
        definitionLevels - definition level bytes
        dataEncoding - encoding for data
        compressedData - compressed data bytes
        uncompressedDataSize - the size of uncompressed data
        statistics - the statistics of the page
        IOException - if any I/O error occurs during writing the file
      • endColumn

        public void endColumn()
                       throws IOException
        ends a column once all repetition levels, definition levels and data have been written
        IOException - if there is an error while writing
      • endBlock

        public void endBlock()
                      throws IOException
        ends a block once all column chunks have been written
        IOException - if there is an error while writing
      • appendFile

        public void appendFile​(org.apache.hadoop.conf.Configuration conf,
                               org.apache.hadoop.fs.Path file)
                        throws IOException
        Deprecated. will be removed in 2.0.0; use appendFile(InputFile) instead
        conf - a configuration
        file - path of the file whose contents are appended to this file
        IOException - if there is an error while reading or writing
      • appendFile

        public void appendFile(org.apache.parquet.io.InputFile file)
                        throws IOException
      • appendRowGroups

        public void appendRowGroups​(org.apache.hadoop.fs.FSDataInputStream file,
                                    List<BlockMetaData> rowGroups,
                                    boolean dropColumns)
                             throws IOException
        Deprecated. will be removed in 2.0.0; use appendRowGroups(SeekableInputStream,List,boolean) instead
        file - a file stream to read from
        rowGroups - row groups to copy
        dropColumns - whether to drop columns from the file that are not in this file's schema
        IOException - if there is an error while reading or writing
      • appendRowGroup

        public void appendRowGroup(org.apache.parquet.io.SeekableInputStream from,
                                   BlockMetaData rowGroup,
                                   boolean dropColumns)
                            throws IOException
      • appendColumnChunk

        public void appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor,
                                      org.apache.parquet.io.SeekableInputStream from,
                                      ColumnChunkMetaData chunk,
                                      org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter,
                                      org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex,
                                      org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex)
                               throws IOException
        descriptor - the descriptor for the target column
        from - a file stream to read from
        chunk - the column chunk to be copied
        bloomFilter - the bloomFilter for this chunk
        columnIndex - the column index for this chunk
        offsetIndex - the offset index for this chunk
      • end

        public void end​(Map<String,​String> extraMetaData)
                 throws IOException
        ends a file once all blocks have been written, then closes the file.
        extraMetaData - the extra meta data to write in the footer
        IOException - if there is an error while writing
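        The descriptions of start, startBlock, startColumn, endColumn, endBlock and end above imply a nesting order: a file contains blocks, a block contains columns, and pages are written only inside a column. The sketch below models that ordering as a small state machine. It is a stand-alone illustration with hypothetical names and simplified, argument-free methods; it does not call the real writer, and the real class may enforce the order differently.

```java
/** Hypothetical model (not a Parquet class) of the call order implied by the method descriptions. */
final class WriterLifecycleSketch {

    enum State { NOT_STARTED, STARTED, BLOCK, COLUMN, ENDED }

    private State state = State.NOT_STARTED;

    /** Moves to the next state only if the call is legal in the current one. */
    private void transition(State expected, State next, String call) {
        if (state != expected) {
            throw new IllegalStateException(call + " is not allowed in state " + state);
        }
        state = next;
    }

    void start()         { transition(State.NOT_STARTED, State.STARTED, "start"); }
    void startBlock()    { transition(State.STARTED, State.BLOCK, "startBlock"); }
    void startColumn()   { transition(State.BLOCK, State.COLUMN, "startColumn"); }
    void writeDataPage() { transition(State.COLUMN, State.COLUMN, "writeDataPage"); }
    void endColumn()     { transition(State.COLUMN, State.BLOCK, "endColumn"); }
    void endBlock()      { transition(State.BLOCK, State.STARTED, "endBlock"); }
    void end()           { transition(State.STARTED, State.ENDED, "end"); }

    State state() { return state; }

    public static void main(String[] args) {
        WriterLifecycleSketch w = new WriterLifecycleSketch();
        w.start();
        w.startBlock();
        w.startColumn();
        w.writeDataPage();
        w.endColumn();
        w.endBlock();
        w.end();
        System.out.println(w.state()); // ENDED
    }
}
```

        Calling a method out of order, for example startColumn before startBlock, throws IllegalStateException in this model.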
      • mergeMetadataFiles

        public static ParquetMetadata mergeMetadataFiles​(List<org.apache.hadoop.fs.Path> files,
                                                         org.apache.hadoop.conf.Configuration conf)
                                                  throws IOException
        Deprecated. metadata files are not recommended and will be removed in 2.0.0
        Given a list of metadata files, merges them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetadata be exactly equal.
        files - a list of files to merge metadata from
        conf - a configuration
        Returns: merged parquet metadata for the files
        IOException - if there is an error while writing
      • mergeMetadataFiles

        public static ParquetMetadata mergeMetadataFiles​(List<org.apache.hadoop.fs.Path> files,
                                                         org.apache.hadoop.conf.Configuration conf,
                                                         KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy)
                                                  throws IOException
        Deprecated. metadata files are not recommended and will be removed in 2.0.0
        Given a list of metadata files, merges them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetadata be exactly equal.
        files - a list of files to merge metadata from
        conf - a configuration
        keyValueMetadataMergeStrategy - strategy to merge values for the same key, if there are multiple
        Returns: merged parquet metadata for the files
        IOException - if there is an error while writing
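        The two mergeMetadataFiles entries above mention a requirement that extra metadata be exactly equal and a strategy for merging multiple values of the same key. The sketch below contrasts those two ideas on plain maps: a strict merge that rejects conflicting values and a concatenating merge that joins them. All names are hypothetical and the code only illustrates the concept; it is not the KeyValueMetadataMergeStrategy interface.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a Parquet class): two ways to merge per-file key/value metadata. */
final class KeyValueMergeSketch {

    /** Collects every distinct value seen for each key across all footers. */
    static Map<String, Set<String>> collect(List<Map<String, String>> footers) {
        Map<String, Set<String>> values = new LinkedHashMap<>();
        for (Map<String, String> footer : footers) {
            for (Map.Entry<String, String> e : footer.entrySet()) {
                values.computeIfAbsent(e.getKey(), k -> new LinkedHashSet<>()).add(e.getValue());
            }
        }
        return values;
    }

    /** Strict merge: every key must carry exactly one distinct value. */
    static Map<String, String> mergeStrict(Map<String, Set<String>> values) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : values.entrySet()) {
            if (e.getValue().size() > 1) {
                throw new IllegalArgumentException("conflicting values for key " + e.getKey());
            }
            merged.put(e.getKey(), e.getValue().iterator().next());
        }
        return merged;
    }

    /** Concatenating merge: distinct values for a key are joined with a delimiter. */
    static Map<String, String> mergeConcatenating(Map<String, Set<String>> values, String delimiter) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : values.entrySet()) {
            merged.put(e.getKey(), String.join(delimiter, e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> footers = Arrays.asList(
                Collections.singletonMap("owner", "a"),
                Collections.singletonMap("owner", "b"));
        System.out.println(mergeConcatenating(collect(footers), ",")); // {owner=a,b}
    }
}
```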
      • writeMergedMetadataFile

        public static void writeMergedMetadataFile​(List<org.apache.hadoop.fs.Path> files,
                                                   org.apache.hadoop.fs.Path outputPath,
                                                   org.apache.hadoop.conf.Configuration conf)
                                            throws IOException
        Deprecated. metadata files are not recommended and will be removed in 2.0.0
        Given a list of metadata files, merges them into a single metadata file. Requires that the schemas be compatible and the extraMetaData be exactly equal. This is useful when merging two directories of parquet files into a single directory, as long as both directories were written with compatible schemas and equal extraMetaData.
        files - a list of files to merge metadata from
        outputPath - path to write merged metadata to
        conf - a configuration
        IOException - if there is an error while reading or writing
      • writeMetadataFile

        public static void writeMetadataFile​(org.apache.hadoop.conf.Configuration configuration,
                                             org.apache.hadoop.fs.Path outputPath,
                                             List<Footer> footers)
                                      throws IOException
        Deprecated. metadata files are not recommended and will be removed in 2.0.0
        writes a _metadata and _common_metadata file
        configuration - the configuration to use to get the FileSystem
        outputPath - the directory to write the _metadata file to
        footers - the list of footers to merge
        IOException - if there is an error while writing
      • writeMetadataFile

        public static void writeMetadataFile​(org.apache.hadoop.conf.Configuration configuration,
                                             org.apache.hadoop.fs.Path outputPath,
                                             List<Footer> footers,
                                             ParquetOutputFormat.JobSummaryLevel level)
                                      throws IOException
        Deprecated. metadata files are not recommended and will be removed in 2.0.0
        writes _common_metadata file, and optionally a _metadata file depending on the ParquetOutputFormat.JobSummaryLevel provided
        configuration - the configuration to use to get the FileSystem
        outputPath - the directory to write the _metadata file to
        footers - the list of footers to merge
        level - level of summary to write
        IOException - if there is an error while writing
      • getPos

        public long getPos()
                    throws IOException
        Returns: the current position in the underlying file
        IOException - if there is an error while getting the current stream's position