Class ParquetOutputFormat<T>

  • Type Parameters:
    T - the type of the materialized records
    Direct Known Subclasses:
    ExampleOutputFormat

    public class ParquetOutputFormat<T>
    extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,​T>
    OutputFormat to write to a Parquet file. It requires a WriteSupport to convert the actual records to the underlying format, and it requires the schema of the incoming records (provided by the write support). It allows storing extra metadata in the footer (for example, for schema compatibility purposes when converting from a different schema language). The format configuration settings in the job configuration:
     # The block size is the size of a row group being buffered in memory;
     # this limits the memory usage when writing.
     # Larger values will improve I/O when reading but consume more memory when writing
     parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
    
     # The page size is for compression. When reading, each page can be decompressed independently.
     # A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
     # If this value is too small, the compression will deteriorate
     parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
    
     # There is one dictionary page per column per row group when dictionary encoding is used.
     # The dictionary page size works like the page size, but for dictionary pages
     parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
    
     # The compression algorithm used to compress pages
     parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD. Default: UNCOMPRESSED. Supersedes mapred.output.compress*
    
     # The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
     # Usually provided by a specific ParquetOutputFormat subclass
     parquet.write.support.class= # fully qualified name
    
     # To enable/disable dictionary encoding
     parquet.enable.dictionary=true # false to disable dictionary encoding
    
     # To enable/disable summary metadata aggregation at the end of a MR job
     # The default is true (enabled)
     parquet.enable.summary-metadata=true # false to disable summary aggregation
    
     # Maximum size (in bytes) allowed as padding to align row groups
     # This is also the minimum size of a row group. Default: 8388608
     parquet.writer.max-padding=8388608 # 8 MB
     
    If parquet.compression is not set, the following properties are checked (FileOutputFormat behavior). Note that custom codecs are explicitly disallowed:
     mapred.output.compress=true
     mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec # the codec must be one of Snappy, GZip or LZO
     
    If none of these is set, the data is uncompressed.
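
    As a concrete illustration of the listing above, here is a minimal sketch that sets these properties directly on a Hadoop Configuration. The values shown are the documented defaults, except the compression codec, which is chosen arbitrarily:

     import org.apache.hadoop.conf.Configuration;

     public class ParquetJobSettings {
         public static Configuration configure() {
             Configuration conf = new Configuration();

             // Row group (block) size: larger groups improve read I/O
             // but consume more memory while writing.
             conf.setLong("parquet.block.size", 128 * 1024 * 1024);

             // Page and dictionary page sizes, as documented above.
             conf.setInt("parquet.page.size", 1024 * 1024);
             conf.setInt("parquet.dictionary.page.size", 1024 * 1024);

             // Page compression; supersedes mapred.output.compress*.
             conf.set("parquet.compression", "SNAPPY");

             // Dictionary encoding and end-of-job summary metadata.
             conf.setBoolean("parquet.enable.dictionary", true);
             conf.setBoolean("parquet.enable.summary-metadata", true);

             // Maximum padding (also the minimum row group size).
             conf.setInt("parquet.writer.max-padding", 8 * 1024 * 1024);

             return conf;
         }
     }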
    • Constructor Detail

      • ParquetOutputFormat

        public ParquetOutputFormat​(S writeSupport)
        Constructor used when this OutputFormat is wrapped in another one (in Pig, for example).
        Type Parameters:
        S - the Java write support type
        Parameters:
        writeSupport - the class used to convert the incoming records
      • ParquetOutputFormat

        public ParquetOutputFormat()
        Used when directly using the output format, configuring the write support implementation via parquet.write.support.class; both construction styles are sketched below.
        Type Parameters:
        S - the Java write support type
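
        A brief sketch of both construction styles. It assumes the parquet-hadoop example module's Group and GroupWriteSupport are on the classpath (an assumption; substitute your own WriteSupport in practice):

         import org.apache.hadoop.conf.Configuration;
         import org.apache.parquet.example.data.Group;
         import org.apache.parquet.hadoop.ParquetOutputFormat;
         import org.apache.parquet.hadoop.example.GroupWriteSupport;

         public class ConstructionSketch {
             public static void main(String[] args) {
                 // Wrapped usage: the enclosing framework (Pig, for
                 // example) supplies the write support explicitly.
                 ParquetOutputFormat<Group> wrapped =
                         new ParquetOutputFormat<>(new GroupWriteSupport());

                 // Direct usage: the no-arg constructor resolves the
                 // write support from parquet.write.support.class.
                 Configuration conf = new Configuration();
                 conf.set("parquet.write.support.class",
                          GroupWriteSupport.class.getName());
                 ParquetOutputFormat<Group> direct =
                         new ParquetOutputFormat<>();
             }
         }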
    • Method Detail

      • setWriteSupportClass

        public static void setWriteSupportClass​(org.apache.hadoop.mapreduce.Job job,
                                                Class<?> writeSupportClass)
      • setWriteSupportClass

        public static void setWriteSupportClass​(org.apache.hadoop.mapred.JobConf job,
                                                Class<?> writeSupportClass)
      • getWriteSupportClass

        public static Class<?> getWriteSupportClass​(org.apache.hadoop.conf.Configuration configuration)
      • setBlockSize

        public static void setBlockSize​(org.apache.hadoop.mapreduce.Job job,
                                        int blockSize)
      • setPageSize

        public static void setPageSize​(org.apache.hadoop.mapreduce.Job job,
                                       int pageSize)
      • setDictionaryPageSize

        public static void setDictionaryPageSize​(org.apache.hadoop.mapreduce.Job job,
                                                 int pageSize)
      • setCompression

        public static void setCompression​(org.apache.hadoop.mapreduce.Job job,
                                          org.apache.parquet.hadoop.metadata.CompressionCodecName compression)
      • setEnableDictionary

        public static void setEnableDictionary​(org.apache.hadoop.mapreduce.Job job,
                                               boolean enableDictionary)
      • getEnableDictionary

        public static boolean getEnableDictionary​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • getBloomFilterMaxBytes

        public static int getBloomFilterMaxBytes​(org.apache.hadoop.conf.Configuration conf)
      • getBloomFilterEnabled

        public static boolean getBloomFilterEnabled​(org.apache.hadoop.conf.Configuration conf)
      • getBlockSize

        public static int getBlockSize​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • getPageSize

        public static int getPageSize​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • getDictionaryPageSize

        public static int getDictionaryPageSize​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • getCompression

        public static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • isCompressionSet

        public static boolean isCompressionSet​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • setValidation

        public static void setValidation​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                         boolean validating)
      • getValidation

        public static boolean getValidation​(org.apache.hadoop.mapreduce.JobContext jobContext)
      • getEnableDictionary

        public static boolean getEnableDictionary​(org.apache.hadoop.conf.Configuration configuration)
      • getMinRowCountForPageSizeCheck

        public static int getMinRowCountForPageSizeCheck​(org.apache.hadoop.conf.Configuration configuration)
      • getMaxRowCountForPageSizeCheck

        public static int getMaxRowCountForPageSizeCheck​(org.apache.hadoop.conf.Configuration configuration)
      • getEstimatePageSizeCheck

        public static boolean getEstimatePageSizeCheck​(org.apache.hadoop.conf.Configuration configuration)
      • getBlockSize

        @Deprecated
        public static int getBlockSize​(org.apache.hadoop.conf.Configuration configuration)
        Deprecated.
      • getLongBlockSize

        public static long getLongBlockSize​(org.apache.hadoop.conf.Configuration configuration)
      • getPageSize

        public static int getPageSize​(org.apache.hadoop.conf.Configuration configuration)
      • getDictionaryPageSize

        public static int getDictionaryPageSize​(org.apache.hadoop.conf.Configuration configuration)
      • getWriterVersion

        public static org.apache.parquet.column.ParquetProperties.WriterVersion getWriterVersion​(org.apache.hadoop.conf.Configuration configuration)
      • getCompression

        public static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression​(org.apache.hadoop.conf.Configuration configuration)
      • isCompressionSet

        public static boolean isCompressionSet​(org.apache.hadoop.conf.Configuration configuration)
      • setValidation

        public static void setValidation​(org.apache.hadoop.conf.Configuration configuration,
                                         boolean validating)
      • getValidation

        public static boolean getValidation​(org.apache.hadoop.conf.Configuration configuration)
      • setMaxPaddingSize

        public static void setMaxPaddingSize​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                             int maxPaddingSize)
      • setMaxPaddingSize

        public static void setMaxPaddingSize​(org.apache.hadoop.conf.Configuration conf,
                                             int maxPaddingSize)
      • setColumnIndexTruncateLength

        public static void setColumnIndexTruncateLength​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                                        int length)
      • setColumnIndexTruncateLength

        public static void setColumnIndexTruncateLength​(org.apache.hadoop.conf.Configuration conf,
                                                        int length)
      • setStatisticsTruncateLength

        public static void setStatisticsTruncateLength​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                                       int length)
      • setPageRowCountLimit

        public static void setPageRowCountLimit​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                                int rowCount)
      • setPageRowCountLimit

        public static void setPageRowCountLimit​(org.apache.hadoop.conf.Configuration conf,
                                                int rowCount)
      • setPageWriteChecksumEnabled

        public static void setPageWriteChecksumEnabled​(org.apache.hadoop.mapreduce.JobContext jobContext,
                                                       boolean val)
      • setPageWriteChecksumEnabled

        public static void setPageWriteChecksumEnabled​(org.apache.hadoop.conf.Configuration conf,
                                                       boolean val)
      • getPageWriteChecksumEnabled

        public static boolean getPageWriteChecksumEnabled​(org.apache.hadoop.conf.Configuration conf)
      • getWriteSupport

        public WriteSupport<T> getWriteSupport​(org.apache.hadoop.conf.Configuration configuration)
        Parameters:
        configuration - the configuration from which the write support class is resolved
        Returns:
        the configured write support
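
        For context, a minimal sketch of the kind of WriteSupport this method instantiates. The record type (Integer), schema, and field name here are hypothetical:

         import java.util.HashMap;
         import org.apache.hadoop.conf.Configuration;
         import org.apache.parquet.hadoop.api.WriteSupport;
         import org.apache.parquet.io.api.RecordConsumer;
         import org.apache.parquet.schema.MessageType;
         import org.apache.parquet.schema.MessageTypeParser;

         // Hypothetical write support for records that are plain Integers.
         public class IntWriteSupport extends WriteSupport<Integer> {
             private final MessageType schema = MessageTypeParser
                     .parseMessageType("message record { required int32 value; }");
             private RecordConsumer consumer;

             @Override
             public WriteContext init(Configuration configuration) {
                 // Declares the schema of the incoming records and any
                 // extra metadata to store in the footer.
                 return new WriteContext(schema, new HashMap<String, String>());
             }

             @Override
             public void prepareForWrite(RecordConsumer recordConsumer) {
                 this.consumer = recordConsumer;
             }

             @Override
             public void write(Integer record) {
                 // Converts one record into record-consumer events.
                 consumer.startMessage();
                 consumer.startField("value", 0);
                 consumer.addInteger(record);
                 consumer.endField("value", 0);
                 consumer.endMessage();
             }
         }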
      • getOutputCommitter

        public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter​(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                       throws IOException
        Overrides:
        getOutputCommitter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,​T>
        Throws:
        IOException
      • getMemoryManager

        public static MemoryManager getMemoryManager()
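
        Finally, an end-to-end sketch wiring this output format into a MapReduce job using the static setters documented above. The output path and the IntWriteSupport class from the earlier sketch are assumptions, not part of this API:

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.mapreduce.Job;
         import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
         import org.apache.parquet.hadoop.ParquetOutputFormat;
         import org.apache.parquet.hadoop.metadata.CompressionCodecName;

         public class ParquetJobDriver {
             public static Job buildJob() throws Exception {
                 Job job = Job.getInstance(new Configuration(), "parquet-write");

                 // Static setters, rather than raw property keys.
                 ParquetOutputFormat.setWriteSupportClass(job, IntWriteSupport.class);
                 ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);
                 ParquetOutputFormat.setPageSize(job, 1024 * 1024);
                 ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
                 ParquetOutputFormat.setEnableDictionary(job, true);

                 job.setOutputFormatClass(ParquetOutputFormat.class);
                 FileOutputFormat.setOutputPath(job, new Path("/tmp/parquet-out"));
                 return job;
             }
         }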