Class ParquetInputFormat<T>

  • Type Parameters:
    T - the type of the materialized records
    Direct Known Subclasses:
    ExampleInputFormat

    public class ParquetInputFormat<T>
    extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,​T>
    The input format to read a Parquet file. It requires an implementation of ReadSupport to materialize the records. The requestedSchema controls how the original records are projected by the loader; it must be a subset of the original schema, and only the columns needed to reconstruct the records with the requestedSchema will be scanned.
    See Also:
    READ_SUPPORT_CLASS, UNBOUND_RECORD_FILTER, STRICT_TYPE_CHECKING, FILTER_PREDICATE, TASK_SIDE_METADATA
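A minimal job-setup sketch, assuming the bundled GroupReadSupport (from parquet-hadoop's example package) as the ReadSupport implementation; the job name and input path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ParquetJobSetup {
    public static Job configure(Configuration conf, String inputDir) throws Exception {
        Job job = Job.getInstance(conf, "parquet-read");
        // Read Parquet input; GroupReadSupport materializes records as Group objects.
        job.setInputFormatClass(ParquetInputFormat.class);
        ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
        return job;
    }
}
```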
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

        org.apache.hadoop.mapreduce.lib.input.FileInputFormat.Counter
    • Constructor Summary

      Constructors 
      Constructor Description
      ParquetInputFormat()
      Hadoop will instantiate this class via this constructor
      ParquetInputFormat​(Class<S> readSupportClass)
      Constructor for subclasses, such as AvroParquetInputFormat, or wrappers.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      org.apache.hadoop.mapreduce.RecordReader<Void,​T> createRecordReader​(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
      static org.apache.parquet.filter2.compat.FilterCompat.Filter getFilter​(org.apache.hadoop.conf.Configuration conf)
      Returns a non-null Filter, which is a wrapper around either a FilterPredicate, an UnboundRecordFilter, or a no-op filter.
      List<Footer> getFooters​(org.apache.hadoop.conf.Configuration configuration, Collection<org.apache.hadoop.fs.FileStatus> statuses)
      Returns the footers for the given files.
      List<Footer> getFooters​(org.apache.hadoop.conf.Configuration configuration, List<org.apache.hadoop.fs.FileStatus> statuses)  
      List<Footer> getFooters​(org.apache.hadoop.mapreduce.JobContext jobContext)  
      GlobalMetaData getGlobalMetaData​(org.apache.hadoop.mapreduce.JobContext jobContext)  
      static Class<?> getReadSupportClass​(org.apache.hadoop.conf.Configuration configuration)  
      static <T> ReadSupport<T> getReadSupportInstance​(org.apache.hadoop.conf.Configuration configuration)  
      List<ParquetInputSplit> getSplits​(org.apache.hadoop.conf.Configuration configuration, List<Footer> footers)
      Deprecated.
      split planning using file footers will be removed
      List<org.apache.hadoop.mapreduce.InputSplit> getSplits​(org.apache.hadoop.mapreduce.JobContext jobContext)
      static Class<?> getUnboundRecordFilter​(org.apache.hadoop.conf.Configuration configuration)
      Deprecated.
      protected boolean isSplitable​(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)  
      static boolean isTaskSideMetaData​(org.apache.hadoop.conf.Configuration configuration)  
      protected List<org.apache.hadoop.fs.FileStatus> listStatus​(org.apache.hadoop.mapreduce.JobContext jobContext)  
      static void setFilterPredicate​(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)  
      static void setReadSupportClass​(org.apache.hadoop.mapred.JobConf conf, Class<?> readSupportClass)  
      static void setReadSupportClass​(org.apache.hadoop.mapreduce.Job job, Class<?> readSupportClass)  
      static void setTaskSideMetaData​(org.apache.hadoop.mapreduce.Job job, boolean taskSideMetadata)  
      static void setUnboundRecordFilter​(org.apache.hadoop.mapreduce.Job job, Class<? extends org.apache.parquet.filter.UnboundRecordFilter> filterClass)  
      • Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

        addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
    • Field Detail

      • READ_SUPPORT_CLASS

        public static final String READ_SUPPORT_CLASS
        key to configure the ReadSupport implementation
        See Also:
        Constant Field Values
      • STRICT_TYPE_CHECKING

        public static final String STRICT_TYPE_CHECKING
        key to configure type checking for conflicting schemas (default: true)
        See Also:
        Constant Field Values
      • RECORD_FILTERING_ENABLED

        public static final String RECORD_FILTERING_ENABLED
        key to configure whether record-level filtering is enabled
        See Also:
        Constant Field Values
      • STATS_FILTERING_ENABLED

        public static final String STATS_FILTERING_ENABLED
        key to configure whether row group stats filtering is enabled
        See Also:
        Constant Field Values
      • DICTIONARY_FILTERING_ENABLED

        public static final String DICTIONARY_FILTERING_ENABLED
        key to configure whether row group dictionary filtering is enabled
        See Also:
        Constant Field Values
      • COLUMN_INDEX_FILTERING_ENABLED

        public static final String COLUMN_INDEX_FILTERING_ENABLED
        key to configure whether column index filtering of pages is enabled
        See Also:
        Constant Field Values
      • PAGE_VERIFY_CHECKSUM_ENABLED

        public static final String PAGE_VERIFY_CHECKSUM_ENABLED
        key to configure whether page level checksum verification is enabled
        See Also:
        Constant Field Values
      • BLOOM_FILTERING_ENABLED

        public static final String BLOOM_FILTERING_ENABLED
        key to configure whether row group bloom filtering is enabled
        See Also:
        Constant Field Values
      • TASK_SIDE_METADATA

        public static final String TASK_SIDE_METADATA
        key to turn task-side metadata loading on or off (default: true). If true, metadata is read on the task side and some tasks may finish immediately. If false, metadata is read on the client, which is slower when there is a lot of metadata, but tasks are only spawned if there is work to do.
        See Also:
        Constant Field Values
    • Constructor Detail

      • ParquetInputFormat

        public ParquetInputFormat()
        Hadoop will instantiate this class via this constructor
      • ParquetInputFormat

        public ParquetInputFormat​(Class<S> readSupportClass)
        Constructor for subclasses, such as AvroParquetInputFormat, or wrappers.

        Subclasses and wrappers may use this constructor to set the ReadSupport class that will be used when reading instead of requiring the user to set the read support property in their configuration.

        Type Parameters:
        S - the Java read support type
        Parameters:
        readSupportClass - a ReadSupport subclass
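A sketch of the pattern this constructor enables, mirroring what AvroParquetInputFormat does; the wrapper class name here is hypothetical:

```java
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

// Pins the ReadSupport so users need not set the read support
// property in their configuration themselves.
public class GroupParquetInputFormat extends ParquetInputFormat<Group> {
    public GroupParquetInputFormat() {
        super(GroupReadSupport.class);
    }
}
```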
    • Method Detail

      • setTaskSideMetaData

        public static void setTaskSideMetaData​(org.apache.hadoop.mapreduce.Job job,
                                               boolean taskSideMetadata)
      • isTaskSideMetaData

        public static boolean isTaskSideMetaData​(org.apache.hadoop.conf.Configuration configuration)
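A short sketch of setting and reading back the task-side metadata flag on a job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class TaskSideMetadataDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // Read footers on the task side (the default); some tasks may
        // finish immediately if their row groups are filtered out.
        ParquetInputFormat.setTaskSideMetaData(job, true);
        boolean taskSide = ParquetInputFormat.isTaskSideMetaData(job.getConfiguration());
        System.out.println("task-side metadata: " + taskSide);
    }
}
```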
      • setReadSupportClass

        public static void setReadSupportClass​(org.apache.hadoop.mapreduce.Job job,
                                               Class<?> readSupportClass)
      • setUnboundRecordFilter

        public static void setUnboundRecordFilter​(org.apache.hadoop.mapreduce.Job job,
                                                  Class<? extends org.apache.parquet.filter.UnboundRecordFilter> filterClass)
      • getUnboundRecordFilter

        @Deprecated
        public static Class<?> getUnboundRecordFilter​(org.apache.hadoop.conf.Configuration configuration)
        Deprecated.
        Parameters:
        configuration - a configuration
        Returns:
        an unbound record filter class
      • setReadSupportClass

        public static void setReadSupportClass​(org.apache.hadoop.mapred.JobConf conf,
                                               Class<?> readSupportClass)
      • getReadSupportClass

        public static Class<?> getReadSupportClass​(org.apache.hadoop.conf.Configuration configuration)
      • setFilterPredicate

        public static void setFilterPredicate​(org.apache.hadoop.conf.Configuration configuration,
                                              org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)
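A sketch of building a predicate with FilterApi and storing it in the configuration; the "age" column is a hypothetical INT32 column in the file being read:

```java
import static org.apache.parquet.filter2.predicate.FilterApi.gtEq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class PredicateDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Keep only rows where "age" is at least 18.
        FilterPredicate pred = gtEq(intColumn("age"), 18);
        ParquetInputFormat.setFilterPredicate(conf, pred);
        // getFilter never returns null: here it wraps the predicate we just set.
        FilterCompat.Filter filter = ParquetInputFormat.getFilter(conf);
        System.out.println(filter != null);
    }
}
```

Depending on how the job is configured, the same predicate can drive row-group stats filtering, dictionary filtering, and record-level filtering.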
      • getFilter

        public static org.apache.parquet.filter2.compat.FilterCompat.Filter getFilter​(org.apache.hadoop.conf.Configuration conf)
        Returns a non-null Filter, which is a wrapper around either a FilterPredicate, an UnboundRecordFilter, or a no-op filter.
        Parameters:
        conf - a configuration
        Returns:
        a non-null Filter wrapping the FilterPredicate or UnboundRecordFilter specified in conf, or a no-op filter if neither is set
      • createRecordReader

        public org.apache.hadoop.mapreduce.RecordReader<Void,​T> createRecordReader​(org.apache.hadoop.mapreduce.InputSplit inputSplit,
                                                                                         org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
                                                                                  throws IOException,
                                                                                         InterruptedException
        Specified by:
        createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<Void,​T>
        Throws:
        IOException
        InterruptedException
      • getReadSupportInstance

        public static <T> ReadSupport<T> getReadSupportInstance​(org.apache.hadoop.conf.Configuration configuration)
        Type Parameters:
        T - the Java type of objects created by the ReadSupport
        Parameters:
        configuration - to find the configuration for the read support
        Returns:
        the configured read support
      • isSplitable

        protected boolean isSplitable​(org.apache.hadoop.mapreduce.JobContext context,
                                      org.apache.hadoop.fs.Path filename)
        Overrides:
        isSplitable in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,​T>
      • getSplits

        public List<org.apache.hadoop.mapreduce.InputSplit> getSplits​(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                               throws IOException
        Overrides:
        getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,​T>
        Throws:
        IOException
      • getSplits

        @Deprecated
        public List<ParquetInputSplit> getSplits​(org.apache.hadoop.conf.Configuration configuration,
                                                 List<Footer> footers)
                                          throws IOException
        Deprecated.
        split planning using file footers will be removed
        Parameters:
        configuration - the configuration to connect to the file system
        footers - the footers of the files to read
        Returns:
        the splits for the footers
        Throws:
        IOException - if there is an error while reading
      • listStatus

        protected List<org.apache.hadoop.fs.FileStatus> listStatus​(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                            throws IOException
        Overrides:
        listStatus in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,​T>
        Throws:
        IOException
      • getFooters

        public List<Footer> getFooters​(org.apache.hadoop.mapreduce.JobContext jobContext)
                                throws IOException
        Parameters:
        jobContext - the current job context
        Returns:
        the footers for the files
        Throws:
        IOException - if there is an error while reading
      • getFooters

        public List<Footer> getFooters​(org.apache.hadoop.conf.Configuration configuration,
                                       List<org.apache.hadoop.fs.FileStatus> statuses)
                                throws IOException
        Throws:
        IOException
      • getFooters

        public List<Footer> getFooters​(org.apache.hadoop.conf.Configuration configuration,
                                       Collection<org.apache.hadoop.fs.FileStatus> statuses)
                                throws IOException
        Returns the footers for the given files.
        Parameters:
        configuration - to connect to the file system
        statuses - the files to open
        Returns:
        the footers of the files
        Throws:
        IOException - if there is an error while reading
      • getGlobalMetaData

        public GlobalMetaData getGlobalMetaData​(org.apache.hadoop.mapreduce.JobContext jobContext)
                                         throws IOException
        Parameters:
        jobContext - the current job context
        Returns:
        the merged metadata from the footers
        Throws:
        IOException - if there is an error while reading