Package org.apache.parquet.hadoop
Class ParquetInputFormat<T>
- java.lang.Object
-
- org.apache.hadoop.mapreduce.InputFormat<K,V>
-
- org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,T>
-
- org.apache.parquet.hadoop.ParquetInputFormat<T>
-
- Type Parameters:
  T - the type of the materialized records
- Direct Known Subclasses:
ExampleInputFormat
public class ParquetInputFormat<T> extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,T>
The input format to read a Parquet file. It requires an implementation of ReadSupport
to materialize the records. The requestedSchema controls how the original records get projected by the loader. It must be a subset of the original schema: only the columns needed to reconstruct the records with the requestedSchema will be scanned.
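A minimal sketch of wiring this input format into a MapReduce job. GroupReadSupport (from the parquet example module) and the job name are illustrative choices, not requirements of this class; any ReadSupport implementation will do:

```java
// Sketch: configuring a MapReduce job to read Parquet files.
// Requires the hadoop-mapreduce and parquet-hadoop jars on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ParquetJobSetup {
    public static Job configure(Configuration conf, Path input) throws Exception {
        Job job = Job.getInstance(conf, "read-parquet"); // job name is illustrative
        // ParquetInputFormat needs a ReadSupport to materialize records;
        // GroupReadSupport materializes generic Group records.
        job.setInputFormatClass(ParquetInputFormat.class);
        ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);
        FileInputFormat.addInputPath(job, input);
        return job;
    }
}
```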
-
-
Field Summary
Fields
static String BLOOM_FILTERING_ENABLED
  key to configure whether row group bloom filtering is enabled
static String COLUMN_INDEX_FILTERING_ENABLED
  key to configure whether column index filtering of pages is enabled
static String DICTIONARY_FILTERING_ENABLED
  key to configure whether row group dictionary filtering is enabled
static String FILTER_PREDICATE
  key to configure the filter predicate
static String PAGE_VERIFY_CHECKSUM_ENABLED
  key to configure whether page-level checksum verification is enabled
static String READ_SUPPORT_CLASS
  key to configure the ReadSupport implementation
static String RECORD_FILTERING_ENABLED
  key to configure whether record-level filtering is enabled
static String SPLIT_FILES
  key to turn off file splitting
static String STATS_FILTERING_ENABLED
  key to configure whether row group stats filtering is enabled
static String STRICT_TYPE_CHECKING
  key to configure type checking for conflicting schemas (default: true)
static String TASK_SIDE_METADATA
  key to turn task-side metadata loading on or off (default: true); if true, metadata is read on the task side and some tasks may finish immediately
static String UNBOUND_RECORD_FILTER
  key to configure the filter
-
Constructor Summary
Constructors
ParquetInputFormat()
  Hadoop will instantiate using this constructor.
ParquetInputFormat(Class<S> readSupportClass)
  Constructor for subclasses, such as AvroParquetInputFormat, or wrappers.
-
Method Summary
All Methods | Static Methods | Instance Methods | Concrete Methods | Deprecated Methods
org.apache.hadoop.mapreduce.RecordReader<Void,T>
  createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
static org.apache.parquet.filter2.compat.FilterCompat.Filter
  getFilter(org.apache.hadoop.conf.Configuration conf)
  Returns a non-null Filter, which is a wrapper around either a FilterPredicate, an UnboundRecordFilter, or a no-op filter.
List<Footer>
  getFooters(org.apache.hadoop.conf.Configuration configuration, Collection<org.apache.hadoop.fs.FileStatus> statuses)
  Returns the footers for the files.
List<Footer>
  getFooters(org.apache.hadoop.conf.Configuration configuration, List<org.apache.hadoop.fs.FileStatus> statuses)
List<Footer>
  getFooters(org.apache.hadoop.mapreduce.JobContext jobContext)
GlobalMetaData
  getGlobalMetaData(org.apache.hadoop.mapreduce.JobContext jobContext)
static Class<?>
  getReadSupportClass(org.apache.hadoop.conf.Configuration configuration)
static <T> ReadSupport<T>
  getReadSupportInstance(org.apache.hadoop.conf.Configuration configuration)
List<ParquetInputSplit>
  getSplits(org.apache.hadoop.conf.Configuration configuration, List<Footer> footers)
  Deprecated. Split planning using file footers will be removed.
List<org.apache.hadoop.mapreduce.InputSplit>
  getSplits(org.apache.hadoop.mapreduce.JobContext jobContext)
static Class<?>
  getUnboundRecordFilter(org.apache.hadoop.conf.Configuration configuration)
  Deprecated.
protected boolean
  isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
static boolean
  isTaskSideMetaData(org.apache.hadoop.conf.Configuration configuration)
protected List<org.apache.hadoop.fs.FileStatus>
  listStatus(org.apache.hadoop.mapreduce.JobContext jobContext)
static void
  setFilterPredicate(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)
static void
  setReadSupportClass(org.apache.hadoop.mapred.JobConf conf, Class<?> readSupportClass)
static void
  setReadSupportClass(org.apache.hadoop.mapreduce.Job job, Class<?> readSupportClass)
static void
  setTaskSideMetaData(org.apache.hadoop.mapreduce.Job job, boolean taskSideMetadata)
static void
  setUnboundRecordFilter(org.apache.hadoop.mapreduce.Job job, Class<? extends org.apache.parquet.filter.UnboundRecordFilter> filterClass)
-
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
-
-
-
-
Field Detail
-
READ_SUPPORT_CLASS
public static final String READ_SUPPORT_CLASS
key to configure the ReadSupport implementation
- See Also:
- Constant Field Values
-
UNBOUND_RECORD_FILTER
public static final String UNBOUND_RECORD_FILTER
key to configure the filter
- See Also:
- Constant Field Values
-
STRICT_TYPE_CHECKING
public static final String STRICT_TYPE_CHECKING
key to configure type checking for conflicting schemas (default: true)
- See Also:
- Constant Field Values
-
FILTER_PREDICATE
public static final String FILTER_PREDICATE
key to configure the filter predicate
- See Also:
- Constant Field Values
-
RECORD_FILTERING_ENABLED
public static final String RECORD_FILTERING_ENABLED
key to configure whether record-level filtering is enabled
- See Also:
- Constant Field Values
-
STATS_FILTERING_ENABLED
public static final String STATS_FILTERING_ENABLED
key to configure whether row group stats filtering is enabled
- See Also:
- Constant Field Values
-
DICTIONARY_FILTERING_ENABLED
public static final String DICTIONARY_FILTERING_ENABLED
key to configure whether row group dictionary filtering is enabled
- See Also:
- Constant Field Values
-
COLUMN_INDEX_FILTERING_ENABLED
public static final String COLUMN_INDEX_FILTERING_ENABLED
key to configure whether column index filtering of pages is enabled
- See Also:
- Constant Field Values
-
PAGE_VERIFY_CHECKSUM_ENABLED
public static final String PAGE_VERIFY_CHECKSUM_ENABLED
key to configure whether page-level checksum verification is enabled
- See Also:
- Constant Field Values
-
BLOOM_FILTERING_ENABLED
public static final String BLOOM_FILTERING_ENABLED
key to configure whether row group bloom filtering is enabled
- See Also:
- Constant Field Values
-
TASK_SIDE_METADATA
public static final String TASK_SIDE_METADATA
key to turn task-side metadata loading on or off (default: true). If true, metadata is read on the task side and some tasks may finish immediately. If false, metadata is read on the client, which is slower when there is a lot of metadata, but tasks will only be spawned if there is work to do.
- See Also:
- Constant Field Values
-
SPLIT_FILES
public static final String SPLIT_FILES
key to turn off file splitting. See PARQUET-246.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
ParquetInputFormat
public ParquetInputFormat()
Hadoop will instantiate using this constructor
-
ParquetInputFormat
public ParquetInputFormat(Class<S> readSupportClass)
Constructor for subclasses, such as AvroParquetInputFormat, or wrappers. Subclasses and wrappers may use this constructor to set the ReadSupport class that will be used when reading, instead of requiring the user to set the read support property in their configuration.
- Type Parameters:
  S - the Java read support type
- Parameters:
  readSupportClass - a ReadSupport subclass
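A sketch of a thin subclass that bakes in a ReadSupport via this constructor, mirroring how ExampleInputFormat wraps a read support so callers need not set the property themselves. The class name GroupParquetInputFormat is hypothetical:

```java
// Sketch: a subclass that fixes the ReadSupport at construction time.
// Requires the parquet-hadoop jar (example module) on the classpath.
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class GroupParquetInputFormat extends ParquetInputFormat<Group> {
    public GroupParquetInputFormat() {
        // Users of this format no longer need to configure READ_SUPPORT_CLASS.
        super(GroupReadSupport.class);
    }
}
```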
-
-
Method Detail
-
setTaskSideMetaData
public static void setTaskSideMetaData(org.apache.hadoop.mapreduce.Job job, boolean taskSideMetadata)
-
isTaskSideMetaData
public static boolean isTaskSideMetaData(org.apache.hadoop.conf.Configuration configuration)
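A small sketch of choosing where footer metadata is read during split planning, using the setter and getter above. The helper class name is illustrative:

```java
// Sketch: opting into client-side metadata reading for split planning.
// Requires the hadoop-mapreduce and parquet-hadoop jars on the classpath.
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class MetadataPlacement {
    public static void useClientSideMetadata(Job job) {
        // false = read footers on the client: slower when there are many
        // footers, but tasks are only spawned when there is work to do.
        ParquetInputFormat.setTaskSideMetaData(job, false);
        assert !ParquetInputFormat.isTaskSideMetaData(job.getConfiguration());
    }
}
```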
-
setReadSupportClass
public static void setReadSupportClass(org.apache.hadoop.mapreduce.Job job, Class<?> readSupportClass)
-
setUnboundRecordFilter
public static void setUnboundRecordFilter(org.apache.hadoop.mapreduce.Job job, Class<? extends org.apache.parquet.filter.UnboundRecordFilter> filterClass)
-
getUnboundRecordFilter
@Deprecated public static Class<?> getUnboundRecordFilter(org.apache.hadoop.conf.Configuration configuration)
Deprecated.
- Parameters:
  configuration - a configuration
- Returns:
  an unbound record filter class
-
setReadSupportClass
public static void setReadSupportClass(org.apache.hadoop.mapred.JobConf conf, Class<?> readSupportClass)
-
getReadSupportClass
public static Class<?> getReadSupportClass(org.apache.hadoop.conf.Configuration configuration)
-
setFilterPredicate
public static void setFilterPredicate(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)
-
getFilter
public static org.apache.parquet.filter2.compat.FilterCompat.Filter getFilter(org.apache.hadoop.conf.Configuration conf)
Returns a non-null Filter, which is a wrapper around either a FilterPredicate, an UnboundRecordFilter, or a no-op filter.
- Parameters:
  conf - a configuration
- Returns:
  a filter for the unbound record filter specified in conf
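A sketch of the round trip through setFilterPredicate and getFilter. The column name "age" and the threshold are illustrative; predicates pushed down this way let the reader skip row groups and pages that cannot match:

```java
// Sketch: setting a filter predicate and reading it back as a Filter.
// Requires the hadoop-common and parquet-hadoop jars on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class FilterSetup {
    public static FilterCompat.Filter buildFilter(Configuration conf) {
        // Keep rows where age > 18 (column name is hypothetical).
        FilterPredicate pred = FilterApi.gt(FilterApi.intColumn("age"), 18);
        ParquetInputFormat.setFilterPredicate(conf, pred);
        // getFilter never returns null: it wraps the predicate, an
        // UnboundRecordFilter, or a no-op filter.
        return ParquetInputFormat.getFilter(conf);
    }
}
```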
-
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader<Void,T> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException
- Specified by:
createRecordReader
in class org.apache.hadoop.mapreduce.InputFormat<Void,T>
- Throws:
IOException
InterruptedException
-
getReadSupportInstance
public static <T> ReadSupport<T> getReadSupportInstance(org.apache.hadoop.conf.Configuration configuration)
- Type Parameters:
  T - the Java type of objects created by the ReadSupport
- Parameters:
  configuration - to find the configuration for the read support
- Returns:
  the configured read support
-
isSplitable
protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path filename)
-
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException
- Overrides:
getSplits
in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,T>
- Throws:
IOException
-
getSplits
@Deprecated public List<ParquetInputSplit> getSplits(org.apache.hadoop.conf.Configuration configuration, List<Footer> footers) throws IOException
Deprecated. Split planning using file footers will be removed.
- Parameters:
  configuration - the configuration to connect to the file system
  footers - the footers of the files to read
- Returns:
  the splits for the footers
- Throws:
IOException
- if there is an error while reading
-
listStatus
protected List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException
- Overrides:
listStatus
in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,T>
- Throws:
IOException
-
getFooters
public List<Footer> getFooters(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException
- Parameters:
  jobContext - the current job context
- Returns:
  the footers for the files
- Throws:
IOException
- if there is an error while reading
-
getFooters
public List<Footer> getFooters(org.apache.hadoop.conf.Configuration configuration, List<org.apache.hadoop.fs.FileStatus> statuses) throws IOException
- Throws:
IOException
-
getFooters
public List<Footer> getFooters(org.apache.hadoop.conf.Configuration configuration, Collection<org.apache.hadoop.fs.FileStatus> statuses) throws IOException
Returns the footers for the files.
- Parameters:
  configuration - to connect to the file system
  statuses - the files to open
- Returns:
  the footers of the files
- Throws:
IOException
- if there is an error while reading
-
getGlobalMetaData
public GlobalMetaData getGlobalMetaData(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException
- Parameters:
  jobContext - the current job context
- Returns:
  the merged metadata from the footers
- Throws:
IOException
- if there is an error while reading
-
-