Same as TBaseRecordConverter, with one important (subtle) difference: it passes a repaired schema (StructType) to ThriftRecordConverter's constructor. This matters because older files don't contain all the metadata ThriftSchemaConverter needs to avoid throwing, but we can fill in dummy data there because it is never actually used.
The same as ParquetTBaseScheme, but sets the record converter to Parquet346TBaseRecordConverter.
Takes a ThriftType with potentially missing structOrUnionType metadata and returns a copy with all StructOrUnionType metadata set to UNION.
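A minimal sketch of that repair, using a toy schema model instead of the real parquet-thrift ThriftType classes (the names TypeModel, StructModel, and repair are hypothetical, not the actual API):

```scala
// Toy model of a Thrift schema tree -- the real code works on
// org.apache.parquet.thrift.struct.ThriftType; these names are made up.
sealed trait TypeModel
case object PrimitiveModel extends TypeModel
case class Field(name: String, tpe: TypeModel)
// structOrUnionType is the metadata that may be missing in older files
case class StructModel(fields: List[Field], structOrUnionType: Option[String]) extends TypeModel

// Recursively copy the tree, setting every struct's metadata to UNION.
// The dummy value is safe because downstream code never reads it.
def repair(t: TypeModel): TypeModel = t match {
  case StructModel(fields, _) =>
    StructModel(fields.map(f => f.copy(tpe = repair(f.tpe))), Some("UNION"))
  case primitive => primitive
}
```

Applied to a nested struct with missing metadata, every struct in the copy carries the dummy UNION marker while primitive fields pass through unchanged.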
When using these sources or creating subclasses of them, you can provide a filter predicate and / or a set of fields (columns) to keep (project).
The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).
For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
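The glob strings themselves are plain path expressions; a hedged illustration of the shape that document describes (the field names here are made up):

```scala
// Hypothetical column projection globs for a record with a nested `address`
// struct: `/` descends into nested fields and `*` matches every child field.
val columnGlobs: Set[String] = Set(
  "name",         // keep the top-level `name` field
  "address/zip",  // keep only `zip` inside the `address` struct
  "links/*"       // keep every field of the `links` struct
)
println(columnGlobs.mkString(";"))
```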
These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.
Here are two ways you can use these in a parquet source:
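One way is to override the members those traits define when declaring your source. The sketch below uses simplified stand-in traits so it is self-contained; the member names and types are assumptions (the real HasFilterPredicate exposes a parquet FilterPredicate, not a String):

```scala
// Simplified stand-ins for com.twitter.scalding.parquet.HasFilterPredicate
// and HasColumnProjection; names and types here are illustrative only.
trait HasFilterPredicate { def withFilter: Option[String] = None }
trait HasColumnProjection { def withColumnProjections: Set[String] = Set.empty }

// A source fixes its filter and projection by overriding the trait members.
class UserEventsSource extends HasFilterPredicate with HasColumnProjection {
  override def withFilter: Option[String] = Some("user_id == 42")
  override def withColumnProjections: Set[String] = Set("name", "address/zip")
}
```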
The other way is to add these as constructor arguments:
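A sketch of that constructor-argument style, again with simplified types (the real sources also take path / date-range arguments, and the filter is a parquet FilterPredicate rather than a String):

```scala
// Simplified sketch: the predicate and columns arrive as constructor
// arguments instead of overrides; the types are illustrative, not the real API.
class ParquetSourceSketch(
    val filter: Option[String],
    val columns: Set[String]
)

// Callers pick the filter and projection at construction time.
val src = new ParquetSourceSketch(Some("user_id == 42"), Set("name", "address/zip"))
```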