Package

com.twitter.scalding.parquet

thrift

package thrift

Type Members

  1. class DailySuffixParquetThrift[T <: ThriftBase] extends DailySuffixSource with ParquetThrift[T]

    When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).

    The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).

    For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records

    These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.

    Here are two ways you can use these in a Parquet source. The first is to override them when instantiating the source:

    class MyParquetSource(dr: DateRange) extends DailySuffixParquetThrift("/a/path", dr)
    
    val mySourceFilteredAndProjected = new MyParquetSource(dr) {
      override val withFilter: Option[FilterPredicate] = Some(myFp)
      override val withColumnProjections: Set[String] = Set("a.b.c", "x.y")
    }

    The other way is to add these as constructor arguments:

    class MyParquetSource(
      dr: DateRange,
      override val withFilter: Option[FilterPredicate] = None,
      override val withColumnProjections: Set[String] = Set()
    ) extends DailySuffixParquetThrift("/a/path", dr)
    
    val mySourceFilteredAndProjected = new MyParquetSource(dr, Some(myFp), Set("a.b.c", "x.y"))
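
    The examples above leave myFp undefined. As a minimal sketch, a predicate could be built with Parquet's FilterApi. This assumes a scalding-parquet version built against the org.apache.parquet packages (older releases used the un-prefixed parquet package). The object name MyFilters, the column names, and their types (a.b.c as a string field, x.y as an i32 field) are hypothetical and must be adapted to your Thrift schema:

    ```scala
    import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
    import org.apache.parquet.io.api.Binary

    object MyFilters {
      // Hypothetical columns: "a.b.c" is assumed to be a string field and
      // "x.y" an i32 field in the Thrift schema.
      val myFp: FilterPredicate = FilterApi.and(
        // keep records where a.b.c == "foo"
        FilterApi.eq(FilterApi.binaryColumn("a.b.c"), Binary.fromString("foo")),
        // ... and x.y > 10 (the value must be a boxed java.lang.Integer)
        FilterApi.gt(FilterApi.intColumn("x.y"), Int.box(10))
      )
    }
    ```

    MyFilters.myFp can then be passed as Some(MyFilters.myFp) to either form of MyParquetSource shown above.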
  2. class FixedPathParquetThrift[T <: ThriftBase] extends FixedPathSource with ParquetThrift[T]

  3. class HourlySuffixParquetThrift[T <: ThriftBase] extends HourlySuffixSource with ParquetThrift[T]

  4. class Parquet346TBaseRecordConverter[T <: TBase[_, _]] extends ThriftRecordConverter[T]

    Same as TBaseRecordConverter with one important (subtle) difference: it passes a repaired schema (StructType) to ThriftRecordConverter's constructor. This matters because older files don't contain all the metadata that ThriftSchemaConverter needs, which would otherwise cause it to throw. Dummy values can safely be filled in because that metadata is not actually used.

  5. class Parquet346TBaseScheme[T <: TBase[_, _]] extends ParquetTBaseScheme[T]

    The same as ParquetTBaseScheme, but sets the record converter to Parquet346TBaseRecordConverter.

  6. class ParquetTBaseScheme[T <: TBase[_, _]] extends ParquetValueScheme[T]

  7. trait ParquetThrift[T <: ThriftBase] extends FileSource with ParquetThriftBase[T]

  8. trait ParquetThriftBase[T] extends FileSource with SingleMappable[T] with TypedSink[T] with LocalTapSource with HasFilterPredicate with HasColumnProjection


Value Members

  1. object Parquet346StructTypeRepairer extends StateVisitor[ThriftType, Unit]

    Takes a ThriftType with potentially missing structOrUnionType metadata, and makes a copy that sets all StructOrUnionType metadata to UNION.

  2. object ParquetThrift extends Serializable
