Class

com.twitter.scalding.parquet.thrift

DailySuffixParquetThrift

class DailySuffixParquetThrift[T <: ThriftBase] extends DailySuffixSource with ParquetThrift[T]

When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).

The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).
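
As a rough illustration of what such a predicate might look like, the sketch below builds one with Parquet's FilterApi. The column names "age" and "country" are hypothetical, and the org.apache.parquet package prefix is an assumption (older Parquet releases used a different package), so adjust both to your schema and Parquet version.

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

object FilterSketch {
  // Hypothetical column "age": keep only records where age >= 18.
  val adults: FilterPredicate =
    FilterApi.gtEq(FilterApi.intColumn("age"), Int.box(18))

  // Hypothetical column "country": keep only records where country == "US".
  val inUs: FilterPredicate =
    FilterApi.eq(FilterApi.binaryColumn("country"), Binary.fromString("US"))

  // Both conditions must hold for a record to be kept.
  val myFp: FilterPredicate = FilterApi.and(adults, inUs)
}
```

A value such as FilterSketch.myFp is the kind of predicate the withFilter examples further down expect.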

For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
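
The linked format combines multiple globs with a ; character, whereas these sources take a Set[String] and do the joining for you. The sketch below only illustrates that idea in plain Scala. joinProjections is a hypothetical helper, not scalding's implementation, and both the sorting and the None result for an empty set are assumptions made for the example.

```scala
object ProjectionSketch {
  // Hypothetical helper: turns a set of projection globs into the single
  // ';'-separated string that the Parquet projection format describes.
  // Sorting is only here to make the output deterministic (an assumption).
  def joinProjections(globs: Set[String]): Option[String] =
    if (globs.isEmpty) None // assumption: no globs means no projection
    else Some(globs.toSeq.sorted.mkString(";"))
}

object ProjectionSketchDemo extends App {
  // The globs are the ones used in the examples on this page.
  println(ProjectionSketch.joinProjections(Set("a.b.c", "x.y")))
}
```

Passing the globs as a Set keeps each one separate in your code and avoids building the ;-delimited string by hand.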

These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.

Here are two ways you can use these in a parquet source:

class MyParquetSource(dr: DateRange) extends DailySuffixParquetThrift("/a/path", dr)

val mySourceFilteredAndProjected = new MyParquetSource(dr) {
  override val withFilter: Option[FilterPredicate] = Some(myFp)
  override val withColumnProjections: Set[String] = Set("a.b.c", "x.y")
}

The other way is to add these as constructor arguments:

class MyParquetSource(
  dr: DateRange,
  override val withFilter: Option[FilterPredicate] = None,
  override val withColumnProjections: Set[String] = Set()
) extends DailySuffixParquetThrift("/a/path", dr)

val mySourceFilteredAndProjected = new MyParquetSource(dr, Some(myFp), Set("a.b.c", "x.y"))
Linear Supertypes
ParquetThrift[T], ParquetThriftBase[T], HasColumnProjection, HasFilterPredicate, LocalTapSource, typed.TypedSink[T], SingleMappable[T], Mappable[T], typed.TypedSource[T], DailySuffixSource, TimePathedSource, TimeSeqPathedSource, FileSource, HfsTapProvider, LocalSourceOverride, SchemedSource, Source, Serializable, AnyRef, Any

Instance Constructors

  1. new DailySuffixParquetThrift(path: String, dateRange: DateRange)(implicit mf: Manifest[T])

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def allPaths: Iterable[String]

    Definition Classes
    TimeSeqPathedSource
  5. def allPathsFor(pattern: String): Iterable[String]

    Attributes
    protected
    Definition Classes
    TimeSeqPathedSource
  6. def andThen[U](fn: (T) ⇒ U): typed.TypedSource[U]

    Definition Classes
    TypedSource
  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def checkFlowDefNotNull()(implicit flowDef: FlowDef, mode: Mode): Unit

    Attributes
    protected
    Definition Classes
    Source
  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. final def columnProjectionString: Option[ColumnProjectionString]

    Parquet accepts globs separated by the ; character

    Attributes
    protected[com.twitter.scalding.parquet]
    Definition Classes
    HasColumnProjection
  11. def config: cascading.ParquetValueScheme.Config[T]

    Definition Classes
    ParquetThriftBase
  12. def contraMap[U](fn: (U) ⇒ T): typed.TypedSink[U]

    Definition Classes
    TypedSink
  13. def converter[U >: T]: TupleConverter[U]

    Definition Classes
    SingleMappable → TypedSource
  14. def createHdfsReadTap(hdfsMode: Hdfs): Tap[JobConf, _, _]

    Attributes
    protected
    Definition Classes
    FileSource
  15. def createHfsTap(scheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _], path: String, sinkMode: SinkMode): Hfs

    Definition Classes
    HfsTapProvider
  16. def createLocalTap(sinkMode: SinkMode): Tap[JobConf, _, _]

    Definition Classes
    LocalTapSource → LocalSourceOverride
  17. def createTap(readOrWrite: AccessMode)(implicit mode: Mode): Tap[_, _, _]

    Definition Classes
    FileSource → Source
  18. val dateRange: DateRange

    Definition Classes
    TimeSeqPathedSource
  19. def defaultDurationFor(pattern: String): Option[Duration]

    Attributes
    protected
    Definition Classes
    TimeSeqPathedSource
  20. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  21. def equals(that: Any): Boolean

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  22. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  23. final def flatMapTo[U](out: Fields)(mf: (T) ⇒ TraversableOnce[U])(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]): Pipe

    Definition Classes
    Mappable
  24. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  25. def getPathStatuses(conf: Configuration): Iterable[(String, Boolean)]

    Definition Classes
    TimeSeqPathedSource
  26. def goodHdfsPaths(hdfsMode: Hdfs): Iterable[String]

    Attributes
    protected
    Definition Classes
    FileSource
  27. def hashCode(): Int

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  28. def hdfsPaths: Iterable[String]

    Definition Classes
    TimeSeqPathedSource → FileSource
  29. def hdfsReadPathsAreGood(conf: Configuration): Boolean

    Definition Classes
    TimeSeqPathedSource → FileSource
  30. def hdfsScheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]

    Definition Classes
    ParquetThrift → SchemedSource
  31. def hdfsWritePath: String

    Definition Classes
    TimePathedSource → FileSource
  32. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  33. def localPaths: Iterable[String]

    Definition Classes
    TimePathedSource → LocalSourceOverride
  34. def localScheme: Scheme[Properties, InputStream, OutputStream, _, _]

    Definition Classes
    SchemedSource
  35. def localWritePath: String

    Definition Classes
    LocalSourceOverride
  36. final def mapTo[U](out: Fields)(mf: (T) ⇒ U)(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]): Pipe

    Definition Classes
    Mappable
  37. implicit val mf: Manifest[T]

  38. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  39. final def notify(): Unit

    Definition Classes
    AnyRef
  40. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  41. def pathIsGood(p: String, conf: Configuration): Boolean

    Attributes
    protected
    Definition Classes
    FileSource
  42. val pattern: String

    Definition Classes
    TimePathedSource
  43. val patterns: Seq[String]

    Definition Classes
    TimeSeqPathedSource
  44. def read(implicit flowDef: FlowDef, mode: Mode): Pipe

    Definition Classes
    Source
  45. def setter[U <: T]: TupleSetter[U]

    Definition Classes
    ParquetThriftBase → TypedSink
  46. def sinkFields: Fields

    Definition Classes
    TypedSink
  47. val sinkMode: SinkMode

    Definition Classes
    SchemedSource
  48. def sourceFields: Fields

    Definition Classes
    TypedSource
  49. def sourceId: String

    Definition Classes
    Source
  50. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  51. def toIterator(implicit config: Config, mode: Mode): Iterator[T]

    Definition Classes
    Mappable
  52. def toString(): String

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  53. def transformForRead(pipe: Pipe): Pipe

    Attributes
    protected
    Definition Classes
    Source
  54. def transformForWrite(pipe: Pipe): Pipe

    Attributes
    protected
    Definition Classes
    Source
  55. def transformInTest: Boolean

    Definition Classes
    Source
  56. val tz: TimeZone

    Definition Classes
    TimeSeqPathedSource
  57. def validateTaps(mode: Mode): Unit

    Definition Classes
    FileSource → Source
  58. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  59. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  61. def withColumnProjections: Set[String]

    The format for specifying columns is described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records

    Note that the format described there combines multiple globs with a ; character. Here you pass a Set instead, and its entries are joined on the ; character for you.

    Definition Classes
    HasColumnProjection
  62. def withFilter: Option[FilterPredicate]

    Definition Classes
    HasFilterPredicate
  63. def writeFrom(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Definition Classes
    Source

Deprecated Value Members

  1. def readAtSubmitter[T](implicit mode: Mode, conv: TupleConverter[T]): Stream[T]

    Definition Classes
    Source
    Annotations
    @deprecated
    Deprecated

    (Since version 0.9.0) replace with Mappable.toIterator

  2. def withColumns: Set[String]

    Deprecated. Use withColumnProjections, which uses a different glob syntax.

    The format for specifying columns is described here: https://github.com/apache/parquet-mr/blob/3df3372a1ee7b6ea74af89f53a614895b8078609/parquet_cascading.md#2-projection-pushdown (Note that this link is different from the one given for withColumnProjections.)

    Note that the format described there combines multiple globs with a ; character. Here you pass a Set instead, and its entries are joined on the ; character for you.

    Definition Classes
    HasColumnProjection
    Annotations
    @deprecated
    Deprecated

    (Since version 0.15.1) Use withColumnProjections, which uses a different glob syntax
