Class

com.spotify.scio.values

PairSkewedSCollectionFunctions

Related Doc: package values

class PairSkewedSCollectionFunctions[K, V] extends AnyRef

Extra functions available on SCollections of (key, value) pairs for skewed joins through an implicit conversion.

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new PairSkewedSCollectionFunctions(self: SCollection[(K, V)])

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  10. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  11. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  12. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  13. final def notify(): Unit

    Definition Classes
    AnyRef
  14. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  15. val self: SCollection[(K, V)]

  16. def skewedFullOuterJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Perform a skewed full outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it.

    cms

    Count-Min Sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.

    Example:
      import com.twitter.algebird.CMS
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedFullOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
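
The bound above implies a band of true frequencies in which a key may or may not be flagged as hot. The following is a minimal sketch of that arithmetic in plain Scala, with no Scio or Algebird dependency. The object name HotKeyBand, its helpers and the sample numbers are hypothetical, and the sketch assumes that a key is flagged when its estimate is at least hotKeyThreshold.

```scala
object HotKeyBand {
  // Largest possible overestimate of a key's frequency: eps * N (rounded up).
  def maxOverestimate(eps: Double, n: Long): Long =
    math.ceil(eps * n).toLong

  // Smallest true frequency that could still be flagged as hot.
  // Follows from estimate <= true frequency + eps * N, which holds
  // with probability at least 1 - delta.
  def lowestPossiblyHot(hotKeyThreshold: Long, eps: Double, n: Long): Long =
    math.max(0L, hotKeyThreshold - maxOverestimate(eps, n))

  def main(args: Array[String]): Unit = {
    val n = 1000000L // hypothetical size of the left hand side
    println(maxOverestimate(0.001, n))
    println(lowestPossiblyHot(8500L, 0.001, n))
  }
}
```

With eps = 0.001 and N = 1,000,000 the overestimate is at most 1000, so a threshold of 8500 may flag keys whose true frequency is as low as 7500, while every key with a true frequency of 8500 or more is flagged because the estimate never falls below the true frequency.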

  17. def skewedFullOuterJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1E-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit arg0: Coder[W], hasher: CMSHasher[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Perform a skewed full outer join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Left hand side sample fraction. The default, 1.0, means no sampling.

    withReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedFullOuterJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
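
The eps and delta parameters determine how large the sketch is. As a rough guide, the textbook Count-Min Sketch sizing uses ceil(e / eps) counters per row and ceil(ln(1 / delta)) rows. The sketch below applies those formulas in plain Scala. CmsSizing is a hypothetical helper, and whether Algebird sizes its CMS in exactly this way is an assumption.

```scala
object CmsSizing {
  // Counters per row in the textbook Count-Min Sketch: ceil(e / eps).
  def width(eps: Double): Int = {
    require(eps > 0 && eps < 1, "eps must lie in (0, 1)")
    math.ceil(math.E / eps).toInt
  }

  // Number of rows (hash functions): ceil(ln(1 / delta)).
  def depth(delta: Double): Int = {
    require(delta > 0 && delta < 1, "delta must lie in (0, 1)")
    math.ceil(math.log(1.0 / delta)).toInt
  }

  def main(args: Array[String]): Unit = {
    // Defaults from the signature above: eps = 0.001, delta = 1E-10.
    val (w, d) = (width(0.001), depth(1e-10))
    println(s"width = $w, depth = $d, counters = ${w.toLong * d}")
  }
}
```

Under these formulas the default eps = 0.001 and delta = 1E-10 give 2719 counters per row and 24 rows. Halving eps roughly doubles the width, whereas tightening delta only adds rows logarithmically.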

  18. def skewedJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Perform a skewed join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it.

    cms

    Count-Min Sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.

    Example:
      import com.twitter.algebird.CMS
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
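
To make the idea of a skew-proof join concrete, the plain-Scala sketch below splits the left hand side into hot and cold keys, joins the hot part record by record through an in-memory lookup of the right hand side, and joins the cold part by grouping on the key. It illustrates the general technique only and is not taken from Scio's implementation. SkewedJoinSketch is a hypothetical name, the exact key counts stand in for CMS estimates, and the >= comparison is an assumption.

```scala
object SkewedJoinSketch {
  def skewedJoin[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)],
      hotKeyThreshold: Long
  ): Seq[(K, (V, W))] = {
    // Exact key counts stand in for the CMS frequency estimates.
    val counts: Map[K, Long] =
      left.groupBy(_._1).map { case (k, kvs) => k -> kvs.size.toLong }
    val (hot, cold) =
      left.partition { case (k, _) => counts(k) >= hotKeyThreshold }

    // Right hand side indexed by key, as a small lookup table.
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }

    // Hot keys: look up each left record individually, so the values of
    // a hot key are never collected into one large group.
    val hotJoined = for {
      (k, v) <- hot
      w <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (v, w))

    // Cold keys: ordinary join that groups the left hand side by key.
    val coldJoined = for {
      (k, kvs) <- cold.groupBy(_._1).toSeq
      (_, v) <- kvs
      w <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (v, w))

    hotJoined ++ coldJoined
  }
}
```

Calling SkewedJoinSketch.skewedJoin(left, right, 3L) yields the same pairs as a plain inner join of left and right, possibly in a different order. Only the way the work is split between the two paths changes.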

  19. def skewedJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1E-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit arg0: Coder[W], hasher: CMSHasher[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Perform a skewed join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Left hand side sample fraction. The default, 1.0, means no sampling.

    withReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
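
When the left hand side is sampled, key counts in the sample shrink roughly in proportion to sampleFraction, so a threshold chosen for the full data needs to shrink with it. The following is a minimal sketch of that adjustment, assuming simple proportional scaling. SampledThreshold and scaledThreshold are hypothetical helpers, not part of Scio.

```scala
object SampledThreshold {
  // Scale a threshold chosen for the full data set down to the sampled one.
  def scaledThreshold(fullThreshold: Long, sampleFraction: Double): Long = {
    require(sampleFraction > 0.0 && sampleFraction <= 1.0,
      "sampleFraction must lie in (0, 1]")
    // Never go below 1, otherwise every sampled key would count as hot.
    math.max(1L, math.round(fullThreshold * sampleFraction))
  }

  def main(args: Array[String]): Unit = {
    // A threshold of 9000 on the full data, with a 10% sample.
    println(scaledThreshold(9000L, 0.1))
  }
}
```

The result would then be passed together with the same fraction, for instance logs.skewedJoin(logMetadata, hotKeyThreshold = SampledThreshold.scaledThreshold(9000L, 0.1), sampleFraction = 0.1). Sampling adds its own variance on top of the CMS error, so a somewhat lower threshold may be the safer choice.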

  20. def skewedLeftJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Perform a skewed left join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it.

    cms

    Count-Min Sketch of the left hand side keys, see com.twitter.algebird.CMSMonoid.

    Example:
      import com.twitter.algebird.CMS
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedLeftJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  21. def skewedLeftJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1E-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit arg0: Coder[W], hasher: CMSHasher[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Perform a skewed left join where some keys on the left hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of a key is estimated with a Count-Min Sketch (CMS): with probability at least 1 - delta, the estimate satisfies true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Number of values at which a key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K. Keep the upper estimation error (eps * N) in mind when choosing it. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Left hand side sample fraction. The default, 1.0, means no sampling.

    withReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedLeftJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
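
The three join flavors documented here differ in how unmatched keys surface in the result type: (V, W), (V, Option[W]) and (Option[V], Option[W]). The plain-Scala sketch below mimics those shapes on in-memory sequences to show which rows each flavor keeps. JoinShapes and its methods are hypothetical stand-ins, not Scio APIs.

```scala
object JoinShapes {
  private def byKey[K, A](xs: Seq[(K, A)]): Map[K, Seq[A]] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Inner join: only keys present on both sides survive.
  def join[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val rm = byKey(r)
    for {
      (k, v) <- l
      w <- rm.getOrElse(k, Seq.empty)
    } yield (k, (v, w))
  }

  // Left outer join: every left row is kept, unmatched rows get None.
  def leftJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
    val rm = byKey(r)
    l.flatMap { case (k, v) =>
      rm.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Option(w))))
        case None     => Seq((k, (v, Option.empty[W])))
      }
    }
  }

  // Full outer join: the left join plus right rows whose key is absent on the left.
  def fullOuterJoin[K, V, W](
      l: Seq[(K, V)],
      r: Seq[(K, W)]
  ): Seq[(K, (Option[V], Option[W]))] = {
    val leftKeys = l.map(_._1).toSet
    val fromLeft = leftJoin(l, r).map { case (k, (v, w)) => (k, (Option(v), w)) }
    val rightOnly = r.collect {
      case (k, w) if !leftKeys(k) => (k, (Option.empty[V], Option(w)))
    }
    fromLeft ++ rightOnly
  }
}
```

For l = Seq("a" -> 1, "b" -> 2) and r = Seq("a" -> "x", "c" -> "z"), the inner join keeps only key "a", the left join also keeps "b" paired with None, and the full outer join additionally keeps "c" with None on the left.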

  22. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  23. def toString(): String

    Definition Classes
    AnyRef → Any
  24. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  25. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
