Class

com.spotify.scio.values

PairSCollectionFunctions

class PairSCollectionFunctions[K, V] extends AnyRef

Extra functions available on SCollections of (key, value) pairs through an implicit conversion.

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new PairSCollectionFunctions(self: SCollection[(K, V)])(implicit ctKey: ClassTag[K], ctValue: ClassTag[V])

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def aggregateByKey[A, U](aggregator: Aggregator[V, A, U])(implicit arg0: ClassTag[A], arg1: ClassTag[U]): SCollection[(K, U)]

    Aggregate the values of each key with an Aggregator. Each value V is first mapped to A, the A values are then reduced with a Semigroup of A, and the result is finally presented as U. This form can be more powerful and better optimized in some cases.

  5. def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): SCollection[(K, U)]

    Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this SCollection, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
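
    The contract above can be pictured with a plain-Scala model over an in-memory Seq. This is only a sketch of the semantics, not Scio code: aggregateByKeyLocal and chunkSize are hypothetical names, with chunkSize standing in for however a runner happens to split the values of a key.

```scala
object AggregateByKeyModel {
  // Hypothetical local stand-in for aggregateByKey: the values of each key are
  // split into chunks, each chunk is folded with seqOp starting from zeroValue,
  // and the partial results are merged with combOp.
  def aggregateByKeyLocal[K, V, U](pairs: Seq[(K, V)], chunkSize: Int)(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val partials = kvs.map(_._2).grouped(chunkSize).map(_.foldLeft(zeroValue)(seqOp))
      k -> partials.reduce(combOp)
    }

  val pairs = Seq("a" -> 1, "a" -> 2, "a" -> 2, "b" -> 5)

  // The result type differs from the value type: V = Int, U = Set[Int].
  val distinctValues: Map[String, Set[Int]] =
    aggregateByKeyLocal(pairs, chunkSize = 2)(Set.empty[Int])(_ + _, _ ++ _)
}
```

    In this model the outcome does not depend on chunkSize as long as zeroValue is neutral and combOp is associative.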

  6. def approxQuantilesByKey(numQuantiles: Int)(implicit ord: Ordering[V]): SCollection[(K, Iterable[V])]

    For each key, compute the values' data distribution using approximate N-tiles.

    returns

    a new SCollection whose values are Iterables of the approximate N-tiles of the elements.

  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def asMapSideInput: SideInput[Map[K, V]]

    Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.

  9. def asMultiMapSideInput: SideInput[Map[K, Iterable[V]]]

    Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs. It is not required that the keys in the input collection be unique.

  10. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. def cogroup[W1, W2, W3](that1: SCollection[(K, W1)], that2: SCollection[(K, W2)], that3: SCollection[(K, W3)])(implicit arg0: ClassTag[W1], arg1: ClassTag[W2], arg2: ClassTag[W3]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

    For each key k in this or that1 or that2 or that3, return a resulting SCollection that contains a tuple with the list of values for that key in this, that1, that2 and that3.

  12. def cogroup[W1, W2](that1: SCollection[(K, W1)], that2: SCollection[(K, W2)])(implicit arg0: ClassTag[W1], arg1: ClassTag[W2]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

    For each key k in this or that1 or that2, return a resulting SCollection that contains a tuple with the list of values for that key in this, that1 and that2.

  13. def cogroup[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (Iterable[V], Iterable[W]))]

    For each key k in this or that, return a resulting SCollection that contains a tuple with the list of values for that key in this as well as that.
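
    To make the shape of the result concrete, here is a small plain-Scala model of the two-way cogroup on in-memory sequences. It is a sketch of the semantics only, not Scio code, and cogroupLocal is a hypothetical helper.

```scala
object CogroupModel {
  // Hypothetical local stand-in for cogroup: every key found on either side
  // maps to the values it has on the left and the values it has on the right.
  def cogroupLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  val left = Seq("a" -> 1, "a" -> 2, "b" -> 3)
  val right = Seq("a" -> "x", "c" -> "y")

  // "b" has no values on the right and "c" has none on the left,
  // so the corresponding side is empty rather than missing.
  val grouped: Map[String, (Seq[Int], Seq[String])] = cogroupLocal(left, right)
}
```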

  14. def combineByKey[C](createCombiner: (V) ⇒ C)(mergeValue: (C, V) ⇒ C)(mergeCombiners: (C, C) ⇒ C)(implicit arg0: ClassTag[C]): SCollection[(K, C)]

    Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an SCollection[(K, V)] into a result of type SCollection[(K, C)], for a "combined type" C. Note that V and C can be different: for example, one might group an SCollection of type (Int, Int) into an SCollection of type (Int, Seq[Int]). Users provide three functions:

    - createCombiner, which turns a V into a C (e.g., creates a one-element list)

    - mergeValue, to merge a V into a C (e.g., adds it to the end of a list)

    - mergeCombiners, to combine two C's into a single one.
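
    How the three functions fit together can be sketched with plain Scala collections. The model below is not Scio code: combineByKeyLocal and chunkSize are hypothetical, and the chunking merely imitates values of one key being combined in separate pieces before the pieces are merged.

```scala
object CombineByKeyModel {
  // Hypothetical local stand-in for combineByKey.
  def combineByKeyLocal[K, V, C](pairs: Seq[(K, V)], chunkSize: Int)(createCombiner: V => C)(
      mergeValue: (C, V) => C)(mergeCombiners: (C, C) => C): Map[K, C] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val partials = kvs.map(_._2).grouped(chunkSize).map { chunk =>
        // The first value of a chunk creates the combiner, the rest are merged into it.
        chunk.tail.foldLeft(createCombiner(chunk.head))(mergeValue)
      }
      // Combiners built from different chunks are merged pairwise.
      k -> partials.reduce(mergeCombiners)
    }

  val pairs = Seq(1 -> 10, 1 -> 20, 1 -> 30, 2 -> 40)

  // (Int, Int) pairs become (Int, List[Int]) pairs, as in the example in the text.
  val lists: Map[Int, List[Int]] =
    combineByKeyLocal(pairs, chunkSize = 2)(v => List(v))((c, v) => c :+ v)(_ ++ _)
}
```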

  15. def countApproxDistinctByKey(maximumEstimationError: Double = 0.02): SCollection[(K, Long)]

    Count approximate number of distinct values for each key in the SCollection.

    maximumEstimationError

    the maximum estimation error, which should be in the range [0.01, 0.5].

  16. def countApproxDistinctByKey(sampleSize: Int): SCollection[(K, Long)]

    Count approximate number of distinct values for each key in the SCollection.

    sampleSize

    the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16.

  17. def countByKey: SCollection[(K, Long)]

    Count the number of elements for each key.

    returns

    a new SCollection of (key, count) pairs

  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  20. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. def flatMapValues[U](f: (V) ⇒ TraversableOnce[U])(implicit arg0: ClassTag[U]): SCollection[(K, U)]

    Pass each value in the key-value pair SCollection through a flatMap function without changing the keys.
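
    A plain-Scala picture of this behavior, assuming an in-memory Seq in place of an SCollection (flatMapValuesLocal is a hypothetical helper, and f returns a Seq here for simplicity):

```scala
object FlatMapValuesModel {
  // Hypothetical local stand-in: expand each value and keep its key unchanged.
  def flatMapValuesLocal[K, V, U](pairs: Seq[(K, V)])(f: V => Seq[U]): Seq[(K, U)] =
    pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

  val lines = Seq("a" -> "x y", "b" -> "z")

  // Each line is split into words; every word keeps the key of its line.
  val words: Seq[(String, String)] = flatMapValuesLocal(lines)(_.split(" ").toSeq)
}
```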

  22. def foldByKey(implicit mon: Monoid[V]): SCollection[(K, V)]

    Fold by key with Monoid, which defines the associative function and "zero value" for V. This form can be more powerful and better optimized in some cases.

  23. def foldByKey(zeroValue: V)(op: (V, V) ⇒ V): SCollection[(K, V)]

    Merge the values for each key using an associative function and a neutral "zero value", which may be added to the result an arbitrary number of times and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).
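
    Why the zero value has to be neutral can be seen in a small plain-Scala model. This is a sketch, not Scio code: foldByKeyLocal and chunkSize are hypothetical, and the model deliberately applies zeroValue once per chunk and once more when merging, to imitate it being added an arbitrary number of times.

```scala
object FoldByKeyModel {
  // Hypothetical local stand-in for foldByKey(zeroValue)(op).
  def foldByKeyLocal[K, V](pairs: Seq[(K, V)], chunkSize: Int)(zeroValue: V)(
      op: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val partials = kvs.map(_._2).grouped(chunkSize).map(_.foldLeft(zeroValue)(op))
      k -> partials.foldLeft(zeroValue)(op)
    }

  val pairs = Seq("a" -> 2, "a" -> 3, "a" -> 4, "b" -> 5)

  // Neutral zero values: 0 for addition, 1 for multiplication.
  val sums = foldByKeyLocal(pairs, chunkSize = 1)(0)(_ + _)
  val products = foldByKeyLocal(pairs, chunkSize = 1)(1)(_ * _)

  // A non-neutral zero (1 for addition) makes the result depend on the chunking.
  val nonNeutralFine = foldByKeyLocal(pairs, chunkSize = 1)(1)(_ + _)
  val nonNeutralCoarse = foldByKeyLocal(pairs, chunkSize = 3)(1)(_ + _)
}
```

    With a neutral zero the sums and products are the same for any chunkSize; with the non-neutral zero the two results for key "a" disagree.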

  24. def fullOuterJoin[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (Option[V], Option[W]))]

    Perform a full outer join of this and that. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for w in that, or the pair (k, (Some(v), None)) if no elements in that have key k. Similarly, for each element (k, w) in that, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for v in this, or the pair (k, (None, Some(w))) if no elements in this have key k.
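
    The three cases described above (match on both sides, left only, right only) can be modeled with plain Scala collections. The helper fullOuterJoinLocal is hypothetical and works on in-memory sequences; it illustrates the output shape and is not Scio code.

```scala
object FullOuterJoinModel {
  // Hypothetical local stand-in for fullOuterJoin.
  def fullOuterJoinLocal[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]): Seq[(K, (Option[V], Option[W]))] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.flatMap { k =>
      val vs = left.filter(_._1 == k).map(_._2)
      val ws = right.filter(_._1 == k).map(_._2)
      if (vs.isEmpty) ws.map(w => (k, (Option.empty[V], Option(w))))      // right only
      else if (ws.isEmpty) vs.map(v => (k, (Option(v), Option.empty[W]))) // left only
      else for (v <- vs; w <- ws) yield (k, (Option(v), Option(w)))       // all matching pairs
    }
  }

  val left = Seq("a" -> 1, "b" -> 2)
  val right = Seq("a" -> "x", "a" -> "y", "c" -> "z")
  val joined = fullOuterJoinLocal(left, right)
}
```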

  25. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  26. def groupByKey: SCollection[(K, Iterable[V])]

    Group the values for each key in the SCollection into a single sequence. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.

    Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.

    Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.

  27. def groupWith[W1, W2, W3](that1: SCollection[(K, W1)], that2: SCollection[(K, W2)], that3: SCollection[(K, W3)])(implicit arg0: ClassTag[W1], arg1: ClassTag[W2], arg2: ClassTag[W3]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

    Alias for cogroup.

  28. def groupWith[W1, W2](that1: SCollection[(K, W1)], that2: SCollection[(K, W2)])(implicit arg0: ClassTag[W1], arg1: ClassTag[W2]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

    Alias for cogroup.

  29. def groupWith[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (Iterable[V], Iterable[W]))]

    Alias for cogroup.

  30. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  31. def hashJoin[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (V, W))]

    Perform an inner join by replicating that to all workers. The right side should be tiny and fit in memory.
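
    One way to picture why the right side has to fit in memory: the idea of a hash join is to turn the small side into a lookup table that every worker can consult while streaming over the large side. The plain-Scala sketch below models that idea on in-memory sequences; hashJoinLocal is hypothetical and this is not a description of how Scio implements the method.

```scala
object HashJoinModel {
  // Hypothetical local stand-in for hashJoin.
  def hashJoinLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    // The whole right side is held as an in-memory table (the "replicated" copy).
    val lookup: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    // The left side is only streamed over; keys without a match are dropped.
    left.flatMap { case (k, v) =>
      lookup.getOrElse(k, Seq.empty[W]).map(w => (k, (v, w)))
    }
  }

  val events = Seq("u1" -> "click", "u2" -> "view", "u1" -> "view", "u3" -> "click")
  val countries = Seq("u1" -> "SE", "u2" -> "US")
  val joined = hashJoinLocal(events, countries)
}
```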

  32. def hashLeftJoin[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (V, Option[W]))]

    Perform a left outer join by replicating that to all workers. The right side should be tiny and fit in memory.

  33. def intersectByKey(that: SCollection[K]): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are in that.

  34. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  35. def join[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (V, W))]

    Return an SCollection containing all pairs of elements with matching keys in this and that. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in that.
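
    As a quick illustration of the (k, (v1, v2)) output, the inner join can be modeled on in-memory sequences with a for comprehension. This is a sketch of the semantics with a hypothetical helper joinLocal, not Scio code.

```scala
object JoinModel {
  // Hypothetical local stand-in for join: one output element per matching pair.
  def joinLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  val left = Seq("a" -> 1, "a" -> 2, "b" -> 3)
  val right = Seq("a" -> "x", "c" -> "y")

  // Keys "b" and "c" appear on one side only, so they produce no output.
  val joined = joinLocal(left, right)
}
```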

  36. def keys: SCollection[K]

    Return an SCollection with the keys of each tuple.

  37. def leftOuterJoin[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (V, Option[W]))]

    Perform a left outer join of this and that. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (v, Some(w))) for w in that, or the pair (k, (v, None)) if no elements in that have key k.

  38. def mapValues[U](f: (V) ⇒ U)(implicit arg0: ClassTag[U]): SCollection[(K, U)]

    Pass each value in the key-value pair SCollection through a map function without changing the keys.

  39. def maxByKey(implicit ord: Ordering[V]): SCollection[(K, V)]

    Return the max of values for each key as defined by the implicit Ordering[V].

    returns

    a new SCollection of (key, maximum value) pairs

  40. def minByKey(implicit ord: Ordering[V]): SCollection[(K, V)]

    Return the min of values for each key as defined by the implicit Ordering[V].

    returns

    a new SCollection of (key, minimum value) pairs

  41. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  42. final def notify(): Unit

    Definition Classes
    AnyRef
  43. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  44. def reduceByKey(op: (V, V) ⇒ V): SCollection[(K, V)]

    Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
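
    The local pre-merging is safe because the function is associative: merging per mapper and then merging the partial results gives the same answer as merging everything at once. A plain-Scala sketch of that equivalence follows; reduceLocal and the two "mapper" sequences are hypothetical stand-ins, not Scio code.

```scala
object ReduceByKeyModel {
  // Hypothetical helper: merge the values of each key within one sequence.
  def reduceLocal[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }

  // Two pieces of the input, as if held by two different mappers.
  val mapper1 = Seq("a" -> 1, "b" -> 2, "a" -> 3)
  val mapper2 = Seq("a" -> 4, "b" -> 5)

  // Step 1: merge locally on each mapper. Step 2: merge the partial results.
  val preMerged = reduceLocal(mapper1)(_ + _).toSeq ++ reduceLocal(mapper2)(_ + _).toSeq
  val merged = reduceLocal(preMerged)(_ + _)

  // Merging all elements in one go gives the same result.
  val direct = reduceLocal(mapper1 ++ mapper2)(_ + _)
}
```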

  45. def rightOuterJoin[W](that: SCollection[(K, W)])(implicit arg0: ClassTag[W]): SCollection[(K, (Option[V], W))]

    Perform a right outer join of this and that. For each element (k, w) in that, the resulting SCollection will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

  46. def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double]): SCollection[(K, V)]

    Return a subset of this SCollection sampled by key (via stratified sampling).

    Create a sample of this SCollection using variable sampling rates for different keys as specified by fractions, a key to sampling rate map, via simple random sampling with one pass over the SCollection, to produce a sample of size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all key values.

    withReplacement

    whether to sample with or without replacement

    fractions

    map of specific keys to sampling rates

    returns

    SCollection containing the sampled subset
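
    The approximate sample size mentioned above is plain arithmetic and can be worked through in a few lines. In this sketch expectedSampleSize and the per-key counts are hypothetical, and it assumes fractions has an entry for every key.

```scala
object SampleByKeySizeModel {
  // Sum of math.ceil(numItems * samplingRate) over all keys.
  def expectedSampleSize[K](counts: Map[K, Long], fractions: Map[K, Double]): Long =
    counts.toSeq.map { case (k, n) => math.ceil(n * fractions(k)).toLong }.sum

  val counts = Map("a" -> 1000L, "b" -> 15L)     // hypothetical number of items per key
  val fractions = Map("a" -> 0.25, "b" -> 0.5)   // key to sampling rate map

  // ceil(1000 * 0.25) + ceil(15 * 0.5) = 250 + 8
  val approxSize: Long = expectedSampleSize(counts, fractions)
}
```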

  47. def sampleByKey(sampleSize: Int): SCollection[(K, Iterable[V])]

    Return a sampled subset of values for each key of this SCollection.

    returns

    a new SCollection of (key, sampled values) pairs

  48. val self: SCollection[(K, V)]

  49. def skewedJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]])(implicit arg0: ClassTag[W]): SCollection[(K, (V, W))]

    N-to-1 skew-proof flavor of PairSCollectionFunctions.join().

    Perform a skewed join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of each key is estimated with a count-min sketch (CMS): with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    A key with more than hotKeyThreshold values is considered hot. Some runners have inefficient GroupByKey implementations for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind.

    cms

    CMS of the left-hand side keys; see com.twitter.algebird.CMSMonoid

    Example:
      import com.twitter.algebird.CMS
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      // K, eps, delta and seed stand for the key type and the CMS parameters
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join

  50. def skewedJoin[W](that: SCollection[(K, W)], hotKeyThreshold: Long, eps: Double, seed: Int, delta: Double = 1E-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit arg0: ClassTag[W], hasher: CMSHasher[K]): SCollection[(K, (V, W))]

    N-to-1 skew-proof flavor of PairSCollectionFunctions.join().

    Perform a skewed join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. The frequency of each key is estimated with a count-min sketch (CMS): with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    A key with more than hotKeyThreshold values is considered hot. Some runners have inefficient GroupByKey implementations for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error in mind. If you sample the input via sampleFraction, make sure to adjust hotKeyThreshold accordingly.

    eps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    left side sample fraction. Default is 1.0 - no sampling.

    withReplacement

    whether to use sampling with replacement, see SCollection.sample()

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, eps = 0.0005, seed = 1)

      Read more about CMS: com.twitter.algebird.CMSMonoid

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join
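
    To get a feel for the advice to keep the upper estimation error in mind, the bound true frequency <= estimate <= true frequency + eps * N can be evaluated for concrete numbers. The sketch below is plain arithmetic: the helper names and the left-hand side size of 10 million elements are hypothetical, while eps and hotKeyThreshold are taken from the example above.

```scala
object SkewedJoinBoundModel {
  // Largest amount by which an estimate may exceed the true frequency: eps * N.
  def maxOverestimate(eps: Double, n: Long): Double = eps * n

  // Smallest true frequency a key can have while its estimate can still
  // reach the threshold (never below zero).
  def minTrueFrequencyOfHotKey(hotKeyThreshold: Long, eps: Double, n: Long): Double =
    math.max(0.0, hotKeyThreshold - maxOverestimate(eps, n))

  val n = 10000000L // hypothetical size of the left-hand side

  // With eps = 0.0005 an estimate may be too high by roughly 5000 ...
  val slack = maxOverestimate(eps = 0.0005, n = n)

  // ... so a key flagged as hot at threshold 8500 may have a true
  // frequency as low as roughly 3500.
  val lowest = minTrueFrequencyOfHotKey(hotKeyThreshold = 8500L, eps = 0.0005, n = n)
}
```

    A smaller eps tightens the bound, and the bound itself only holds with probability 1 - delta.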

  51. def subtractByKey(that: SCollection[K]): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are not in that.
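
    In terms of plain Scala collections this amounts to filtering by a set of excluded keys. The sketch below uses a hypothetical helper subtractByKeyLocal on in-memory sequences and is not Scio code.

```scala
object SubtractByKeyModel {
  // Hypothetical local stand-in for subtractByKey.
  def subtractByKeyLocal[K, V](pairs: Seq[(K, V)], that: Seq[K]): Seq[(K, V)] = {
    val excluded = that.toSet
    pairs.filterNot { case (k, _) => excluded(k) }
  }

  val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)

  // Every pair whose key is "a" is dropped; "z" matches nothing.
  val kept = subtractByKeyLocal(pairs, Seq("a", "z"))
}
```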

  52. def sumByKey(implicit sg: Semigroup[V]): SCollection[(K, V)]

    Reduce by key with Semigroup. This form can be more powerful and better optimized in some cases.

  53. def swap: SCollection[(V, K)]

    Swap the keys with the values.

  54. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  55. def toString(): String

    Definition Classes
    AnyRef → Any
  56. def topByKey(num: Int)(implicit ord: Ordering[V]): SCollection[(K, Iterable[V])]

    Return the top num (largest) values for each key from this SCollection as defined by the implicit Ordering[V].

    returns

    a new SCollection of (key, top num values) pairs
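
    A plain-Scala model of the result, assuming an in-memory Seq in place of an SCollection. The helper topByKeyLocal is hypothetical, and returning the values in descending order is a choice made for this sketch rather than something the description promises.

```scala
object TopByKeyModel {
  // Hypothetical local stand-in for topByKey: keep the num largest values per key.
  def topByKeyLocal[K, V](pairs: Seq[(K, V)], num: Int)(
      implicit ord: Ordering[V]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).sorted(ord.reverse).take(num)
    }

  val scores = Seq("a" -> 3, "a" -> 9, "a" -> 5, "b" -> 1)

  // Key "b" has fewer than two values, so all of them are returned.
  val top2: Map[String, Seq[Int]] = topByKeyLocal(scores, num = 2)
}
```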

  57. def values: SCollection[V]

    Return an SCollection with the values of each tuple.

  58. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  59. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  61. def withHotKeyFanout(hotKeyFanout: Int): SCollectionWithHotKeyFanout[K, V]

    Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.

    hotKeyFanout

    constant value for every key

  62. def withHotKeyFanout(hotKeyFanout: (K) ⇒ Int): SCollectionWithHotKeyFanout[K, V]

    Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.

    hotKeyFanout

    a function from keys to an integer N, where the key will be spread among N intermediate nodes for partial combining. If N is less than or equal to 1, this key will not be sent through an intermediate node.
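
    The two-stage combine can be pictured with plain Scala collections. In the sketch below sumWithFanout is a hypothetical helper that sums Long values, and the round-robin assignment of elements to the N intermediate nodes is an assumption made for illustration, not a statement about how a runner distributes them.

```scala
object HotKeyFanoutModel {
  // Hypothetical local model of a per-key sum with hot key fanout.
  def sumWithFanout[K](pairs: Seq[(K, Long)], hotKeyFanout: K => Int): Map[K, Long] = {
    // Stage 1: spread each key over N intermediate nodes and combine partially.
    // Keys with N <= 1 are not spread and stay on a single node (0).
    val partial: Map[(K, Int), Long] = pairs.zipWithIndex
      .map { case ((k, v), i) =>
        val n = hotKeyFanout(k)
        val node = if (n <= 1) 0 else i % n
        ((k, node), v)
      }
      .groupBy(_._1)
      .map { case (keyAndNode, vs) => keyAndNode -> vs.map(_._2).sum }

    // Stage 2: full combine of the partial results for each original key.
    partial.toSeq.groupBy(_._1._1).map { case (k, ps) => k -> ps.map(_._2).sum }
  }

  val pairs = Seq.fill(10)("hot" -> 1L) ++ Seq("cold" -> 7L)

  // Only the key "hot" is spread, here over 4 intermediate nodes.
  val totals = sumWithFanout(pairs, (k: String) => if (k == "hot") 4 else 1)
}
```

    The totals are the same as for a direct per-key sum; the fanout only changes how much work any single node does for a hot key.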
