PairSCollectionFunctions

Instance Constructors

new PairSCollectionFunctions(self: SCollection[(K, V)])

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def aggregateByKey[A, U](aggregator: Aggregator[V, A, U])(implicit arg0: Coder[A], arg1: Coder[U], koder: Coder[K], voder: Coder[V]): SCollection[(K, U)]

Aggregate the values of each key with Aggregator.
Aggregate the values of each key with Aggregator. First each value V is mapped to A, then we reduce with a Semigroup of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: Coder[U], koder: Coder[K], voder: Coder[V]): SCollection[(K, U)]

Aggregate the values of each key, using given combine functions and a neutral "zero value".
Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this SCollection, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a newU.
def applyPerKeyDoFn[U](t: DoFn[KV[K, V], KV[K, U]])(implicit arg0: Coder[U], koder: Coder[K], vcoder: Coder[V]): SCollection[(K, U)]

Apply a DoFn that processes KVs and wrap the output in an SCollection.
def approxQuantilesByKey(numQuantiles: Int)(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, Iterable[V])]
def approxQuantilesByKey(numQuantiles: Int, ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

For each key, compute the values' data distribution using approximate N-tiles.
For each key, compute the values' data distribution using approximate N-tiles.
returns
a new SCollection whose values are Iterables of the approximate N-tiles of the elements.
final def asInstanceOf[T0]: T0

Definition Classes
Any
def asMapSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, V]]

Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.
Note: the underlying map implementation is runner specific and may have performance overhead. Use asMapSingletonSideInput instead if the resulting map can fit into memory.
def asMapSingletonSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, V]]

Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.
Currently, the resulting map is required to fit into memory. This is preferable to asMapSideInput if that's the case.
def asMultiMapSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, Iterable[V]]]

Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs. In contrast to asMapSideInput, it is not required that the keys in the input collection be unique.
Note: the underlying map implementation is runner specific and may have performance overhead. Use asMultiMapSingletonSideInput instead if the resulting map can fit into memory.
def asMultiMapSingletonSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, Iterable[V]]]

Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs. In contrast to asMapSingletonSideInput, it is not required that the keys in the input collection be unique.
Currently, the resulting map is required to fit into memory. This is preferable to asMultiMapSideInput if that's the case.
def batchByKey(batchSize: Long)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

Batches inputs to a desired batch size.
Batches inputs to a desired batch size. Batches will contain only elements of a single key.
Elements are buffered until there are batchSize elements buffered, at which point they are outputed to the output SCollection.
Windows are preserved (batches contain elements from the same window). Batches may contain elements from more than one bundle.
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def cogroup[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)])(implicit arg0: Coder[W1], arg1: Coder[W2], arg2: Coder[W3], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

For each key k in this or rhs1 or rhs2 or rhs3, return a resulting SCollection that contains a tuple with the list of values for that key in this, rhs1, rhs2 and rhs3.
def cogroup[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)])(implicit arg0: Coder[W1], arg1: Coder[W2], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

For each key k in this or rhs1 or rhs2, return a resulting SCollection that contains a tuple with the list of values for that key in this, rhs1 and rhs2.
def cogroup[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W]))]

For each key k in this or rhs, return a resulting SCollection that contains a tuple with the list of values for that key in this as well as rhs.
def combineByKey[C](createCombiner: (V) ⇒ C)(mergeValue: (C, V) ⇒ C)(mergeCombiners: (C, C) ⇒ C)(implicit arg0: Coder[C], koder: Coder[K], voder: Coder[V]): SCollection[(K, C)]

Generic function to combine the elements for each key using a custom set of aggregation functions.
Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an SCollection[(K, V)] into a result of type SCollection[(K, C)], for a "combined type" C Note that V and C can be different -- for example, one might group an SCollection of type (Int, Int) into an SCollection of type (Int, Seq[Int]). Users provide three functions:
- createCombiner, which turns a V into a C (e.g., creates a one-element list)
- mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C's into a single one.
def countApproxDistinctByKey(maximumEstimationError: Double = 0.02)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

Count approximate number of distinct values for each key in the SCollection.
Count approximate number of distinct values for each key in the SCollection.
maximumEstimationError
the maximum estimation error, which should be in the range [0.01, 0.5].
def countApproxDistinctByKey(sampleSize: Int)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

Count approximate number of distinct values for each key in the SCollection.
Count approximate number of distinct values for each key in the SCollection.
sampleSize
the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16.
def countByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

Count the number of elements for each key.
Count the number of elements for each key.
returns
a new SCollection of (key, count) pairs
def distinctByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Return a new SCollection of (key, value) pairs without duplicates based on the keys.
Return a new SCollection of (key, value) pairs without duplicates based on the keys. The value is taken randomly for each key.
returns
a new SCollection of (key, value) pairs
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def filterValues(f: (V) ⇒ Boolean)(implicit koder: Coder[K]): SCollection[(K, V)]

Pass each value in the key-value pair SCollection through a filter function without changing the keys.
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def flatMapValues[U](f: (V) ⇒ TraversableOnce[U])(implicit arg0: Coder[U], koder: Coder[K]): SCollection[(K, U)]

Pass each value in the key-value pair SCollection through a flatMap function without changing the keys.
def flattenValues[U](implicit arg0: Coder[U], ev: <:<[V, TraversableOnce[U]], koder: Coder[K]): SCollection[(K, U)]

Return an SCollection having its values flattened.
def foldByKey(implicit mon: Monoid[V], koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Fold by key with Monoid, which defines the associative function and "zero value" for V.
Fold by key with Monoid, which defines the associative function and "zero value" for V. This could be more powerful and better optimized in some cases.
def foldByKey(zeroValue: V)(op: (V, V) ⇒ V)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
def fullOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

Perform a full outer join of this and rhs.
Perform a full outer join of this and rhs. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for w in rhs, or the pair (k, (Some(v), None)) if no elements in rhs have key k. Similarly, for each element (k, w) in rhs, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for v in this, or the pair (k, (None, Some(w))) if no elements in this have key k.
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def groupByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

Group the values for each key in the SCollection into a single sequence.
Group the values for each key in the SCollection into a single sequence. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.
Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
def groupWith[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)])(implicit arg0: Coder[W1], arg1: Coder[W2], arg2: Coder[W3], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

Alias for cogroup.
def groupWith[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)])(implicit arg0: Coder[W1], arg1: Coder[W2], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

Alias for cogroup.
def groupWith[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W]))]

Alias for cogroup.
def hashCode(): Int

Definition Classes
AnyRef → Any
def hashPartitionByKey(numPartitions: Int): Seq[SCollection[(K, V)]]

Partition this SCollection using K.hashCode() into n partitions
Partition this SCollection using K.hashCode() into n partitions
numPartitions
number of output partitions
returns
partitioned SCollections in a Seq
def intersectByKey(rhs: SCollection[K])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Return an SCollection with the pairs from this whose keys are in rhs.
Return an SCollection with the pairs from this whose keys are in rhs.
Unlike SCollection.intersection this preserves duplicates in this.
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def join[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

Return an SCollection containing all pairs of elements with matching keys in this and rhs.
Return an SCollection containing all pairs of elements with matching keys in this and rhs. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in rhs.
def keys(implicit koder: Coder[K], voder: Coder[V]): SCollection[K]

Return an SCollection with the keys of each tuple.
def leftOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

Perform a left outer join of this and rhs.
Perform a left outer join of this and rhs. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (v, Some(w))) for w in rhs, or the pair (k, (v, None)) if no elements in rhs have key k.
def mapValues[U](f: (V) ⇒ U)(implicit arg0: Coder[U], koder: Coder[K]): SCollection[(K, U)]

Pass each value in the key-value pair SCollection through a map function without changing the keys.
def maxByKey(ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]
def maxByKey(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, V)]

Return the max of values for each key as defined by the implicit Ordering[T].
Return the max of values for each key as defined by the implicit Ordering[T].
returns
a new SCollection of (key, maximum value) pairs
def minByKey(ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]
def minByKey(implicit koder: Coder[K], voder: Coder[V], ord: Ordering[V]): SCollection[(K, V)]

Return the min of values for each key as defined by the implicit Ordering[T].
Return the min of values for each key as defined by the implicit Ordering[T].
returns
a new SCollection of (key, minimum value) pairs
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def reduceByKey(op: (V, V) ⇒ V)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Merge the values for each key using an associative reduce function.
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
def rightOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], W))]

Perform a right outer join of this and rhs.
Perform a right outer join of this and rhs. For each element (k, w) in rhs, the resulting SCollection will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.
def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Return a subset of this SCollection sampled by key (via stratified sampling).
Return a subset of this SCollection sampled by key (via stratified sampling).
Create a sample of this SCollection using variable sampling rates for different keys as specified by fractions, a key to sampling rate map, via simple random sampling with one pass over the SCollection, to produce a sample of size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all key values.
withReplacement
whether to sample with or without replacement
fractions
map of specific keys to sampling rates
returns
SCollection containing the sampled subset
def sampleByKey(sampleSize: Int)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

Return a sampled subset of values for each key of this SCollection.
Return a sampled subset of values for each key of this SCollection.
returns
a new SCollection of (key, sampled values) pairs
val self: SCollection[(K, V)]
def sparseFullOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.
Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
rhsNumKeys
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters rhs Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
def sparseIntersectByKey(rhs: SCollection[K], rhsNumKeys: Long, computeExact: Boolean = false, fpProb: Double = 0.01)(implicit koder: Coder[K], voder: Coder[V], hash: Hash128[K]): SCollection[(K, V)]

Return an SCollection with the pairs from this whose keys are in rhs when the cardinality of this >> rhs, but neither can fit in memory (see PairHashSCollectionFunctions.hashIntersectByKey).
Return an SCollection with the pairs from this whose keys are in rhs when the cardinality of this >> rhs, but neither can fit in memory (see PairHashSCollectionFunctions.hashIntersectByKey).
Unlike SCollection.intersection this preserves duplicates in this.
rhsNumKeys
An estimate of the number of keys in rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter this in a "map" step before any join. Having a value close to the actual number improves the false positives in output. When computeExact is set to true, a more accurate estimate of the number of keys in rhs would mean less shuffle when finding the exact value.
computeExact
Whether or not to directly pass through bloom filter results (with a small false positive rate) or perform an additional inner join to confirm exact result set. By default this is set to false.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability for this transform. By default when computeExact is set to false, this reflects the probability that an output element is an incorrect intersect (meaning it may not be present in rhs) When computeExact is set to true, this fraction is used to find the acceptable false positive in the intermediate step before computing exact. Note: having fpProb = 0 doesn't mean an exact computation. This value along with rhsNumKeys is used for creating a BloomFilter.
def sparseJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

Inner join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.
Inner join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are filtered out before the join. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
rhsNumKeys
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
def sparseLeftOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

Left outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.
Left outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
rhsNumKeys
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long)(implicit arg0: Coder[A], arg1: Coder[B], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A], Iterable[B]))]

Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs.
Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
thisNumKeys
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long, fpProb: Double)(implicit arg0: Coder[A], arg1: Coder[B], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A], Iterable[B]))]

Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs.
Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
thisNumKeys
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs1 and rhs2 before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of rhs1 and rhs2 in the pre-filter step.
def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long)(implicit arg0: Coder[A], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A]))]

Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs.
Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
thisNumKeys
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long, fpProb: Double)(implicit arg0: Coder[A], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A]))]

Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs.
Look up values from rhs where rhs is much larger and keys from this wont fit in memory, and is sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
thisNumKeys
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of rhs in the pre-filter step.
def sparseRightOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], W))]

Right outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.
Right outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
rhsNumKeys
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
def subtractByKey(rhs: SCollection[K])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Return an SCollection with the pairs from this whose keys are not in rhs.
def sumByKey(sg: Semigroup[V])(implicit koder: Coder[K], voder: Coder[V], d: DummyImplicit): SCollection[(K, V)]
def sumByKey(implicit sg: Semigroup[V], koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

Reduce by key with Semigroup.
Reduce by key with Semigroup. This could be more powerful and better optimized than reduceByKey in some cases.
def swap(implicit koder: Coder[K], voder: Coder[V]): SCollection[(V, K)]

Swap the keys with the values.
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
def topByKey(num: Int, ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]
def topByKey(num: Int)(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, Iterable[V])]

Return the top num (largest) values for each key from this SCollection as defined by the specified implicit Ordering[T].
Return the top num (largest) values for each key from this SCollection as defined by the specified implicit Ordering[T].
returns
a new SCollection of (key, top num values) pairs
def values(implicit voder: Coder[V]): SCollection[V]

Return an SCollection with the values of each tuple.
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def withHotKeyFanout(hotKeyFanout: Int)(implicit koder: Coder[K], voder: Coder[V]): SCollectionWithHotKeyFanout[K, V]

Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
hotKeyFanout
constant value for every key
def withHotKeyFanout(hotKeyFanout: (K) ⇒ Int)(implicit koder: Coder[K], voder: Coder[V]): SCollectionWithHotKeyFanout[K, V]

Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
hotKeyFanout
a function from keys to an integer N, where the key will be spread among N intermediate nodes for partial combining. If N is less than or equal to 1, this key will not be sent through an intermediate node.

Deprecated Value Members

def sparseOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.
Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
rhsNumKeys
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.

Annotations
@deprecated
Deprecated
(Since version 0.8.0) use SCollection[(K, V)]#sparseFullOuterJoin(right, rightNumKeys) instead

Related Doc: package values

class PairSCollectionFunctions[K, V] extends AnyRef

Instance Constructors

new PairSCollectionFunctions(self: SCollection[(K, V)])

Value Members

final def !=(arg0: Any): Boolean

final def ##(): Int

final def ==(arg0: Any): Boolean

def aggregateByKey[A, U](aggregator: Aggregator[V, A, U])(implicit arg0: Coder[A], arg1: Coder[U], koder: Coder[K], voder: Coder[V]): SCollection[(K, U)]

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: Coder[U], koder: Coder[K], voder: Coder[V]): SCollection[(K, U)]

def applyPerKeyDoFn[U](t: DoFn[KV[K, V], KV[K, U]])(implicit arg0: Coder[U], koder: Coder[K], vcoder: Coder[V]): SCollection[(K, U)]

def approxQuantilesByKey(numQuantiles: Int)(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, Iterable[V])]

def approxQuantilesByKey(numQuantiles: Int, ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

final def asInstanceOf[T0]: T0

def asMapSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, V]]

def asMapSingletonSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, V]]

def asMultiMapSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, Iterable[V]]]

def asMultiMapSingletonSideInput(implicit koder: Coder[K], voder: Coder[V]): SideInput[Map[K, Iterable[V]]]

def batchByKey(batchSize: Long)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

def clone(): AnyRef

def cogroup[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)])(implicit arg0: Coder[W1], arg1: Coder[W2], arg2: Coder[W3], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def cogroup[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)])(implicit arg0: Coder[W1], arg1: Coder[W2], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W]))]

def combineByKey[C](createCombiner: (V) ⇒ C)(mergeValue: (C, V) ⇒ C)(mergeCombiners: (C, C) ⇒ C)(implicit arg0: Coder[C], koder: Coder[K], voder: Coder[V]): SCollection[(K, C)]

def countApproxDistinctByKey(maximumEstimationError: Double = 0.02)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

def countApproxDistinctByKey(sampleSize: Int)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

def countByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Long)]

def distinctByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def filterValues(f: (V) ⇒ Boolean)(implicit koder: Coder[K]): SCollection[(K, V)]

def finalize(): Unit

def flatMapValues[U](f: (V) ⇒ TraversableOnce[U])(implicit arg0: Coder[U], koder: Coder[K]): SCollection[(K, U)]

def flattenValues[U](implicit arg0: Coder[U], ev: <:<[V, TraversableOnce[U]], koder: Coder[K]): SCollection[(K, U)]

def foldByKey(implicit mon: Monoid[V], koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def foldByKey(zeroValue: V)(op: (V, V) ⇒ V)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def fullOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

final def getClass(): Class[_]

def groupByKey(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

def groupWith[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)])(implicit arg0: Coder[W1], arg1: Coder[W2], arg2: Coder[W3], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def groupWith[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)])(implicit arg0: Coder[W1], arg1: Coder[W2], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def groupWith[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Iterable[V], Iterable[W]))]

def hashCode(): Int

def hashPartitionByKey(numPartitions: Int): Seq[SCollection[(K, V)]]

def intersectByKey(rhs: SCollection[K])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

final def isInstanceOf[T0]: Boolean

def join[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

def keys(implicit koder: Coder[K], voder: Coder[V]): SCollection[K]

def leftOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

def mapValues[U](f: (V) ⇒ U)(implicit arg0: Coder[U], koder: Coder[K]): SCollection[(K, U)]

def maxByKey(ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def maxByKey(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, V)]

def minByKey(ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def minByKey(implicit koder: Coder[K], voder: Coder[V], ord: Ordering[V]): SCollection[(K, V)]

final def ne(arg0: AnyRef): Boolean

final def notify(): Unit

final def notifyAll(): Unit

def reduceByKey(op: (V, V) ⇒ V)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def rightOuterJoin[W](rhs: SCollection[(K, W)])(implicit arg0: Coder[W], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], W))]

def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def sampleByKey(sampleSize: Int)(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

val self: SCollection[(K, V)]

def sparseFullOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], Option[W]))]

def sparseIntersectByKey(rhs: SCollection[K], rhsNumKeys: Long, computeExact: Boolean = false, fpProb: Double = 0.01)(implicit koder: Coder[K], voder: Coder[V], hash: Hash128[K]): SCollection[(K, V)]

def sparseJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, W))]

def sparseLeftOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Option[W]))]

def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long)(implicit arg0: Coder[A], arg1: Coder[B], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A], Iterable[B]))]

def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long, fpProb: Double)(implicit arg0: Coder[A], arg1: Coder[B], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A], Iterable[B]))]

def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long)(implicit arg0: Coder[A], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A]))]

def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long, fpProb: Double)(implicit arg0: Coder[A], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (V, Iterable[A]))]

def sparseRightOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit arg0: Coder[W], hash: Hash128[K], koder: Coder[K], voder: Coder[V]): SCollection[(K, (Option[V], W))]

def subtractByKey(rhs: SCollection[K])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def sumByKey(sg: Semigroup[V])(implicit koder: Coder[K], voder: Coder[V], d: DummyImplicit): SCollection[(K, V)]

def sumByKey(implicit sg: Semigroup[V], koder: Coder[K], voder: Coder[V]): SCollection[(K, V)]

def swap(implicit koder: Coder[K], voder: Coder[V]): SCollection[(V, K)]

final def synchronized[T0](arg0: ⇒ T0): T0

def toString(): String

def topByKey(num: Int, ord: Ordering[V])(implicit koder: Coder[K], voder: Coder[V]): SCollection[(K, Iterable[V])]

def topByKey(num: Int)(implicit ord: Ordering[V], koder: Coder[K], voder: Coder[V], dummy: DummyImplicit): SCollection[(K, Iterable[V])]