Aggregate the values of each key with Aggregator. First each value V is mapped to A, then we reduce with a Semigroup of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.
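The three stages described above (map V to A, reduce A with a semigroup, present as U) can be sketched with plain Scala collections. This is only a model of the semantics, not Scio code: the helper aggregateWith and its parameters prepare, plus and present are hypothetical names chosen for illustration.

```scala
object AggregatorSketch {
  // Model of an aggregator: prepare maps V to A, plus is the associative
  // semigroup operation on A, present turns the reduced A into U.
  def aggregateWith[K, V, A, U](pairs: Seq[(K, V)])(
      prepare: V => A,
      plus: (A, A) => A,
      present: A => U
  ): Map[K, U] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      k -> present(kvs.map(kv => prepare(kv._2)).reduce(plus))
    }

  // Mean per key: V = Int, A = (sum, count), U = Double.
  val means: Map[String, Double] =
    aggregateWith(Seq("a" -> 1, "a" -> 3, "b" -> 10))(
      (v: Int) => (v.toLong, 1L),
      (x: (Long, Long), y: (Long, Long)) => (x._1 + y._1, x._2 + y._2),
      (a: (Long, Long)) => a._1.toDouble / a._2
    )
}
```

The intermediate type A carries what the final result needs (here a sum and a count), which is why the presented type U can differ from both V and A.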
Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this SCollection, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
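As a rough illustration of the two operations, the following plain-Scala sketch folds values into a U with one function and merges two U's with another. It models the semantics on in-memory collections only; the names zeroValue, seqOp and combOp are illustrative assumptions, and the split into two halves merely stands in for partial results computed on different workers.

```scala
object AggregateByKeySketch {
  // seqOp merges a V into a U; combOp merges two U's.
  def aggregateByKey[K, V, U](pairs: Seq[(K, V)], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U
  ): Map[K, U] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val values = kvs.map(_._2)
      // Pretend the values of this key were processed in two separate parts.
      val (left, right) = values.splitAt(values.size / 2)
      k -> combOp(left.foldLeft(zeroValue)(seqOp), right.foldLeft(zeroValue)(seqOp))
    }

  // Collect the distinct values per key: V = Int, U = Set[Int].
  val distinctPerKey: Map[String, Set[Int]] =
    aggregateByKey(Seq("a" -> 1, "a" -> 1, "a" -> 2, "b" -> 3), Set.empty[Int])(_ + _, _ ++ _)
}
```

The sketch uses an immutable Set, so the zero value can be shared safely; with functions that mutate their first argument, each partial aggregation would need its own fresh zero value.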
Apply a DoFn that processes KVs and wrap the output in an SCollection.
For each key, compute the values' data distribution using approximate N-tiles.
a new SCollection whose values are Iterables of the approximate N-tiles of the elements.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.
Note: the underlying map implementation is runner specific and may have performance overhead. Use asMapSingletonSideInput instead if the resulting map can fit into memory.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, value], to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.
Currently, the resulting map is required to fit into memory. This is preferable to asMapSideInput if that's the case.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs. In contrast to asMapSideInput, it is not required that the keys in the input collection be unique.
Note: the underlying map implementation is runner specific and may have performance overhead. Use asMultiMapSingletonSideInput instead if the resulting map can fit into memory.
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a Map[key, Iterable[value]], to be used with SCollection.withSideInputs. In contrast to asMapSingletonSideInput, it is not required that the keys in the input collection be unique.
Currently, the resulting map is required to fit into memory. This is preferable to asMultiMapSideInput if that's the case.
Batches inputs to a desired batch size. Batches will contain only elements of a single key.
Elements are buffered until batchSize elements have accumulated, at which point they are emitted to the output SCollection.
Windows are preserved (batches contain elements from the same window). Batches may contain elements from more than one bundle.
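A minimal in-memory sketch of per-key batching follows. It ignores windows and bundles entirely, and it assumes that a final incomplete batch is emitted once a key's input is exhausted; that flushing behaviour is an assumption of this model, not something the text above states.

```scala
object BatchByKeySketch {
  // Group values by key, then cut each key's values into chunks of batchSize.
  def batchByKey[K, V](pairs: Seq[(K, V)], batchSize: Int): Seq[(K, Seq[V])] =
    pairs.groupBy(_._1).toSeq.flatMap { case (k, kvs) =>
      kvs.map(_._2).grouped(batchSize).map(batch => (k, batch)).toSeq
    }

  val batches: Seq[(String, Seq[Int])] =
    batchByKey(Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4), batchSize = 2)
}
```

Every emitted batch holds values of exactly one key, and no batch exceeds batchSize elements.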
For each key k in this or rhs1 or rhs2 or rhs3, return a resulting SCollection that contains a tuple with the list of values for that key in this, rhs1, rhs2 and rhs3.
For each key k in this or rhs1 or rhs2, return a resulting SCollection that contains a tuple with the list of values for that key in this, rhs1 and rhs2.
For each key k in this or rhs, return a resulting SCollection that contains a tuple with the list of values for that key in this as well as rhs.
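The two-collection case can be modelled with plain Scala collections as below. This is a sketch of the semantics only; the helper cogroup here is a hypothetical in-memory function, not the Scio method.

```scala
object CogroupSketch {
  // For every key present on either side, pair up the values from both sides.
  def cogroup[K, V, W](lhs: Seq[(K, V)], rhs: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = lhs.groupBy(_._1)
    val r = rhs.groupBy(_._1)
    (l.keySet ++ r.keySet).map { k =>
      val vs = l.getOrElse(k, Seq.empty).map(_._2)
      val ws = r.getOrElse(k, Seq.empty).map(_._2)
      (k, (vs, ws))
    }.toMap
  }

  val result: Map[String, (Seq[Int], Seq[String])] =
    cogroup(Seq("a" -> 1, "a" -> 2, "b" -> 3), Seq("a" -> "x", "c" -> "y"))
}
```

A key that appears on only one side still shows up in the result, with an empty list for the other side.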
Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an SCollection[(K, V)] into a result of type SCollection[(K, C)], for a "combined type" C. Note that V and C can be different -- for example, one might group an SCollection of type (Int, Int) into an SCollection of type (Int, Seq[Int]). Users provide three functions:
- createCombiner, which turns a V into a C (e.g., creates a one-element list)
- mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C's into a single one.
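The interplay of the three functions can be sketched on in-memory collections. The model below splits the input into two hypothetical "bundles" so that mergeCombiners has partial results to merge; how a real runner partitions the data is not specified here and is an assumption of the sketch.

```scala
object CombineByKeySketch {
  def combineByKey[K, V, C](pairs: Seq[(K, V)])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C
  ): Map[K, C] = {
    // Build partial combiners for one bundle: the first value of a key
    // creates the combiner, later values are merged into it.
    def combineBundle(bundle: Seq[(K, V)]): Map[K, C] =
      bundle.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    val (bundle1, bundle2) = pairs.splitAt(pairs.size / 2)
    val p1 = combineBundle(bundle1)
    val p2 = combineBundle(bundle2)
    // Merge the partial combiners of each key across bundles.
    (p1.keySet ++ p2.keySet).map { k =>
      (k, (p1.get(k).toSeq ++ p2.get(k).toSeq).reduce(mergeCombiners))
    }.toMap
  }

  // Group (Int, Int) pairs into (Int, Seq[Int]), as in the example in the text.
  val grouped: Map[Int, Seq[Int]] =
    combineByKey(Seq(1 -> 10, 2 -> 20, 1 -> 11, 1 -> 12))(
      (v: Int) => Seq(v),
      (c: Seq[Int], v: Int) => c :+ v,
      (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2
    )
}
```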
Count approximate number of distinct values for each key in the SCollection.
the maximum estimation error, which should be in the range [0.01, 0.5].
Count approximate number of distinct values for each key in the SCollection.
the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16.
Count the number of elements for each key.
a new SCollection of (key, count) pairs
Return a new SCollection of (key, value) pairs without duplicates based on the keys. The value is taken randomly for each key.
a new SCollection of (key, value) pairs
Pass each value in the key-value pair SCollection through a filter function without changing the keys.
Pass each value in the key-value pair SCollection through a flatMap function without changing the keys.
Return an SCollection having its values flattened.
Fold by key with Monoid, which defines the associative function and "zero value" for V. This could be more powerful and better optimized in some cases.
Merge the values for each key using an associative function and a neutral "zero value", which may be added to the result an arbitrary number of times and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication).
Perform a full outer join of this and rhs. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for w in rhs, or the pair (k, (Some(v), None)) if no elements in rhs have key k. Similarly, for each element (k, w) in rhs, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for v in this, or the pair (k, (None, Some(w))) if no elements in this have key k.
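The rule above can be checked against a small in-memory model. The function below is a hypothetical plain-Scala sketch of the full outer join semantics, not the Scio implementation; the output order is unspecified, so the result is best compared as a set.

```scala
object FullOuterJoinSketch {
  def fullOuterJoin[K, V, W](
      lhs: Seq[(K, V)],
      rhs: Seq[(K, W)]
  ): Seq[(K, (Option[V], Option[W]))] = {
    val l = lhs.groupBy(_._1)
    val r = rhs.groupBy(_._1)
    (l.keySet ++ r.keySet).toSeq.flatMap { k =>
      // A side with no elements for this key contributes a single None.
      val vs: Seq[Option[V]] = l.get(k) match {
        case Some(kvs) => kvs.map(kv => Option(kv._2))
        case None      => Seq(None)
      }
      val ws: Seq[Option[W]] = r.get(k) match {
        case Some(kws) => kws.map(kw => Option(kw._2))
        case None      => Seq(None)
      }
      // All combinations of left and right values for this key.
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
  }

  val result: Seq[(Int, (Option[String], Option[String]))] =
    fullOuterJoin(Seq(1 -> "a", 2 -> "b"), Seq(2 -> "x", 2 -> "y", 3 -> "z"))
}
```

Key 1 exists only on the left, key 3 only on the right, and key 2 on both sides, so the result contains one pair with a None on the right, one with a None on the left, and two fully matched pairs.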
Group the values for each key in the SCollection into a single sequence. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.
Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
Alias for cogroup.
Alias for cogroup.
Alias for cogroup.
Partition this SCollection using K.hashCode() into n partitions.
number of output partitions
partitioned SCollections in a Seq
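A sketch of hash partitioning on in-memory pairs follows. The mapping from hash code to partition index via Math.floorMod is an assumption made for this illustration (it keeps negative hash codes in range); the description above only says that K.hashCode() is used.

```scala
object HashPartitionSketch {
  // Assign each pair to partition floorMod(hashCode, numPartitions).
  def hashPartitionByKey[K, V](pairs: Seq[(K, V)], numPartitions: Int): Seq[Seq[(K, V)]] = {
    val byIndex = pairs.groupBy { case (k, _) => Math.floorMod(k.hashCode(), numPartitions) }
    (0 until numPartitions).map(i => byIndex.getOrElse(i, Seq.empty))
  }

  // Int keys hash to their own value, which makes the assignment easy to follow.
  val parts: Seq[Seq[(Int, String)]] =
    hashPartitionByKey(Seq(0 -> "a", 1 -> "b", 2 -> "c", 3 -> "d", -1 -> "e"), 3)
}
```

All pairs sharing a key land in the same partition, and the number of returned partitions always equals the requested count, even when some are empty.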
Return an SCollection with the pairs from this whose keys are in rhs.
Unlike SCollection.intersection this preserves duplicates in this.
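The duplicate-preserving behaviour is easy to see in a small in-memory model. The helper below is a hypothetical plain-Scala sketch, not the Scio method.

```scala
object IntersectByKeySketch {
  // Keep every pair of lhs whose key occurs in rhs; duplicates in lhs survive,
  // while duplicate keys in rhs have no effect.
  def intersectByKey[K, V](lhs: Seq[(K, V)], rhs: Seq[K]): Seq[(K, V)] = {
    val keys = rhs.toSet
    lhs.filter { case (k, _) => keys.contains(k) }
  }

  val result: Seq[(String, Int)] =
    intersectByKey(Seq("a" -> 1, "a" -> 2, "b" -> 3), Seq("a", "a", "c"))
}
```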
Return an SCollection containing all pairs of elements with matching keys in this and rhs. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in rhs.
Return an SCollection with the keys of each tuple.
Perform a left outer join of this and rhs. For each element (k, v) in this, the resulting SCollection will either contain all pairs (k, (v, Some(w))) for w in rhs, or the pair (k, (v, None)) if no elements in rhs have key k.
Pass each value in the key-value pair SCollection through a map function without changing the keys.
Return the max of values for each key as defined by the implicit Ordering[T].
a new SCollection of (key, maximum value) pairs
Return the min of values for each key as defined by the implicit Ordering[T].
a new SCollection of (key, minimum value) pairs
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
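The local pre-merge can be illustrated with a two-stage reduction over in-memory collections. In this sketch each inner Seq stands in for the data seen by one mapper; the helper names are hypothetical, and the model says nothing about how a real runner distributes data.

```scala
object ReduceByKeySketch {
  // Reduce all values of each key within a single collection of pairs.
  def reduceLocally[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }

  // Stage 1 reduces inside each bundle; stage 2 reduces the partial results.
  def reduceByKey[K, V](bundles: Seq[Seq[(K, V)]])(op: (V, V) => V): Map[K, V] = {
    val partials: Seq[(K, V)] = bundles.flatMap(b => reduceLocally(b)(op).toSeq)
    reduceLocally(partials)(op)
  }

  val bundles: Seq[Seq[(String, Int)]] =
    Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 4, "b" -> 5))

  val sums: Map[String, Int] = reduceByKey(bundles)(_ + _)
}
```

Because the function is associative, reducing within each bundle first and then across bundles gives the same result as reducing all values at once, while fewer intermediate values need to be exchanged.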
Perform a right outer join of this and rhs. For each element (k, w) in rhs, the resulting SCollection will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.
Return a subset of this SCollection sampled by key (via stratified sampling).
Create a sample of this SCollection using variable sampling rates for different keys as specified by fractions, a key to sampling rate map, via simple random sampling with one pass over the SCollection, to produce a sample of size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all key values.
whether to sample with or without replacement
map of specific keys to sampling rates
SCollection containing the sampled subset
Return a sampled subset of values for each key of this SCollection.
a new SCollection of (key, sampled values) pairs
Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.
Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
Return an SCollection with the pairs from this whose keys are in rhs when the cardinality of this >> rhs, but neither can fit in memory (see PairHashSCollectionFunctions.hashIntersectByKey).
Unlike SCollection.intersection this preserves duplicates in this.
An estimate of the number of keys in rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter this in a "map" step before any join. Having a value close to the actual number reduces false positives in the output. When computeExact is set to true, a more accurate estimate of the number of keys in rhs would mean less shuffle when finding the exact value.
Whether or not to directly pass through bloom filter results (with a small false positive rate) or perform an additional inner join to confirm exact result set. By default this is set to false.
A fraction in range (0, 1) which would be the accepted false positive probability for this transform. By default when computeExact is set to false, this reflects the probability that an output element is an incorrect intersect (meaning it may not be present in rhs). When computeExact is set to true, this fraction is used to find the acceptable false positive in the intermediate step before computing exact. Note: having fpProb = 0 doesn't mean an exact computation. This value along with rhsNumKeys is used for creating a BloomFilter.
Inner join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are filtered out before the join.
Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
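The pre-filter-then-join idea behind the sparse inner join can be sketched in plain Scala. Everything here is an illustrative assumption: ToyBloomFilter is a deliberately small stand-in written for this example (it is not com.twitter.algebird.BloomFilter), its size and hash count are fixed rather than derived from a key-count estimate and false positive probability, and keys are restricted to String for simplicity.

```scala
import scala.util.hashing.MurmurHash3

object SparseJoinSketch {
  // Toy Bloom filter over String keys: numHashes seeded hashes into a bit array.
  final class ToyBloomFilter(numBits: Int, numHashes: Int) {
    private val bits = new java.util.BitSet(numBits)
    private def indexes(key: String): Seq[Int] =
      (0 until numHashes).map(seed => Math.floorMod(MurmurHash3.stringHash(key, seed), numBits))
    def add(key: String): Unit = indexes(key).foreach(i => bits.set(i))
    // May be true for a key never added (false positive),
    // but is never false for a key that was added.
    def mightContain(key: String): Boolean = indexes(key).forall(i => bits.get(i))
  }

  def sparseJoin[V, W](lhs: Seq[(String, V)], rhs: Seq[(String, W)]): Seq[(String, (V, W))] = {
    val filter = new ToyBloomFilter(numBits = 1024, numHashes = 3)
    rhs.foreach { case (k, _) => filter.add(k) }
    // "Map" step: drop left elements whose key is definitely not in rhs.
    val candidates = lhs.filter { case (k, _) => filter.mightContain(k) }
    // The exact join on the survivors discards any false positives.
    val r = rhs.groupBy(_._1)
    for {
      (k, v) <- candidates
      (_, w) <- r.getOrElse(k, Seq.empty)
    } yield (k, (v, w))
  }

  val result: Seq[(String, (Int, String))] =
    sparseJoin(Seq("a" -> 1, "b" -> 2, "c" -> 3, "a" -> 4), Seq("a" -> "x", "c" -> "y"))
}
```

A false positive only costs extra work: the element passes the filter, reaches the exact join, and is dropped there. This is why a tighter filter means less shuffle but never changes the joined result.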
Left outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.
Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
Look up values from rhs where rhs is much larger, the keys from this won't fit in memory, and those keys are sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
Look up values from rhs where rhs is much larger, the keys from this won't fit in memory, and those keys are sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs1 and rhs2 before doing a co-group. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of rhs1 and rhs2 in the pre-filter step.
Look up values from rhs where rhs is much larger, the keys from this won't fit in memory, and those keys are sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
Look up values from rhs where rhs is much larger, the keys from this won't fit in memory, and those keys are sparse in rhs. A Bloom Filter of keys in this is used to filter out irrelevant keys in rhs. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in this. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filter rhs before doing a co-group. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of rhs in the pre-filter step.
Right outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.
Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
Return an SCollection with the pairs from this whose keys are not in rhs.
Reduce by key with Semigroup. This could be more powerful and better optimized than reduceByKey in some cases.
Swap the keys with the values.
Return the top num (largest) values for each key from this SCollection as defined by the specified implicit Ordering[T].
a new SCollection of (key, top num values) pairs
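A plain-Scala model of taking the largest values per key is shown below. The helper is hypothetical, and returning each key's values in descending order is a choice made for this sketch; the text above does not state the order of the values in the real result.

```scala
object TopByKeySketch {
  // Keep the num largest values of each key according to the implicit Ordering.
  def topByKey[K, V](pairs: Seq[(K, V)], num: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).sorted(ord.reverse).take(num)
    }

  val top2: Map[String, Seq[Int]] =
    topByKey(Seq("a" -> 3, "a" -> 1, "a" -> 5, "b" -> 2), 2)
}
```

A key with fewer than num values simply keeps all of them.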
Return an SCollection with the values of each tuple.
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
constant value for every key
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
a function from keys to an integer N, where the key will be spread among N intermediate nodes for partial combining. If N is less than or equal to 1, this key will not be sent through an intermediate node.
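The two-stage combine can be modelled in plain Scala as follows. This sketch assumes a sum as the combine operation and spreads the elements of a key over its N sub-keys in round-robin order by input position; how elements are actually assigned to intermediate nodes is not described above and is an assumption here, as are all names in the code.

```scala
object HotKeyFanoutSketch {
  def sumByKeyWithFanout[K](pairs: Seq[(K, Long)], fanout: K => Int): Map[K, Long] = {
    // Stage 1: route each element to one of N sub-keys and combine partially.
    // Keys with fanout <= 1 skip the spreading and use a single sub-key.
    val partial: Map[(K, Int), Long] = pairs.zipWithIndex
      .groupBy { case ((k, _), i) => (k, if (fanout(k) <= 1) 0 else i % fanout(k)) }
      .map { case (subKey, group) => subKey -> group.map(_._1._2).sum }
    // Stage 2: full combine of the partial results per original key.
    partial.toSeq.groupBy(_._1._1).map { case (k, ps) => k -> ps.map(_._2).sum }
  }

  val totals: Map[String, Long] =
    sumByKeyWithFanout(
      Seq("hot" -> 1L, "hot" -> 2L, "hot" -> 3L, "hot" -> 4L, "cold" -> 5L),
      k => if (k == "hot") 2 else 1
    )
}
```

The final totals match a direct per-key sum; the fanout only changes how much work any single intermediate combine has to do for a heavily used key.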
Full outer join for cases when the left collection (this) is much larger than the right collection (rhs) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs) is used to split this into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.
Read more about Bloom Filter: com.twitter.algebird.BloomFilter.
An estimate of the number of keys in the right collection rhs. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number reduces false positives in intermediate steps, which means less shuffle.
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
(Since version 0.8.0) use SCollection[(K, V)]#sparseFullOuterJoin(right, rightNumKeys) instead
Extra functions available on SCollections of (key, value) pairs through an implicit conversion.