This is the list of mapped pipes, just before the (reducing) joinFunction is applied
This function is not type-safe for others to call, but it should never fail: by construction, we never call it with incorrect types. Stronger type safety would be preferable here, but it is unclear how to achieve it, and since this is an internal function, it is not clear that type safety would actually help anyone.
Use Algebird Aggregator to do the reduction
"Smaller" refers to the average number of values per key, not to total size (total size does not matter by itself, but it is clearly related).
Note that from the type signature we see that the right side is iterated (or may be) over and over, but the left side is not. That means that you want the side with fewer values per key on the right. If both sides are similar, no need to worry. If one side is a one-to-one mapping, that should be the "smaller" side.
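To see why the side with fewer values per key belongs on the right, here is a minimal sketch in plain Scala with no Scalding dependency. The names `JoinSketch`, `innerJoinForKey`, and `CountingIterable` are hypothetical and only mirror the shape described above: the left side is an Iterator that is consumed once, while the right side is an Iterable that is traversed again for every left value.

```scala
object JoinSketch {
  // Hypothetical per-key inner join: `left` is a one-shot Iterator,
  // `right` is an Iterable that can be traversed repeatedly.
  def innerJoinForKey[K, V, W](key: K, left: Iterator[V], right: Iterable[W]): Iterator[(K, (V, W))] =
    left.flatMap { v => right.iterator.map { w => (key, (v, w)) } }

  // Wraps an Iterable and counts how many times a traversal is started.
  final class CountingIterable[W](underlying: Iterable[W]) extends Iterable[W] {
    var traversals: Int = 0
    def iterator: Iterator[W] = { traversals += 1; underlying.iterator }
  }
}
```

With three left values for a key, the right side is traversed three times, so the cost of the right side is multiplied by the number of left values for that key.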
Selects all elements except the first n.
Drops the longest prefix of elements that satisfy the given predicate.
.filter(fn).toTypedPipe == .toTypedPipe.filter(fn). It is generally better to avoid going back to a TypedPipe for as long as possible: this minimizes the number of times we go in and out of cascading/hadoop types.
Filters keys on a predicate. More efficient than filter if you are only looking at keys.
This is just shorthand for mapValueStream(identity). It makes sure the planner sees that you want to force a shuffle. Intended for expert tuning.
Use this to get the first value encountered. Prefer this to take(1).
Operate on an Iterator[T] of all the values for each key at one time. Avoid accumulating the whole list in memory if you can. Prefer sum, which is partially executed map-side by default.
Use this when you don't care about the key for the group; otherwise use mapGroup.
This is a special case of mapValueStream, but it can be optimized because it doesn't need all the values for a given key at once. An unoptimized implementation is: mapValueStream { _.map { fn } }, but for Grouped we can avoid resorting to mapValueStream.
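The relationship between the two operations can be sketched with plain Scala collections. Here `Grouped` is a hypothetical type alias (a Map from key to its list of values), not the Scalding class, and both functions are illustrative stand-ins written under that assumption.

```scala
object MapValuesSketch {
  // Hypothetical stand-in for a grouped dataset: each key maps to its values.
  type Grouped[K, V] = Map[K, List[V]]

  // Hands the whole per-key value stream to fn.
  def mapValueStream[K, V, U](g: Grouped[K, V])(fn: Iterator[V] => Iterator[U]): Grouped[K, U] =
    g.map { case (k, vs) => k -> fn(vs.iterator).toList }

  // The unoptimized definition: map each value inside the stream.
  def mapValues[K, V, U](g: Grouped[K, V])(fn: V => U): Grouped[K, U] =
    mapValueStream[K, V, U](g)(_.map(fn))
}
```

Because fn in mapValues only ever sees one value at a time, an implementation is free to apply it without first collecting all the values for a key, which is what makes the optimization possible.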
Reduces with fn, which must be associative and commutative. Like the above, this can be optimized in some Grouped cases. If you don't have a commutative operator, use reduceLeft.
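A small plain-Scala sketch of why commutativity matters. It assumes a simplified model in which each partition is reduced on its own and the partial results are then combined in whatever order they arrive; `ReduceSketch` and `reducePartitions` are hypothetical names, not part of any library.

```scala
object ReduceSketch {
  // Reduces each non-empty partition separately, then combines the partial
  // results in partition order. Assumes at least one non-empty partition.
  def reducePartitions[V](partitions: Seq[Seq[V]])(fn: (V, V) => V): V =
    partitions.filter(_.nonEmpty).map(_.reduce(fn)).reduce(fn)

  val add: (Int, Int) => Int = _ + _             // associative and commutative
  val concat: (String, String) => String = _ + _ // associative, not commutative
}
```

Under this model, add gives the same answer whichever order the partitions arrive in, while concat gives "abc" for partitions ("a", "b"), ("c") but "cab" when the same partitions arrive in the opposite order.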
Like the above, but with a less-than operation for the ordering.
Take the largest k things according to the implicit ordering. Useful for top-k without having to call ord.reverse
This implements bottom-k (smallest k items) on each mapper for each key, then sends those to reducers to get the result. This is faster than using .take if k * (number of Keys) is small enough to fit in memory.
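The two-stage idea can be sketched in plain Scala. This is an illustration under simplifying assumptions, not the library's implementation: each inner Seq stands for the records seen by one mapper, `BottomKSketch`, `bottomK`, and `bottomKPerKey` are hypothetical names, and sorting is used where a real implementation would more likely keep a bounded priority queue.

```scala
object BottomKSketch {
  // Keeps the k smallest values (used both map-side and reduce-side).
  def bottomK[V](values: Iterable[V], k: Int)(implicit ord: Ordering[V]): List[V] =
    values.toList.sorted(ord).take(k)

  // Stage 1: per-key bottom-k on each mapper.
  // Stage 2: merge the partial lists per key and take bottom-k again.
  def bottomKPerKey[K, V](mappers: Seq[Seq[(K, V)]], k: Int)(implicit ord: Ordering[V]): Map[K, List[V]] = {
    val partials: Seq[(K, List[V])] = mappers.flatMap { part =>
      part.groupBy(_._1).map { case (key, kvs) => key -> bottomK(kvs.map(_._2), k) }
    }
    partials.groupBy(_._1).map { case (key, lists) => key -> bottomK(lists.flatMap(_._2), k) }
  }
}
```

Each mapper holds at most k values per key, which is why the approach depends on k * (number of keys) fitting in memory; in exchange, at most k values per key per mapper need to be sent on.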
If there is no ordering, we default to assuming the Semigroup is commutative. If you don't want that, define an ordering on the Values, or use .forceToReducers.
Semigroups MAY have a faster implementation of sum for iterators, so prefer using sum/sumLeft to reduce/reduceLeft
Selects the first n elements. Don't use this if n == 1; head is faster in that case.
Takes the longest prefix of elements that satisfy the given predicate.
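The prefix-based operations behave like their counterparts on a standard Scala Iterator, which the following sketch uses as a stand-in for the values of a single key (an assumption made for illustration; `PrefixSketch` is a hypothetical name). The point to notice is that takeWhile and dropWhile stop at the first element that fails the predicate, unlike filter.

```scala
object PrefixSketch {
  // Values for one key, in the order they are encountered.
  val values: List[Int] = List(1, 2, 5, 3, 8)

  val firstTwo: List[Int] = values.iterator.take(2).toList             // first n elements
  val smallPrefix: List[Int] = values.iterator.takeWhile(_ < 4).toList // stops at 5
  val afterPrefix: List[Int] = values.iterator.dropWhile(_ < 4).toList // starts at 5
  val allSmall: List[Int] = values.filter(_ < 4)                       // also keeps the later 3
}
```

Since the result depends on the order in which values are encountered, these operations are most meaningful when the values for each key arrive in a defined order.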
Never mutates this; instead returns a new item.