Essentially a synonym for mapReduceMap, with the methods collected into a trait.
Approximate number of unique values.
We use about m = (104/errPercent)^2 bytes of memory per key.
Uses .toString.getBytes to serialize the data, so you MUST ensure that .toString is an equivalence on your counted fields (i.e. x.toString == y.toString if and only if x == y).
For each key:
10% error ~ 256 bytes
5% error ~ 1kB
2% error ~ 4kB
1% error ~ 16kB
0.5% error ~ 64kB
0.25% error ~ 256kB
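A minimal sketch of how the m = (104/errPercent)^2 formula relates to the table above, assuming the estimate is rounded up to a power of two (the entries for larger errors include additional rounding, so this sketch matches only the smaller-error rows exactly):

```scala
// Illustrative only: estimate the memory needed for a given error percentage,
// using the m = (104/errPercent)^2 formula, rounded up to a power of two.
object ApproxMemory {
  def approxBytes(errPercent: Double): Long = {
    val m = math.pow(104.0 / errPercent, 2)          // ideal size from the formula
    val bits = math.ceil(math.log(m) / math.log(2))  // round up to the next power of two
    math.pow(2, bits).toLong
  }
}
```

For example, `ApproxMemory.approxBytes(1.0)` gives 16384 (16kB) and `ApproxMemory.approxBytes(0.25)` gives 262144 (256kB), matching the table.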
uses a more stable online algorithm which should be suitable for large numbers of records
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
This may significantly reduce performance of your job. It kills the ability to do map-side aggregation.
This is count with a predicate: only counts the tuples for which fn(tuple) is true.
First do "times" on each pair, then "plus" them all together.
groupBy('x) { _.dot('y,'z, 'ydotz) }
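A plain-Scala sketch of the dot semantics on local collections ("times" each pair, then "plus" the results):

```scala
// Illustrative only: the dot-product shape that dot('y, 'z, 'ydotz) computes per group.
object DotSketch {
  def dot(ys: Seq[Double], zs: Seq[Double]): Double =
    ys.zip(zs).map { case (y, z) => y * z }.sum
}
```

For instance, `DotSketch.dot(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0))` is 1*4 + 2*5 + 3*6 = 32.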
Remove the first cnt elements
Drop while the predicate is true, starting at the first false, output all
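The semantics match the standard collection dropWhile; a plain-Scala illustration:

```scala
// Everything before the first false is dropped; everything from there on is kept,
// even if later elements would satisfy the predicate again.
val xs = List(1, 2, 5, 1, 2)
val dropped = xs.dropWhile(_ < 5)
```

Here `dropped` is `List(5, 1, 2)`: the trailing 1 and 2 survive because dropping stops at the first false.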
Prefer aggregateBy operations!
This is the description of this Grouping in terms of a sequence of Every operations
Prefer reduce or mapReduceMap. foldLeft will force all work to be done on the reducers. If your function is not associative and commutative, foldLeft may be required.
Make sure init is an immutable object.
Init needs to be serializable with Kryo (because we copy it for each grouping to avoid possible errors using a mutable init object).
This cancels map side aggregation and forces everything to the reducers
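A plain-Scala illustration of why a non-associative, non-commutative operation forces a sequential foldLeft rather than a combiner-friendly reduce:

```scala
// Subtraction is neither associative nor commutative, so partial results cannot
// be combined map-side; the values must be folded in order on the reducer.
val result = List(1, 2, 3).foldLeft(100)(_ - _) // ((100 - 1) - 2) - 3
```

`result` is 94; any regrouping of the subtractions would give a different answer, which is exactly what map-side aggregation would do.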
Return the first, useful probably only for the sorted case.
Collect all the values into a List[T] and then operate on that list. This fundamentally uses as much memory as it takes to store the list. This gives you the list in the reverse order it was encountered (it is built as a stack for efficiency reasons). If you care about order, call .reverse in your fn.
STRONGLY PREFER TO AVOID THIS. Try reduce or plus and an O(1) memory algorithm.
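A plain-Scala sketch of why the collected list comes out reversed: building a list as a stack prepends each value in O(1), so the last value seen ends up at the head.

```scala
// Prepending is O(1) on an immutable list, so accumulation is done stack-style;
// the price is that the result is in reverse encounter order.
val collected = List("a", "b", "c").foldLeft(List.empty[String]) { (acc, x) => x :: acc }
val inOrder = collected.reverse // call .reverse if you care about order
```

`collected` is `List("c", "b", "a")` and `inOrder` restores `List("a", "b", "c")`.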
Type T is the type of the input field (input to map, T => X).
Type X is the intermediate type, which your reduce function operates on (reduce is (X,X) => X).
Type U is the final result type (final map is: X => U).
The previous output goes into the reduce function on the left, like foldLeft, so if your operation is faster for the accumulator to be on one side, be aware.
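A plain-Scala sketch of the mapReduceMap shape (map T => X, reduce (X,X) => X, final map X => U), shown here computing an average on a local collection (the helper name is illustrative, not the library API):

```scala
// Illustrative local version of the map/reduce/map shape: the intermediate type X
// carries (sum, count) so the reduce stays associative and commutative.
object MRMSketch {
  def mapReduceMap[T, X, U](xs: Seq[T])(map: T => X)(red: (X, X) => X)(fin: X => U): U =
    fin(xs.map(map).reduceLeft(red))

  val avg: Double = mapReduceMap(Seq(1.0, 2.0, 3.0))(x => (x, 1L)) {
    (a, b) => (a._1 + b._1, a._2 + b._2) // reduce: combine partial (sum, count) pairs
  } { case (s, c) => s / c }             // final map: finish into the average
}
```

Note how the average itself is not associative, but routing it through the (sum, count) intermediate type makes the reduce step combiner-safe.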
Corresponds to a Cascading Buffer which allows you to stream through the data, keeping some, dropping, scanning, etc... The iterator you are passed is lazy, and mapping will not trigger the entire evaluation. If you convert to a list (i.e. to reverse), you need to be aware that memory constraints may become an issue.
Any fields not referenced by the input fields will be aligned to the first output, and the final Hadoop stream will have a length equal to the maximum of the output of this and the input stream. So, if you change the length of your inputs, the other fields won't be aligned. YOU NEED TO INCLUDE ALL THE FIELDS YOU WANT TO KEEP ALIGNED IN THIS MAPPING! POB: This appears to be a Cascading design decision.
mapfn needs to be stateless. Multiple calls need to be safe (no mutable state captured).
these will only be called if a tuple is not passed, meaning just one column
Similar to scala.collection.Iterable.mkString; takes the source and destination fieldname, which should be a single field. The result will be start, each item.toString separated by sep, followed by end. For convenience there are several common variants below.
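The standard-library analogue shows the start/sep/end convention exactly:

```scala
// start, each item.toString separated by sep, followed by end
val joined = List(1, 2, 3).mkString("[", ",", "]")
```

`joined` is `"[1,2,3]"`.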
An identity function that keeps all the tuples. A hack to implement groupAll and groupRandomly.
Opposite of RichPipe.unpivot. See SQL/Excel for more on this; this function converts a row-wise representation into a column-wise one.
pivot(('feature, 'value) -> ('clicks, 'impressions, 'requests))
it will find the feature named "clicks", and put the value in the column with the field named clicks.
Absent fields result in null unless a default value is provided. Unnamed output fields are ignored.
Duplicated fields will result in an error.
if you want more precision, first do a
map('value -> 'value) { x : AnyRef => Option(x) }
and you will have non-nulls for all present values, and Nones for values that were present but previously null. All nulls in the final output will be those truly missing. Similarly, if you want to check if there are any items present that shouldn't be:
map('feature -> 'feature) { fname : String => if (!goodFeatures(fname)) { throw new Exception("ohnoes") } else fname }
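A plain-Scala sketch of the pivot semantics, assuming a local list of (feature, value) rows (the helper and its default handling are illustrative, not the library implementation):

```scala
// Illustrative only: rows of (feature, value) become one row with a column per
// named output field; absent fields fall back to the default (null here).
object PivotSketch {
  def pivot(rows: Seq[(String, String)],
            outFields: Seq[String],
            default: String = null): Seq[String] = {
    val byName = rows.toMap // the real pivot errors on duplicated fields
    outFields.map(f => byName.getOrElse(f, default))
  }
}
```

For example, pivoting `Seq(("clicks", "10"), ("requests", "7"))` into the fields `clicks, impressions, requests` yields `Seq("10", null, "7")`: the missing `impressions` column becomes the default.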
Apply an associative/commutative operation on the left field.
reduce(('mass, 'allids) -> ('totalMass, 'idset)) { (left: (Double, Set[Long]), right: (Double, Set[Long])) => (left._1 + right._1, left._2 ++ right._2) }
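The same pairwise combination shown locally on plain Scala collections:

```scala
// Pairwise-combine (mass, idSet) values: sum the masses, union the id sets.
// Both operations are associative and commutative, so combiners may apply them
// in any grouping.
val rows = List((1.0, Set(1L)), (2.0, Set(2L, 3L)), (0.5, Set(1L)))
val (totalMass, idset) = rows.reduceLeft { (left, right) =>
  (left._1 + right._1, left._2 ++ right._2)
}
```

Here `totalMass` is 3.5 and `idset` is `Set(1L, 2L, 3L)`.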
Equivalent to a mapReduceMap with trivial (identity) map functions.
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
The previous output goes into the reduce function on the left, like foldLeft, so if your operation is faster for the accumulator to be on one side, be aware.
Override the number of reducers used in the groupBy.
Analog of standard scanLeft (@see scala.collection.Iterable.scanLeft). This invalidates map-side aggregation and forces all data to be transferred to reducers. Use only if you REALLY have to.
Make sure init is an immutable object.
init needs to be serializable with Kryo (because we copy it for each grouping to avoid possible errors using a mutable init object). We override the default implementation here to use Kryo to serialize the initial value; for immutable serializable inits this is not needed.
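The standard-library scanLeft shows the semantics: a running fold that emits every intermediate accumulator, starting with init.

```scala
// Running totals, including the init value itself as the first element.
val sums = List(1, 2, 3).scanLeft(0)(_ + _)
```

`sums` is `List(0, 1, 3, 6)`; because each output depends on the previous one, the scan is inherently sequential per group, which is why it cannot be combined map-side.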
Override the description to be used in .dot and MR step names.
How many values are there for this key
Compute the count, average and standard deviation in one pass. Example: g.sizeAveStdev('x -> ('cntx, 'avex, 'stdevx))
This invalidates aggregateBy!
Equivalent to sorting by a comparison function then taking k items. This is MUCH more efficient than doing a total sort followed by a take, since these bounded sorts are done on the mapper, so only a sort of size k is needed.
sortWithTake(('clicks, 'tweet) -> 'topClicks, 5) { (t0: (Long, Long), t1: (Long, Long)) => t0._1 < t1._1 }
topClicks will be a List[(Long,Long)]
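A plain-Scala sketch of the bounded sort-and-take idea (the helper is illustrative, not the library implementation): only k items ever need to be retained, which is what makes the mapper-side version cheap.

```scala
// Illustrative only: keep the top k elements under the ordering lt by folding
// each new element into a k-bounded buffer, instead of totally sorting the input.
object BoundedTake {
  def sortWithTake[T](xs: Seq[T], k: Int)(lt: (T, T) => Boolean): List[T] =
    xs.foldLeft(List.empty[T]) { (top, x) => (x :: top).sortWith(lt).take(k) }
}

// Highest click count first, keeping the top 2 (clicks, tweet) pairs.
val topClicks = BoundedTake.sortWithTake(Seq((5L, 1L), (9L, 2L), (1L, 3L), (7L, 4L)), 2) {
  (t0, t1) => t0._1 > t1._1
}
```

`topClicks` is `List((9L, 2L), (7L, 4L))`; a real implementation would use a bounded priority queue rather than re-sorting, but the memory bound of k is the same.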
Reverse of above when the implicit ordering makes sense.
Same as above but useful when the implicit ordering makes sense.
Override the spill threshold on AggregateBy
The same as sum(fs -> fs)
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
Use Semigroup.plus to compute a sum. Not called sum to avoid conflicting with standard sum.
Your Semigroup[T] should be associative and commutative, else this doesn't make sense.
Assumed to be a commutative operation. If you don't want that, use .forceToReducers
Only keep the first cnt elements
Take while the predicate is true, stopping at the first false. Output all taken elements.
This is a convenience method to allow plugging in blocks of group operations, similar to RichPipe.thenDo.
The same as times(fs -> fs)
Returns the product of all the items in this grouping
Convert a subset of fields into a list of Tuples. You need to provide the types of the tuple fields.
Beginning of a block with access to expensive nonserializable state. The state object should contain a function release() for resource management purposes.
This controls the sequence of reductions that happen inside a particular grouping operation. Not all elements can be combined; for instance, a scanLeft/foldLeft generally requires a sort, but such sorts are (at least for now) incompatible with doing a combine which includes some map-side reductions.