GROUP_KEY
- Type of group key
GROUP_VALUE
- Type of values to group
AGG_VALUE
- Type of agg values to group
OUT
- Type of output object
@Beta public interface ReducibleAggregator<GROUP_KEY,GROUP_VALUE,AGG_VALUE,OUT>
This interface is generally preferable to the Aggregator interface because it performs better.
An Aggregator shuffles all data across the cluster before aggregating, whereas
this interface aggregates both before and after the shuffle.
This reduces the amount of data that needs to be sent over the network, as well as reducing the amount
of memory required to perform the aggregation.
For example, to compute the average of the values for each group key, the plugin first groups the values by
group key. Suppose the input is divided into the following splits:
Split 1: (key1, 1), (key1, 2), (key1, 3), (key2, 4)
Split 2: (key1, 2), (key1, 3), (key1, 4), (key2, 4)
Split 3: (key1, 3), (key1, 4), (key1, 5), (key2, 4)
First, the initializeAggregateValue method is called in each split to create an agg value with the following info:
(sum: value, count: num)
The mergeValues method is then called in each split to produce the following:
Split 1: (key1, sum: 6, count: 3), (key2, sum: 4, count: 1)
Split 2: (key1, sum: 9, count: 3), (key2, sum: 4, count: 1)
Split 3: (key1, sum: 12, count: 3), (key2, sum: 4, count: 1)
Data is then shuffled across the cluster such that records with the same key are handled by the same executor:
Split 4: (key1, sum: 6, count: 3), (key1, sum: 9, count: 3), (key1, sum: 12, count: 3)
Split 5: (key2, sum: 4, count: 1), (key2, sum: 4, count: 1), (key2, sum: 4, count: 1)
The mergePartitions function is called to generate:
Split 4: (key1, sum: 27, count: 9)
Split 5: (key2, sum: 12, count: 3)
Finally, the finalize method is called to generate the final output value(s):
Split 4: (key1, avg: 3)
Split 5: (key2, avg: 4)

Modifier and Type | Method and Description |
---|---|
void | finalize(GROUP_KEY groupKey, AGG_VALUE groupValue, Emitter<OUT> emitter): Finalize the grouped object for the group key into zero or more output objects. |
void | groupBy(GROUP_VALUE groupValue, Emitter<GROUP_KEY> emitter): Emit the group key(s) for a given input value. |
AGG_VALUE | initializeAggregateValue(GROUP_VALUE val): Initialize the aggregated value based on the given value. |
AGG_VALUE | mergePartitions(AGG_VALUE value1, AGG_VALUE value2): Merge the given aggregated values from each split to a final aggregated value. |
AGG_VALUE | mergeValues(AGG_VALUE aggValue, GROUP_VALUE value): Merge the given values to a single value. |
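
The averaging walkthrough in the class description can be sketched in code. This is a minimal illustration, not the library's implementation: Emitter and ReducibleAggregator are declared locally as stand-ins that mirror the signatures in the method summary, and Rec, SumCount, AverageAggregator, run, and exampleSplits are hypothetical names. The run driver is also an assumption about how a framework might call the methods (initializeAggregateValue for the first value of a key in a split, mergeValues for the rest, mergePartitions after the shuffle).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Emitter and ReducibleAggregator here are local stand-ins mirroring the
// signatures in the method summary; they are NOT the real library types.
class AverageExample {
  interface Emitter<T> { void emit(T value); }

  interface ReducibleAggregator<GROUP_KEY, GROUP_VALUE, AGG_VALUE, OUT> {
    void groupBy(GROUP_VALUE groupValue, Emitter<GROUP_KEY> emitter) throws Exception;
    AGG_VALUE initializeAggregateValue(GROUP_VALUE val) throws Exception;
    AGG_VALUE mergeValues(AGG_VALUE aggValue, GROUP_VALUE value) throws Exception;
    AGG_VALUE mergePartitions(AGG_VALUE value1, AGG_VALUE value2) throws Exception;
    void finalize(GROUP_KEY groupKey, AGG_VALUE groupValue, Emitter<OUT> emitter) throws Exception;
  }

  // Hypothetical input record: (key, value).
  static final class Rec {
    final String key; final long value;
    Rec(String key, long value) { this.key = key; this.value = value; }
  }

  // Hypothetical aggregate value: (sum, count).
  static final class SumCount {
    final long sum; final long count;
    SumCount(long sum, long count) { this.sum = sum; this.count = count; }
  }

  static final class AverageAggregator implements ReducibleAggregator<String, Rec, SumCount, Double> {
    @Override public void groupBy(Rec r, Emitter<String> emitter) { emitter.emit(r.key); }
    @Override public SumCount initializeAggregateValue(Rec r) { return new SumCount(r.value, 1); }
    @Override public SumCount mergeValues(SumCount agg, Rec r) {
      return new SumCount(agg.sum + r.value, agg.count + 1);
    }
    @Override public SumCount mergePartitions(SumCount a, SumCount b) {
      return new SumCount(a.sum + b.sum, a.count + b.count);
    }
    @Override public void finalize(String key, SumCount agg, Emitter<Double> emitter) {
      emitter.emit((double) agg.sum / agg.count);
    }
  }

  // Assumed driver: aggregates within each split, then merges partials per key.
  static Map<String, Double> run(List<List<Rec>> splits) {
    AverageAggregator agg = new AverageAggregator();
    Map<String, SumCount> merged = new LinkedHashMap<>();
    for (List<Rec> split : splits) {
      // Before the shuffle: one partial aggregate per key within this split.
      Map<String, SumCount> partial = new LinkedHashMap<>();
      for (Rec r : split) {
        List<String> keys = new ArrayList<>();
        agg.groupBy(r, keys::add);
        for (String key : keys) {
          SumCount current = partial.get(key);
          partial.put(key, current == null
              ? agg.initializeAggregateValue(r) : agg.mergeValues(current, r));
        }
      }
      // After the shuffle: partials with the same key are merged together.
      for (Map.Entry<String, SumCount> e : partial.entrySet()) {
        SumCount current = merged.get(e.getKey());
        merged.put(e.getKey(), current == null
            ? e.getValue() : agg.mergePartitions(current, e.getValue()));
      }
    }
    Map<String, Double> out = new LinkedHashMap<>();
    for (Map.Entry<String, SumCount> e : merged.entrySet()) {
      agg.finalize(e.getKey(), e.getValue(), avg -> out.put(e.getKey(), avg));
    }
    return out;
  }

  // The three splits from the class description.
  static List<List<Rec>> exampleSplits() {
    return Arrays.asList(
        Arrays.asList(new Rec("key1", 1), new Rec("key1", 2), new Rec("key1", 3), new Rec("key2", 4)),
        Arrays.asList(new Rec("key1", 2), new Rec("key1", 3), new Rec("key1", 4), new Rec("key2", 4)),
        Arrays.asList(new Rec("key1", 3), new Rec("key1", 4), new Rec("key1", 5), new Rec("key2", 4)));
  }

  public static void main(String[] args) {
    System.out.println(run(exampleSplits())); // {key1=3.0, key2=4.0}
  }
}
```

Only the partial (sum, count) pairs cross the simulated shuffle here: six partial aggregates instead of the twelve input records.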
void groupBy(GROUP_VALUE groupValue, Emitter<GROUP_KEY> emitter) throws Exception
groupValue
- the value to group
emitter
- the emitter to emit zero or more group keys for the input
Exception
- if there is some error getting the group
AGG_VALUE initializeAggregateValue(GROUP_VALUE val) throws Exception
val
- the value to group
Exception
AGG_VALUE mergeValues(AGG_VALUE aggValue, GROUP_VALUE value) throws Exception
aggValue
- the aggregated value which contains the current aggregated information
value
- the value to merge
Exception
AGG_VALUE mergePartitions(AGG_VALUE value1, AGG_VALUE value2) throws Exception
value1
- the aggregated value to merge
value2
- the aggregated value to merge
Exception
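
This page does not say in what order partial aggregates reach mergePartitions. Assuming they may arrive in any order or grouping, the merge should give the same result either way. The sketch below shows why an aggregate that carries a sum and a count satisfies this, while averaging per-partition averages does not. The class MergeOrderExample, its helpers, and the sample numbers are all hypothetical.

```java
import java.util.Arrays;

class MergeOrderExample {
  // Hypothetical partial aggregate stored as {sum, count}.
  static long[] merge(long[] a, long[] b) {
    return new long[] {a[0] + b[0], a[1] + b[1]};
  }

  static double average(long[] agg) {
    return (double) agg[0] / agg[1];
  }

  public static void main(String[] args) {
    // Made-up partials with unequal counts.
    long[] a = {6, 3};
    long[] b = {9, 3};
    long[] c = {5, 1};

    // Two different merge orders both give sum 20, count 7.
    long[] leftFirst = merge(merge(a, b), c);
    long[] rightFirst = merge(a, merge(c, b));
    System.out.println(Arrays.equals(leftFirst, rightFirst)); // true

    // Averaging the per-partition averages weights each partition equally,
    // which differs from the average over all 7 underlying values.
    double naive = (average(a) + average(b) + average(c)) / 3;
    System.out.println(naive == average(leftFirst)); // false
  }
}
```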
void finalize(GROUP_KEY groupKey, AGG_VALUE groupValue, Emitter<OUT> emitter) throws Exception
groupKey
- the key for the group
groupValue
- the group value associated with the group key
emitter
- the emitter to emit finalized values for the group
Exception
- if there is some error aggregating
Copyright © 2021 Cask Data, Inc. Licensed under the Apache License, Version 2.0.