A central moment is the expected value of a specified power of the deviation of a random variable from the mean. Central moments are often used to characterize the shape of a distribution.
This class implements online, one-pass algorithms for computing the central moments of a set of points.
Behavior: the result is Double.NaN when the column contains Double.NaN values.
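As a rough illustration of what an online, one-pass update can look like, the sketch below tracks only the count, the mean, and the sum of squared deviations (M2), using the well-known Welford update plus a pairwise merge step. The names `Moments`, `update`, `merge`, and `variance` are hypothetical and are not taken from this class, which also covers higher-order moments.

```scala
// Hypothetical sketch: one-pass tracking of count, mean, and M2 (sum of squared deviations).
final case class Moments(n: Long, mean: Double, m2: Double) {
  // Fold one new observation into the running state (Welford's update).
  def update(x: Double): Moments = {
    val n1 = n + 1
    val delta = x - mean
    val mean1 = mean + delta / n1
    Moments(n1, mean1, m2 + delta * (x - mean1))
  }

  // Combine two partial states computed over disjoint sets of points.
  def merge(other: Moments): Moments =
    if (n == 0) other
    else if (other.n == 0) this
    else {
      val total = n + other.n
      val delta = other.mean - mean
      Moments(
        total,
        mean + delta * other.n / total,
        m2 + other.m2 + delta * delta * n * other.n / total)
    }

  // Second central moment (population variance). A NaN input makes the result NaN.
  def variance: Double = if (n == 0) Double.NaN else m2 / n
}

object Moments {
  val empty: Moments = Moments(0L, 0.0, 0.0)

  // Single pass over the data; no second pass is needed to subtract the mean.
  def of(xs: Seq[Double]): Moments = xs.foldLeft(empty)((acc, x) => acc.update(x))
}
```

In this sketch, `Moments.of(Seq(1.0, 2.0)).merge(Moments.of(Seq(3.0, 4.0)))` yields the same state as a single pass over all four points, which is the property a partial/final aggregation split relies on.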
The Collect aggregate function collects all seen expression values into a list of values.
The operator is bound to the slower sort-based aggregation path because the number of elements (and their memory usage) cannot be determined in advance. This also means that the collected elements are stored on the heap, and that too many elements can cause GC pauses and eventually out-of-memory errors.
Collect a list of elements.
Collect a list of unique elements.
Compute Pearson correlation between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
The definition of Pearson correlation can be found at http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
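The following is a minimal sketch of a one-pass Pearson correlation built from running co-moments. It is not the implementation used here: the object name `PearsonSketch`, the variable names, and the use of `None` to stand in for NULL on empty input are all assumptions made for illustration.

```scala
// Hypothetical sketch: Pearson correlation from running co-moments, None on empty input.
object PearsonSketch {
  def corr(pairs: Seq[(Double, Double)]): Option[Double] = {
    var n = 0L
    var xAvg = 0.0
    var yAvg = 0.0
    var ck = 0.0  // running sum of (x - mean(x)) * (y - mean(y))
    var xMk = 0.0 // running sum of squared deviations of x
    var yMk = 0.0 // running sum of squared deviations of y
    pairs.foreach { case (x, y) =>
      n += 1
      val dx = x - xAvg // deviation from the previous mean
      val dy = y - yAvg
      xAvg += dx / n
      yAvg += dy / n
      ck += dx * (y - yAvg) // previous-mean deviation times updated-mean deviation
      xMk += dx * (x - xAvg)
      yMk += dy * (y - yAvg)
    }
    // count == 0 maps to None, mirroring the NULL result on empty data
    if (n == 0) None else Some(ck / math.sqrt(xMk * yMk))
  }
}
```

For perfectly linear input such as `Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))`, the sketch returns a value of 1.0 up to floating-point rounding.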
Compute the covariance between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
API for aggregation functions that are expressed in terms of Catalyst expressions.
When implementing a new expression-based aggregate function, start by implementing bufferAttributes, defining attributes for the fields of the mutable aggregation buffer. You can then use these attributes when defining updateExpressions, mergeExpressions, and evaluateExpressions.
Please note that the children of an aggregate function can be unresolved (this happens when the function is created through the DataFrame API). So, if there are any fields in the implemented class that need to access fields of its children, please make those fields lazy vals.
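To make the lazy val advice concrete, here is a small self-contained sketch. The types `Expr`, `UnresolvedAttr`, `ResolvedAttr`, and `MyAgg` are hypothetical stand-ins and not the Catalyst classes; the only assumption carried over is that reading a property of an unresolved child fails.

```scala
// Hypothetical stand-ins for illustration; these are not the Catalyst classes.
trait Expr { def dataType: String }

final class UnresolvedAttr(name: String) extends Expr {
  // Reading the type of an unresolved child fails in this sketch.
  def dataType: String =
    throw new IllegalStateException(s"unresolved attribute: $name")
}

final class ResolvedAttr(val dataType: String) extends Expr

final class MyAgg(child: Expr) {
  // Deferred: the child's type is only read on first access, not at construction.
  lazy val resultType: String = child.dataType
}

object LazyValSketch {
  // Construction over an unresolved child succeeds because nothing has
  // touched child.dataType yet.
  val pending = new MyAgg(new UnresolvedAttr("x"))
  val ready = new MyAgg(new ResolvedAttr("double"))
}
```

Had `resultType` been a plain `val`, constructing `pending` would already throw; with `lazy val`, the failure only surfaces if the field is read before the child is resolved.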
Returns the first value of child for a group of rows. If the first value of child is null, it returns null (respecting nulls). Even if First is used on an already sorted column, its result will not be deterministic if we do partial aggregation and final aggregation (when mergeExpressions is used), unless the input table is sorted and has a single partition, and we use a single reducer to do the aggregation.
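The effect of merge order can be seen in a small sketch. It models partitions as plain sequences and uses `Option` for an empty partition; the names `partialFirst` and `mergeFirst` are hypothetical, and the model ignores null handling.

```scala
// Hypothetical sketch: "first" computed per partition, then merged in arrival order.
object FirstMergeSketch {
  // Partial aggregation: each partition reports its own first value.
  def partialFirst[A](partition: Seq[A]): Option[A] = partition.headOption

  // Final aggregation: keep the first partial result that arrives.
  def mergeFirst[A](partials: Seq[Option[A]]): Option[A] = partials.flatten.headOption

  // Globally sorted input split across two partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))

  // Partial results arrive in partition order.
  val inOrder: Option[Int] = mergeFirst(partitions.map(p => partialFirst(p)))

  // Partial results arrive in the opposite order.
  val reversed: Option[Int] = mergeFirst(partitions.reverse.map(p => partialFirst(p)))
}
```

Here `inOrder` is `Some(1)` while `reversed` is `Some(3)`: the input is sorted, yet the answer depends on which partial result reaches the final aggregation first.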
HyperLogLog++ (HLL++) is a state of the art cardinality estimation algorithm. This class implements the dense version of the HLL++ algorithm as an Aggregate Function.
This implementation is based on the following papers:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf
Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm, https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
Parameters: the expression to estimate the cardinality of, and the maximum estimation error allowed.
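The maximum estimation error determines how many registers a dense HyperLogLog structure needs. The sketch below uses the standard result from the original HyperLogLog analysis, that the relative standard error is about 1.04 / sqrt(m) for m = 2^p registers, to pick the smallest precision p that meets a target. The names `precisionFor`, `registers`, `expectedError`, and `maxError` are hypothetical, and this class may use a different constant or rounding rule.

```scala
// Hypothetical sketch: choosing the HyperLogLog precision from a target error,
// assuming the textbook bound error ~= 1.04 / sqrt(m) with m = 2^p registers.
object HllPrecisionSketch {
  // Smallest p such that 1.04 / sqrt(2^p) <= maxError.
  def precisionFor(maxError: Double): Int = {
    require(maxError > 0.0, "maxError must be positive")
    val registersNeeded = math.pow(1.04 / maxError, 2)
    math.max(0, math.ceil(math.log(registersNeeded) / math.log(2.0)).toInt)
  }

  // Number of registers for a given precision.
  def registers(p: Int): Int = 1 << p

  // Approximate relative standard error at a given precision.
  def expectedError(p: Int): Double = 1.04 / math.sqrt(registers(p).toDouble)
}
```

Under these assumptions a 5% target needs p = 9, that is 512 registers, and halving the error roughly quadruples the register count.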
API for aggregation functions that are expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers.
Within these functions, code should access fields of the mutable aggregation buffer by adding the bufferSchema-relative field number to mutableAggBufferOffset, then using this new field number to access the buffer Row. This is necessary because this aggregation function's buffer is embedded inside of a larger shared aggregation buffer when an aggregation operator evaluates multiple aggregate functions at the same time.
We need to perform similar field number arithmetic when merging multiple intermediate aggregate buffers together in merge() (in this case, use inputAggBufferOffset when accessing the input buffer).
Correct ImperativeAggregate evaluation depends on the correctness of mutableAggBufferOffset and inputAggBufferOffset, but not on the correctness of the attribute ids in aggBufferAttributes and inputAggBufferAttributes.
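The offset arithmetic described above can be sketched without any Catalyst types. In this illustration buffers are plain `Array[Double]` values rather than Rows, and `ImperativeSketch`, `SumSketch`, `CountSketch`, and the buffer layouts are hypothetical; only the two offset names come from the text.

```scala
// Hypothetical stand-in for an imperative aggregate; buffers are arrays, not Rows.
trait ImperativeSketch {
  def mutableAggBufferOffset: Int // start of this function's fields in the shared buffer
  def inputAggBufferOffset: Int   // start of this function's fields in an input buffer
  def update(buffer: Array[Double], value: Double): Unit
  def merge(buffer: Array[Double], input: Array[Double]): Unit
}

final class SumSketch(val mutableAggBufferOffset: Int, val inputAggBufferOffset: Int)
    extends ImperativeSketch {
  // Field 0 of this function's own schema, shifted into the shared buffer.
  def update(buffer: Array[Double], value: Double): Unit =
    buffer(mutableAggBufferOffset + 0) += value
  def merge(buffer: Array[Double], input: Array[Double]): Unit =
    buffer(mutableAggBufferOffset + 0) += input(inputAggBufferOffset + 0)
}

final class CountSketch(val mutableAggBufferOffset: Int, val inputAggBufferOffset: Int)
    extends ImperativeSketch {
  def update(buffer: Array[Double], value: Double): Unit =
    buffer(mutableAggBufferOffset + 0) += 1.0
  def merge(buffer: Array[Double], input: Array[Double]): Unit =
    buffer(mutableAggBufferOffset + 0) += input(inputAggBufferOffset + 0)
}

object OffsetSketch {
  // Shared mutable buffer layout: [sum, count].
  // Input buffer layout (assumed): [unrelated leading field, partial sum, partial count].
  val sum = new SumSketch(mutableAggBufferOffset = 0, inputAggBufferOffset = 1)
  val count = new CountSketch(mutableAggBufferOffset = 1, inputAggBufferOffset = 2)

  def run(): Array[Double] = {
    val shared = new Array[Double](2)
    for (v <- Seq(2.0, 3.0)) { sum.update(shared, v); count.update(shared, v) }
    val input = Array(99.0, 10.0, 4.0)
    sum.merge(shared, input)
    count.merge(shared, input)
    shared
  }
}
```

Each function only ever touches its own slice of either buffer, so the two can share one array even though the mutable and input layouts differ.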
Returns the last value of child for a group of rows. If the last value of child is null, it returns null (respecting nulls). Even if Last is used on an already sorted column, its result will not be deterministic if we do partial aggregation and final aggregation (when mergeExpressions is used), unless the input table is sorted and has a single partition, and we use a single reducer to do the aggregation.
PivotFirst is an aggregate function used in the second phase of a two-phase pivot to do the required rearrangement of values into pivoted form.
For example, on an input of
A | B
--+--
x | 1
y | 2
z | 3
with pivotColumn=A, valueColumn=B, and pivotColumnValues=[z,y], the output is [3,2].
pivotColumn: the column that determines which output position to put valueColumn in.
valueColumn: the column that is being rearranged.
pivotColumnValues: the list of pivotColumn values in the order of desired output. Values not listed here will be ignored.
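The rearrangement can be sketched in a few lines of plain Scala. The function `pivot` and the use of `Option` for positions that receive no value are assumptions made for illustration, not this class's actual buffer handling.

```scala
// Hypothetical sketch of the rearrangement: rows are (pivotColumn, valueColumn) pairs.
object PivotFirstSketch {
  def pivot[K, V](rows: Seq[(K, V)], pivotColumnValues: Seq[K]): Seq[Option[V]] = {
    // Output position for each requested pivot value.
    val index: Map[K, Int] = pivotColumnValues.zipWithIndex.toMap
    var out: Vector[Option[V]] = Vector.fill(pivotColumnValues.length)(None)
    rows.foreach { case (k, v) =>
      // Rows whose pivot value is not listed are ignored.
      index.get(k).foreach(i => out = out.updated(i, Some(v)))
    }
    out
  }
}
```

With the input above, `PivotFirstSketch.pivot(Seq("x" -> 1, "y" -> 2, "z" -> 3), Seq("z", "y"))` places 3 in the first position and 2 in the second, and the row for x is dropped.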
Constants used in the implementation of the HyperLogLogPlusPlus aggregate function.
See the Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm (https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen) for more information.
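AggregateFunction is the superclass of two aggregation function interfaces: ImperativeAggregate, for functions expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers, and DeclarativeAggregate, for functions expressed in terms of Catalyst expressions.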
In both interfaces, aggregates must define the schema (aggBufferSchema) and attributes (aggBufferAttributes) of an aggregation buffer which is used to hold partial aggregate results. At runtime, multiple aggregate functions are evaluated by the same operator using a combined aggregation buffer which concatenates the aggregation buffers of the individual aggregate functions.
Code which accepts AggregateFunction instances should be prepared to handle both types of aggregate functions.