aggregate

Type Members

sealed abstract class AggregateFunction extends Expression with ImplicitCastInputTypes

AggregateFunction is the superclass of two aggregation function interfaces:
- ImperativeAggregate is for aggregation functions that are specified in terms of initialize(), update(), and merge() functions that operate on Row-based aggregation buffers.
- DeclarativeAggregate is for aggregation functions that are specified using Catalyst expressions.
In both interfaces, aggregates must define the schema (aggBufferSchema) and attributes (aggBufferAttributes) of an aggregation buffer which is used to hold partial aggregate results. At runtime, multiple aggregate functions are evaluated by the same operator using a combined aggregation buffer which concatenates the aggregation buffers of the individual aggregate functions.
Code which accepts AggregateFunction instances should be prepared to handle both types of aggregate functions.
case class Average(child: Expression) extends DeclarativeAggregate with Product with Serializable

Annotations
@ExpressionDescription()
abstract class CentralMomentAgg extends DeclarativeAggregate

A central moment is the expected value of a specified power of the deviation of a random variable from the mean. Central moments are often used to characterize the shape of a distribution.
This class implements online, one-pass algorithms for computing the central moments of a set of points.
Behavior:
- null values are ignored
- returns Double.NaN when the column contains Double.NaN values
References:
- Xiangrui Meng. "Simpler Online Updates for Arbitrary-Order Central Moments." 2015. http://arxiv.org/abs/1510.04923
See also
Algorithms for calculating variance (Wikipedia)
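The one-pass behavior described above can be illustrated with a minimal sketch. It is not the code of CentralMomentAgg: it tracks only the second central moment using the well-known Welford update, and the names OnlineMoments, State, update, merge, and variancePop are hypothetical. None stands in for a SQL null.

```scala
// Illustrative sketch only; not Spark's CentralMomentAgg implementation.
object OnlineMoments {
  // Running state: count, mean, and m2 = sum of squared deviations from the mean.
  final case class State(n: Double, mean: Double, m2: Double)

  val empty: State = State(0.0, 0.0, 0.0)

  // Fold one value into the state; None models a null and is ignored.
  def update(s: State, value: Option[Double]): State = value match {
    case None => s
    case Some(x) =>
      val n = s.n + 1.0
      val delta = x - s.mean
      val mean = s.mean + delta / n
      State(n, mean, s.m2 + delta * (x - mean))
  }

  // Combine two partial states, as a merge step over partial results would.
  def merge(a: State, b: State): State =
    if (a.n == 0.0) b
    else if (b.n == 0.0) a
    else {
      val n = a.n + b.n
      val delta = b.mean - a.mean
      State(n, a.mean + delta * b.n / n, a.m2 + b.m2 + delta * delta * a.n * b.n / n)
    }

  // Population variance; None when no non-null value was seen.
  def variancePop(s: State): Option[Double] =
    if (s.n == 0.0) None else Some(s.m2 / s.n)
}
```

In this sketch a Double.NaN input propagates through the arithmetic, so the result is NaN, which matches the behavior listed above. Higher-order moments need additional running terms, as described in the referenced paper.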
abstract class Collect extends ImperativeAggregate

The Collect aggregate function collects all seen expression values into a list of values.
The operator is bound to the slower sort-based aggregation path because the number of elements (and their memory usage) cannot be determined in advance. This also means that the collected elements are stored on the heap, and that too many elements can cause GC pauses and eventually out-of-memory errors.
case class CollectList(child: Expression, mutableAggBufferOffset: Int = 0, inputAggBufferOffset: Int = 0) extends Collect with Product with Serializable

Collect a list of elements.

Annotations
@ExpressionDescription()
case class CollectSet(child: Expression, mutableAggBufferOffset: Int = 0, inputAggBufferOffset: Int = 0) extends Collect with Product with Serializable

Collect a set of unique elements.

Annotations
@ExpressionDescription()
case class Corr(x: Expression, y: Expression) extends DeclarativeAggregate with Product with Serializable

Compute Pearson correlation between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
Definition of Pearson correlation can be found at http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
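As a rough illustration of the definition and of the empty-input rule, the following sketch computes the coefficient with a plain two-pass formula. It is not how Corr is implemented; PearsonSketch and corr are hypothetical names, and None stands in for NULL.

```scala
// Illustrative two-pass sketch; not the Corr implementation.
object PearsonSketch {
  // Returns None (modelling NULL) when there are no input pairs.
  def corr(pairs: Seq[(Double, Double)]): Option[Double] =
    if (pairs.isEmpty) None
    else {
      val n = pairs.size.toDouble
      val meanX = pairs.map(_._1).sum / n
      val meanY = pairs.map(_._2).sum / n
      // Sums of products of deviations from the means.
      val coMoment = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val sqDevX = pairs.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
      val sqDevY = pairs.map { case (_, y) => (y - meanY) * (y - meanY) }.sum
      Some(coMoment / math.sqrt(sqDevX * sqDevY))
    }
}
```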

Annotations
@ExpressionDescription()
case class Count(children: Seq[Expression]) extends DeclarativeAggregate with Product with Serializable

Annotations
@ExpressionDescription()
case class CovPopulation(left: Expression, right: Expression) extends Covariance with Product with Serializable

Annotations
@ExpressionDescription()
case class CovSample(left: Expression, right: Expression) extends Covariance with Product with Serializable

Annotations
@ExpressionDescription()
abstract class Covariance extends DeclarativeAggregate

Compute the covariance between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
abstract class DeclarativeAggregate extends AggregateFunction with Serializable with Unevaluable

API for aggregation functions that are expressed in terms of Catalyst expressions.
When implementing a new expression-based aggregate function, start by implementing bufferAttributes, defining attributes for the fields of the mutable aggregation buffer. You can then use these attributes when defining updateExpressions, mergeExpressions, and evaluateExpressions.
Note that the children of an aggregate function can be unresolved (this happens when the function is created through the DataFrame API). So, if any fields in the implementing class need to access fields of its children, make those fields lazy vals.
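The shape of an expression-based aggregate (buffer fields first, then update, merge, and evaluate steps defined over those fields) can be sketched with a toy model. This does not use the Catalyst API: Buffer, ToyDeclarativeAggregate, and ToyAverage are hypothetical names, and ordinary Scala functions stand in for Catalyst expressions.

```scala
// Toy model of the declarative style; plain functions stand in for Catalyst expressions.
object DeclarativeSketch {
  // Hypothetical aggregation buffer with two fields.
  final case class Buffer(sum: Double, count: Long)

  // Each phase is a pure function over the buffer fields.
  trait ToyDeclarativeAggregate {
    def initial: Buffer
    def update(buffer: Buffer, input: Option[Double]): Buffer
    def merge(left: Buffer, right: Buffer): Buffer
    def evaluate(buffer: Buffer): Option[Double]
  }

  object ToyAverage extends ToyDeclarativeAggregate {
    val initial: Buffer = Buffer(0.0, 0L)

    // None models a null input, which leaves the buffer unchanged.
    def update(buffer: Buffer, input: Option[Double]): Buffer = input match {
      case Some(x) => Buffer(buffer.sum + x, buffer.count + 1)
      case None => buffer
    }

    // Merging partial buffers adds the fields pairwise.
    def merge(left: Buffer, right: Buffer): Buffer =
      Buffer(left.sum + right.sum, left.count + right.count)

    // The final result is derived from the buffer fields only.
    def evaluate(buffer: Buffer): Option[Double] =
      if (buffer.count == 0L) None else Some(buffer.sum / buffer.count)
  }
}
```

Defining every phase in terms of the same buffer fields is what lets partial results from different partitions be merged before the final evaluation.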
case class First(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate with Product with Serializable

Returns the first value of child for a group of rows. If the first value of child is null, it returns null (respecting nulls). Even if First is used on an already sorted column, its result is not deterministic when both partial aggregation and final aggregation are performed (that is, when mergeExpression is used), unless the input table is sorted, has a single partition, and a single reducer performs the aggregation.
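Why the merge step makes the result order-dependent can be seen in a small sketch. It assumes a simplified model in which each partition produces a partial "first" and the final step keeps whichever partial result arrives first; FirstMergeSketch, partialFirst, and finalFirst are hypothetical names.

```scala
// Simplified model of partial + final aggregation for a "first" aggregate.
object FirstMergeSketch {
  // Partial step: the first value seen in one partition (None for an empty partition).
  def partialFirst(partition: Seq[Int]): Option[Int] = partition.headOption

  // Final step: keep the first non-empty partial result in arrival order.
  def finalFirst(partialsInArrivalOrder: Seq[Option[Int]]): Option[Int] =
    partialsInArrivalOrder.flatten.headOption
}
```

With partitions Seq(1, 2) and Seq(3, 4), the final value is 1 if the first partition's partial result arrives first and 3 otherwise, even though the data as a whole is sorted.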

Annotations
@ExpressionDescription()
case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05, mutableAggBufferOffset: Int = 0, inputAggBufferOffset: Int = 0) extends ImperativeAggregate with Product with Serializable

HyperLogLog++ (HLL++) is a state-of-the-art cardinality estimation algorithm. This class implements the dense version of the HLL++ algorithm as an aggregate function.
This implementation is based on the following papers:
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
- HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf
- Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
child
to estimate the cardinality of.
relativeSD
the maximum estimation error allowed.
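To give a feel for how relativeSD translates into memory, the sketch below uses the textbook HyperLogLog error estimate of about 1.04 / sqrt(m) for m registers. This is an assumption for illustration: the constant and rounding used by this class may differ, the lower bound of 4 on the precision is a choice made for the sketch, and HllPrecisionSketch and its members are hypothetical names.

```scala
// Sketch based on the textbook HLL error estimate; not this class's exact formula.
object HllPrecisionSketch {
  // Smallest precision p (number of index bits) with 1.04 / sqrt(2^p) <= relativeSD.
  def precisionFor(relativeSD: Double): Int = {
    require(relativeSD > 0.0, "relativeSD must be positive")
    val registersNeeded = math.pow(1.04 / relativeSD, 2)
    val p = math.ceil(math.log(registersNeeded) / math.log(2.0)).toInt
    math.max(4, p) // assumed lower bound for this sketch
  }

  // Number of registers for a given precision.
  def registerCount(p: Int): Int = 1 << p

  // Expected relative error for a given precision under the textbook estimate.
  def expectedError(p: Int): Double = 1.04 / math.sqrt(registerCount(p).toDouble)
}
```

Under this estimate, halving relativeSD roughly quadruples the number of registers, which is the main cost of asking for a tighter bound.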

Annotations
@ExpressionDescription()
abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback

API for aggregation functions that are expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers.
Within these functions, code should access fields of the mutable aggregation buffer by adding the bufferSchema-relative field number to mutableAggBufferOffset then using this new field number to access the buffer Row. This is necessary because this aggregation function's buffer is embedded inside of a larger shared aggregation buffer when an aggregation operator evaluates multiple aggregate functions at the same time.
We need to perform similar field number arithmetic when merging multiple intermediate aggregate buffers together in merge() (in this case, use inputAggBufferOffset when accessing the input buffer).
Correct ImperativeAggregate evaluation depends on the correctness of mutableAggBufferOffset and inputAggBufferOffset, but not on the correctness of the attribute ids in aggBufferAttributes and inputAggBufferAttributes.
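The field number arithmetic described above can be sketched as follows. This is a simplified model, not the ImperativeAggregate API: an Array[Double] stands in for the buffer Row, and ToySumCount with its field constants is hypothetical. Only the two offset names are taken from the text.

```scala
// Simplified model: an Array[Double] stands in for the shared buffer Row.
object BufferOffsetSketch {
  // Hypothetical aggregate whose own buffer has two fields: [sum, count].
  final class ToySumCount(mutableAggBufferOffset: Int, inputAggBufferOffset: Int) {
    // Field numbers relative to this aggregate's own buffer schema.
    private val SumField = 0
    private val CountField = 1

    def initialize(buffer: Array[Double]): Unit = {
      buffer(mutableAggBufferOffset + SumField) = 0.0
      buffer(mutableAggBufferOffset + CountField) = 0.0
    }

    def update(buffer: Array[Double], value: Double): Unit = {
      buffer(mutableAggBufferOffset + SumField) += value
      buffer(mutableAggBufferOffset + CountField) += 1.0
    }

    // The mutable buffer uses mutableAggBufferOffset; the input buffer uses inputAggBufferOffset.
    def merge(buffer: Array[Double], input: Array[Double]): Unit = {
      buffer(mutableAggBufferOffset + SumField) += input(inputAggBufferOffset + SumField)
      buffer(mutableAggBufferOffset + CountField) += input(inputAggBufferOffset + CountField)
    }
  }
}
```

Two instances created with offsets 0 and 2 can then share a single four-slot buffer without overwriting each other's fields.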
case class Kurtosis(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()
case class Last(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate with Product with Serializable

Returns the last value of child for a group of rows. If the last value of child is null, it returns null (respecting nulls). Even if Last is used on an already sorted column, its result is not deterministic when both partial aggregation and final aggregation are performed (that is, when mergeExpression is used), unless the input table is sorted, has a single partition, and a single reducer performs the aggregation.

Annotations
@ExpressionDescription()
case class Max(child: Expression) extends DeclarativeAggregate with Product with Serializable

Annotations
@ExpressionDescription()
case class Min(child: Expression) extends DeclarativeAggregate with Product with Serializable

Annotations
@ExpressionDescription()
case class PivotFirst(pivotColumn: Expression, valueColumn: Expression, pivotColumnValues: Seq[Any], mutableAggBufferOffset: Int = 0, inputAggBufferOffset: Int = 0) extends ImperativeAggregate with Product with Serializable

PivotFirst is an aggregate function used in the second phase of a two-phase pivot to do the required rearrangement of values into pivoted form.
For example, on an input of
  A | B
  --+--
  x | 1
  y | 2
  z | 3
with pivotColumn=A, valueColumn=B, and pivotColumnValues=[z,y] the output is [3,2].
pivotColumn
column that determines which output position to put valueColumn in.
valueColumn
the column that is being rearranged.
pivotColumnValues
the list of pivotColumn values in the order of desired output. Values not listed here will be ignored.
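The rearrangement can be sketched over plain Scala collections. This is not the PivotFirst implementation: PivotSketch and pivotFirst are hypothetical names, keeping the first value seen per key is an assumption of the sketch, and None marks an output position for which no row was seen.

```scala
// Sketch of the rearrangement over plain collections; not the PivotFirst implementation.
object PivotSketch {
  // rows holds (pivotColumn value, valueColumn value) pairs.
  def pivotFirst[K, V](rows: Seq[(K, V)], pivotColumnValues: Seq[K]): Seq[Option[V]] = {
    // Keep the first value seen for each pivot key (assumption of this sketch).
    val firstByKey = rows.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      if (acc.contains(k)) acc else acc + (k -> v)
    }
    // Emit values in the requested order; keys not listed are simply never looked up.
    pivotColumnValues.map(firstByKey.get)
  }
}
```

For the input above, PivotSketch.pivotFirst(Seq("x" -> 1, "y" -> 2, "z" -> 3), Seq("z", "y")) yields Seq(Some(3), Some(2)), and the row for x is ignored.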
case class Skewness(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()
case class StddevPop(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()
case class StddevSamp(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()
case class Sum(child: Expression) extends DeclarativeAggregate with Product with Serializable

Annotations
@ExpressionDescription()
case class VariancePop(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()
case class VarianceSamp(child: Expression) extends CentralMomentAgg with Product with Serializable

Annotations
@ExpressionDescription()

Value Members

object AggregateExpression extends Serializable
object Count extends Serializable
object HyperLogLogPlusPlus extends Serializable

Constants used in the implementation of the HyperLogLogPlusPlus aggregate function.
See the Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm (https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen) for more information.
object PivotFirst extends Serializable