aggregate

Type Members

abstract class AggregationIterator extends Iterator[UnsafeRow] with Logging

The base class of SortBasedAggregationIterator and TungstenAggregationIterator. It has two main responsibilities: 1. It initializes the aggregate functions. 2. It creates two functions, processRow and generateOutput, based on the AggregateMode of its aggregate functions. processRow handles a single input row. generateOutput produces the result row.
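The way processRow and generateOutput depend on the aggregate mode can be illustrated with a small sketch. It is a simplified model and not Spark's actual code: the object AggregationModes, the two-mode Mode trait, and the use of Array[Double] as a row are all assumptions made for illustration, with an average whose buffer holds a sum and a count.

```scala
// Hypothetical, simplified model of mode-dependent row processing.
// None of these names exist in Spark; rows are plain Array[Double].
object AggregationModes {
  sealed trait Mode
  case object Partial extends Mode // consumes raw input rows, emits the buffer
  case object Final extends Mode   // consumes partial buffers, emits the result

  // Buffer layout for an average: buffer(0) = sum, buffer(1) = count.
  def processRow(mode: Mode): (Array[Double], Array[Double]) => Unit = mode match {
    case Partial => (buffer, input) => { buffer(0) += input(0); buffer(1) += 1.0 }
    case Final   => (buffer, input) => { buffer(0) += input(0); buffer(1) += input(1) }
  }

  def generateOutput(mode: Mode): Array[Double] => Array[Double] = mode match {
    case Partial => buffer => buffer.clone()               // pass the buffer downstream
    case Final   => buffer => Array(buffer(0) / buffer(1)) // evaluate the average
  }

  // Feeds every row through processRow, then calls generateOutput once.
  def run(mode: Mode, rows: Seq[Array[Double]]): Array[Double] = {
    val buffer = Array(0.0, 0.0)
    val update = processRow(mode)
    rows.foreach(row => update(buffer, row))
    generateOutput(mode)(buffer)
  }
}
```

Both functions are chosen once, up front, so the per-row loop in run does not branch on the mode.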
sealed trait BufferSetterGetterUtils extends AnyRef

A helper trait used to create specialized setters and getters for the types supported by the buffer of org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap (see UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema).
case class HashAggregateExec(requiredChildDistributionExpressions: Option[Seq[Expression]], groupingExpressions: Seq[NamedExpression], aggregateExpressions: Seq[AggregateExpression], aggregateAttributes: Seq[Attribute], initialInputBufferOffset: Int, resultExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

Hash-based aggregate operator that can also fall back to sorting when the data exceeds the available memory.
abstract class HashMapGenerator extends AnyRef

This is a helper class to generate an append-only row-based hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the BytesToBytesMap if a given key isn't found). The map is code-generated in HashAggregate to speed up aggregates with a key.
NOTE: the generated hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
class RowBasedHashMapGenerator extends HashMapGenerator

This is a helper class to generate an append-only row-based hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the BytesToBytesMap if a given key isn't found). The map is code-generated in HashAggregate to speed up aggregates with a key.
There is also VectorizedHashMapGenerator, which generates an append-only vectorized hash map. One of the two is chosen as the first-level, fast hash map during aggregation.
NOTE: This row-based hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
case class ScalaUDAF(children: Seq[Expression], udaf: UserDefinedAggregateFunction, mutableAggBufferOffset: Int = 0, inputAggBufferOffset: Int = 0) extends ImperativeAggregate with NonSQLExpression with Logging with Product with Serializable

The internal wrapper used to hook a UserDefinedAggregateFunction udaf in the internal aggregation code path.
case class SortAggregateExec(requiredChildDistributionExpressions: Option[Seq[Expression]], groupingExpressions: Seq[NamedExpression], aggregateExpressions: Seq[AggregateExpression], aggregateAttributes: Seq[Attribute], initialInputBufferOffset: Int, __resultExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

Sort-based aggregate operator.
class SortBasedAggregationIterator extends AggregationIterator

An iterator used to evaluate AggregateFunction. It assumes the input rows have been sorted by values of groupingExpressions.
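Because the input is assumed to be sorted by the grouping values, a group is complete as soon as the key changes, so only one aggregation buffer is needed at a time. The following is a minimal sketch of that idea and not Spark's implementation: the object SortBasedSketch, the String keys, and the Long sum are assumptions for illustration.

```scala
// Hypothetical sketch: sums values per key over input already sorted by key.
object SortBasedSketch {
  def aggregate(sorted: Iterator[(String, Long)]): Iterator[(String, Long)] =
    new Iterator[(String, Long)] {
      private val input = sorted.buffered

      def hasNext: Boolean = input.hasNext

      def next(): (String, Long) = {
        val key = input.head._1 // throws NoSuchElementException on empty input
        var sum = 0L
        // Consume rows until the key changes; the group is then complete.
        while (input.hasNext && input.head._1 == key) sum += input.next()._2
        (key, sum)
      }
    }
}
```

If the input is not actually sorted, the same key is emitted more than once, which is why the sortedness assumption matters.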
class TungstenAggregationIterator extends AggregationIterator with Logging

An iterator used to evaluate aggregate functions. It operates on UnsafeRows.
This iterator first uses hash-based aggregation to process input rows. It uses a hash map to store groups and their corresponding aggregation buffers. If this map cannot allocate memory from the memory manager, the iterator spills the map to disk and creates a new one. After all the input has been processed, it merges all the spills together using an external sorter and switches to sort-based aggregation.
The process has the following steps:
- Step 0: Do hash-based aggregation.
- Step 1: Sort all entries of the hash map based on values of grouping expressions and spill them to disk.
- Step 2: Create an external sorter based on the spilled sorted map entries and reset the map.
- Step 3: Get a sorted KVIterator from the external sorter.
- Step 4: Repeat step 0 until no more input.
- Step 5: Initialize sort-based aggregation on the sorted iterator. Then, this iterator works in the way of sort-based aggregation.
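The steps above can be sketched in plain Scala. This is a simplified model under stated assumptions, not the iterator's real code: the object SpillingAggregationSketch is hypothetical, maxGroups stands in for the memory limit, in-memory sorted vectors stand in for spill files, and a plain sort over the concatenated runs stands in for the external sorter.

```scala
import scala.collection.mutable

// Hypothetical sketch of hash aggregation that falls back to sort-based merging.
object SpillingAggregationSketch {
  def aggregate(input: Iterator[(String, Long)], maxGroups: Int): List[(String, Long)] = {
    val map = mutable.HashMap.empty[String, Long]
    val spills = mutable.ArrayBuffer.empty[Vector[(String, Long)]]

    for ((key, value) <- input) {
      if (!map.contains(key) && map.size >= maxGroups) {
        // Steps 1 and 2: sort the map entries by key, "spill" them, reset the map.
        spills += map.toVector.sortBy(_._1)
        map.clear()
      }
      // Step 0: hash-based aggregation into the current map.
      map(key) = map.getOrElse(key, 0L) + value
    }

    if (spills.isEmpty) {
      // Everything fit in the map. Sorted here only for deterministic output.
      map.toList.sortBy(_._1)
    } else {
      spills += map.toVector.sortBy(_._1)
      // Step 3: obtain one key-ordered sequence from all sorted runs.
      val merged = spills.flatten.sortBy(_._1)
      // Step 5: sort-based aggregation, merging partial sums of adjacent equal keys.
      merged.foldLeft(List.empty[(String, Long)]) {
        case ((k, s) :: rest, (key, v)) if k == key => (k, s + v) :: rest
        case (acc, kv)                              => kv :: acc
      }.reverse
    }
  }
}
```

The same key can appear in several spills with different partial sums, which is why the final pass merges buffers rather than re-reading the raw input.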
The code of this class is organized as follows:
- Part 1: Initializing aggregate functions.
- Part 2: Methods and fields used for setting aggregation buffer values, processing input rows from inputIter, and generating output rows.
- Part 3: Methods and fields used by hash-based aggregation.
- Part 4: Methods and fields used when we switch to sort-based aggregation.
- Part 5: Methods and fields used by sort-based aggregation.
- Part 6: Loading input and processing input rows.
- Part 7: Public methods of this iterator.
- Part 8: A utility function used to generate a result when there is no input and there is no grouping expression.
case class TypedAggregateExpression(aggregator: expressions.Aggregator[Any, Any, Any], inputDeserializer: Option[Expression], inputClass: Option[Class[_]], inputSchema: Option[StructType], bufferSerializer: Seq[NamedExpression], bufferDeserializer: Expression, outputSerializer: Seq[Expression], outputExternalType: DataType, dataType: DataType, nullable: Boolean) extends DeclarativeAggregate with NonSQLExpression with Product with Serializable

A helper class to hook Aggregator into the aggregation system.
class TypedAverage[IN] extends expressions.Aggregator[IN, (Double, Long), Double]
class TypedCount[IN] extends expressions.Aggregator[IN, Long, Long]
class TypedSumDouble[IN] extends expressions.Aggregator[IN, Double, Double]
class TypedSumLong[IN] extends expressions.Aggregator[IN, Long, Long]
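The three type parameters of these classes are the input type, the buffer type, and the output type. As an illustration of why an average would use a (Double, Long) buffer, here is a minimal sketch that does not depend on Spark: SimpleAggregator is a hypothetical stand-in that only mirrors the zero/reduce/merge/finish shape of an Aggregator, and the constructor argument f is an assumption.

```scala
// Hypothetical sketch; SimpleAggregator is not a Spark type.
object TypedAverageSketch {
  trait SimpleAggregator[IN, BUF, OUT] {
    def zero: BUF                    // initial buffer
    def reduce(b: BUF, a: IN): BUF   // fold one input into a buffer
    def merge(b1: BUF, b2: BUF): BUF // combine two partial buffers
    def finish(b: BUF): OUT          // turn the final buffer into the result
  }

  // The buffer is (sum, count); f extracts the number to average from each input.
  class Average[IN](f: IN => Double) extends SimpleAggregator[IN, (Double, Long), Double] {
    def zero: (Double, Long) = (0.0, 0L)
    def reduce(b: (Double, Long), a: IN): (Double, Long) = (b._1 + f(a), b._2 + 1L)
    def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
      (b1._1 + b2._1, b1._2 + b2._2)
    def finish(b: (Double, Long)): Double = b._1 / b._2
  }
}
```

Keeping the sum and the count separate until finish is what lets partial buffers from different partitions be merged correctly.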
class VectorizedHashMapGenerator extends HashMapGenerator

This is a helper class to generate an append-only vectorized hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the BytesToBytesMap if a given key isn't found). The map is code-generated in HashAggregate to speed up aggregates with a key.
It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small maximum number of tries) and use an inexpensive hash function, which makes them very efficient for the majority of lookups. However, linear probing combined with an inexpensive hash function also makes the map less robust than the BytesToBytesMap (especially for a large number of keys, or even for certain distributions of keys), so it must fall back on the latter for correctness. A secondary columnar batch logically projects over the original columnar batch and is equivalent to the BytesToBytesMap aggregate buffer.
NOTE: This vectorized hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
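The combination of a power-of-2-sized index array, bounded linear probing, and a fallback map can be sketched as follows. This is an illustrative model, not the generated code: the class FastCountMap, its parameters, the Long keys, and the use of a mutable.HashMap in place of the BytesToBytesMap are all assumptions.

```scala
import scala.collection.mutable

// Hypothetical sketch of an append-only fast map with bounded linear probing.
final class FastCountMap(capacity: Int, maxProbes: Int) {
  require(capacity > 0 && (capacity & (capacity - 1)) == 0, "capacity must be a power of 2")

  private val keys = new Array[Long](capacity)
  private val counts = new Array[Long](capacity)
  private val used = new Array[Boolean](capacity)
  private val mask = capacity - 1

  // Stand-in for the BytesToBytesMap that receives keys the fast map rejects.
  val fallback = mutable.HashMap.empty[Long, Long]

  def increment(key: Long): Unit = {
    var idx = key.toInt & mask // deliberately cheap hash: the low bits of the key
    var probes = 0
    while (probes < maxProbes) {
      if (!used(idx)) { used(idx) = true; keys(idx) = key; counts(idx) = 1L; return }
      if (keys(idx) == key) { counts(idx) += 1L; return }
      idx = (idx + 1) & mask // linear probing; the mask wraps around the array
      probes += 1
    }
    // Probe limit reached: give up on the fast map and use the fallback.
    fallback(key) = fallback.getOrElse(key, 0L) + 1L
  }

  def get(key: Long): Long = {
    var idx = key.toInt & mask
    var probes = 0
    while (probes < maxProbes && used(idx)) {
      if (keys(idx) == key) return counts(idx)
      idx = (idx + 1) & mask
      probes += 1
    }
    fallback.getOrElse(key, 0L)
  }
}
```

Because slots are never freed, a key that once overflowed to the fallback keeps landing there, so each key lives in exactly one of the two maps.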

Value Members

object AggUtils

Utility functions used by the query planner to convert a plan to the new aggregation code path.
object HashAggregateExec extends Serializable
object TypedAggregateExpression extends Serializable
