Class

org.apache.spark.sql.catalyst.analysis

DistinctAggregationRewriter

case class DistinctAggregationRewriter(conf: CatalystConf) extends Rule[LogicalPlan] with Product with Serializable

This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and each distinct clause are aggregated in separate groups. The results are then combined in a second aggregate.

For example (in Scala):

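// Assumes a spark-shell style session in which `sqlContext` is in scope.
import org.apache.spark.sql.functions.{countDistinct, sum}
import sqlContext.implicits._  // enables toDF and the $"col" syntax

// Three input rows: a grouping key, two categorical columns and a numeric value.
val data = Seq(
  ("a", "ca1", "cb1", 10),
  ("a", "ca1", "cb2", 5),
  ("b", "ca1", "cb1", 13))
  .toDF("key", "cat1", "cat2", "value")
data.registerTempTable("data")

// Two distinct aggregates and one regular aggregate per key.
val agg = data.groupBy($"key")
  .agg(
    countDistinct($"cat1").as("cat1_cnt"),
    countDistinct($"cat2").as("cat2_cnt"),
    sum($"value").as("total"))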

This translates to the following (pseudo) logical plan:

Aggregate(
   key = ['key]
   functions = [COUNT(DISTINCT 'cat1),
                COUNT(DISTINCT 'cat2),
                sum('value)]
   output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  LocalTableScan [...]

This rule rewrites this logical plan to the following (pseudo) logical plan:

Aggregate(
   key = ['key]
   functions = [count(if (('gid = 1)) 'cat1 else null),
                count(if (('gid = 2)) 'cat2 else null),
                first(if (('gid = 0)) 'total else null) ignore nulls]
   output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  Aggregate(
     key = ['key, 'cat1, 'cat2, 'gid]
     functions = [sum('value)]
     output = ['key, 'cat1, 'cat2, 'gid, 'total])
    Expand(
       projections = [('key, null, null, 0, cast('value as bigint)),
                      ('key, 'cat1, null, 1, null),
                      ('key, null, 'cat2, 2, null)]
       output = ['key, 'cat1, 'cat2, 'gid, 'value])
      LocalTableScan [...]

The rule does the following things here:

  1. Expand the data. There are three aggregation groups in this query:
       i. the non-distinct group;
       ii. the distinct 'cat1 group;
       iii. the distinct 'cat2 group.
     An expand operator is inserted to expand the child data for each group. The expand nulls out all columns that the given group does not use; this is required for correctness in the later steps. Groups are identified by a group id (gid) column added by the expand operator.
  2. De-duplicate the distinct paths and aggregate the non-distinct path. The group-by clause of this aggregate consists of the original group-by clause, all the requested distinct columns, and the group id. Both the de-duplication of the distinct columns and the aggregation of the non-distinct group rely on the fact that we group by the group id (gid) and that all columns not relevant to a given group have been nulled out.
  3. Aggregate the distinct groups and combine the result with the results of the non-distinct aggregation. In this step the group id is used to filter the inputs of the aggregate functions. The result of the non-distinct group is 'aggregated' using the first operator; it might be more elegant to use the native UDAF merge mechanism for this in the future.
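
The three steps can be traced on the example rows using plain Scala collections. This is a minimal sketch for illustration only, not Spark's implementation: the names `DistinctRewriteSketch`, `Expanded`, `expand`, `firstAgg` and `secondAgg` are hypothetical, and `Option` stands in for SQL NULL.

```scala
// Plain-Scala sketch of the rewrite; no Spark dependency. None models SQL NULL.
object DistinctRewriteSketch {
  // One expanded row: (key, cat1, cat2, gid, value)
  final case class Expanded(key: String, cat1: Option[String], cat2: Option[String],
                            gid: Int, value: Option[Long])

  val input: Seq[(String, String, String, Int)] = Seq(
    ("a", "ca1", "cb1", 10),
    ("a", "ca1", "cb2", 5),
    ("b", "ca1", "cb1", 13))

  // Step 1: emit one projection per aggregation group, nulling unused columns.
  def expand(rows: Seq[(String, String, String, Int)]): Seq[Expanded] =
    rows.flatMap { case (k, c1, c2, v) =>
      Seq(
        Expanded(k, None, None, 0, Some(v.toLong)), // non-distinct group
        Expanded(k, Some(c1), None, 1, None),       // distinct cat1 group
        Expanded(k, None, Some(c2), 2, None))       // distinct cat2 group
    }

  // Step 2: group by (key, cat1, cat2, gid) and sum value.
  // Grouping de-duplicates the distinct paths; the gid 0 row carries the total.
  def firstAgg(rows: Seq[Expanded]): Seq[Expanded] =
    rows.groupBy(r => (r.key, r.cat1, r.cat2, r.gid)).toSeq.map {
      case ((k, c1, c2, gid), grp) =>
        val vals = grp.flatMap(_.value)
        // A sum over only NULL inputs stays NULL.
        Expanded(k, c1, c2, gid, if (vals.isEmpty) None else Some(vals.sum))
    }

  // Step 3: group by key; the gid routes each row to the matching function.
  // Returns key -> (cat1_cnt, cat2_cnt, total).
  def secondAgg(rows: Seq[Expanded]): Map[String, (Long, Long, Option[Long])] =
    rows.groupBy(_.key).map { case (k, grp) =>
      val cat1Cnt = grp.count(r => r.gid == 1 && r.cat1.isDefined).toLong
      val cat2Cnt = grp.count(r => r.gid == 2 && r.cat2.isDefined).toLong
      // Equivalent of first(...) ignore nulls over the gid = 0 rows.
      val total = grp.collectFirst { case r if r.gid == 0 && r.value.isDefined => r.value.get }
      k -> (cat1Cnt, cat2Cnt, total)
    }

  val result: Map[String, (Long, Long, Option[Long])] =
    secondAgg(firstAgg(expand(input)))
}
```

For the sample data, `DistinctRewriteSketch.result` maps "a" to (1, 2, Some(15)) and "b" to (1, 1, Some(13)), which is what the original single aggregate would return for cat1_cnt, cat2_cnt and total.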

This rule duplicates the input data two or more times (once per distinct group, plus once for the optional non-distinct group). This puts considerable memory pressure on the aggregate and exchange operators involved. Keeping the number of distinct groups as low as possible should be a priority; the current rule could be improved by applying more advanced expression canonicalization techniques.
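
The duplication factor follows directly from the number of aggregation groups. As a rough sketch of that arithmetic (the object `ExpandCost` and the function `expandedRowCount` are hypothetical helpers, not part of Spark):

```scala
object ExpandCost {
  // Rows produced by the expand operator: every input row is copied once
  // per aggregation group (each distinct group, plus the non-distinct
  // group when the query has regular aggregates).
  def expandedRowCount(inputRows: Long,
                       distinctGroups: Int,
                       hasNonDistinctGroup: Boolean): Long = {
    val groups = distinctGroups + (if (hasNonDistinctGroup) 1 else 0)
    inputRows * groups
  }
}
```

For the example above (3 input rows, 2 distinct groups and a non-distinct group), `ExpandCost.expandedRowCount(3L, 2, true)` evaluates to 9, matching the three projections emitted per input row.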

Linear Supertypes
Serializable, Serializable, Product, Equals, Rule[LogicalPlan], Logging, AnyRef, Any

Instance Constructors

  1. new DistinctAggregationRewriter(conf: CatalystConf)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def apply(plan: LogicalPlan): LogicalPlan

    Definition Classes
    DistinctAggregationRewriter → Rule
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. val conf: CatalystConf

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  11. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  12. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  13. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  14. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  15. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  16. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  17. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  18. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  19. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  20. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  21. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  26. final def notify(): Unit

    Definition Classes
    AnyRef
  27. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  28. def rewrite(a: Aggregate): Aggregate

  29. val ruleName: String


    Name for this rule, automatically inferred based on class name.

    Definition Classes
    Rule
  30. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  31. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  32. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
