Provides a logical query plan analyzer, which translates UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog and a FunctionRegistry.
An interface for looking up relations by name. Used by an Analyzer.
Throws user facing errors when passed invalid queries that fail to analyze.
This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group. The results are then combined in a second aggregate.
For example (in scala):
  val data = Seq(
      ("a", "ca1", "cb1", 10),
      ("a", "ca1", "cb2", 5),
      ("b", "ca1", "cb1", 13))
    .toDF("key", "cat1", "cat2", "value")
  data.registerTempTable("data")

  val agg = data.groupBy($"key")
    .agg(
      countDistinct($"cat1").as("cat1_cnt"),
      countDistinct($"cat2").as("cat2_cnt"),
      sum($"value").as("total"))
This translates to the following (pseudo) logical plan:
  Aggregate(
     key = ['key]
     functions = [COUNT(DISTINCT 'cat1),
                  COUNT(DISTINCT 'cat2),
                  sum('value)]
     output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
  Aggregate(
     key = ['key]
     functions = [count(if (('gid = 1)) 'cat1 else null),
                  count(if (('gid = 2)) 'cat2 else null),
                  first(if (('gid = 0)) 'total else null) ignore nulls]
     output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    Aggregate(
       key = ['key, 'cat1, 'cat2, 'gid]
       functions = [sum('value)]
       output = ['key, 'cat1, 'cat2, 'gid, 'total])
      Expand(
         projections = [('key, null, null, 0, cast('value as bigint)),
                        ('key, 'cat1, null, 1, null),
                        ('key, null, 'cat2, 2, null)]
         output = ['key, 'cat1, 'cat2, 'gid, 'value])
        LocalTableScan [...]
The rule does the following things here:
1. Expand the data. There are three aggregation groups in this query: the non-distinct group, the distinct 'cat1 group, and the distinct 'cat2 group. An Expand operator duplicates the input for each group, nulls out the columns the group does not use, and tags each copy with a group id ('gid) column.
2. Aggregate by the original grouping key, the distinct columns, and 'gid. This de-duplicates the distinct columns and aggregates 'value for the non-distinct group.
3. Aggregate by the original grouping key alone, using 'gid to route each row to the correct aggregate function, and combine the results.
This rule multiplies the input data by two or more times (the number of distinct groups plus an optional non-distinct group), which puts significant memory pressure on the aggregate and exchange operators involved. Keeping the number of distinct groups as low as possible should therefore be a priority; we could improve this rule by applying more advanced expression canonicalization techniques.
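The rewrite can be illustrated without Spark. The sketch below is a plain-Python simulation of the three steps on the sample data from the example above; the names (`expanded`, `partial`, `result`) are illustrative, not Catalyst API:

```python
from collections import defaultdict

data = [("a", "ca1", "cb1", 10), ("a", "ca1", "cb2", 5), ("b", "ca1", "cb1", 13)]

# Step 1: Expand. gid 0 keeps the value for the non-distinct sum; gids 1 and 2
# keep only the column of their distinct clause, nulling everything else out.
expanded = []
for key, cat1, cat2, value in data:
    expanded.append((key, None, None, 0, value))   # non-distinct group
    expanded.append((key, cat1, None, 1, None))    # DISTINCT cat1 group
    expanded.append((key, None, cat2, 2, None))    # DISTINCT cat2 group

# Step 2: first aggregate, grouped by (key, cat1, cat2, gid); this
# de-duplicates the distinct columns and sums the non-distinct group.
partial = defaultdict(int)
for key, cat1, cat2, gid, value in expanded:
    partial[(key, cat1, cat2, gid)] += value or 0

# Step 3: second aggregate, grouped by key alone; gid selects which rows
# feed each aggregate function.
result = defaultdict(lambda: [0, 0, 0])  # cat1_cnt, cat2_cnt, total
for (key, cat1, cat2, gid), total in partial.items():
    if gid == 1:
        result[key][0] += 1           # count(if (gid = 1) cat1)
    elif gid == 2:
        result[key][1] += 1           # count(if (gid = 2) cat2)
    else:
        result[key][2] += total       # first(if (gid = 0) total)

print(dict(result))  # {'a': [1, 2, 15], 'b': [1, 1, 13]}
```

Note how de-duplication in step 2 falls out of grouping by the distinct column together with 'gid: the two ("a", "ca1") rows from the DISTINCT cat1 group collapse into one.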
A catalog for looking up user defined functions, used by an Analyzer.
Used to assign new names to a Generator's output, such as a Hive UDTF. For example, the SQL expression "stack(2, key, value, key, value) as (a, b)" could be represented as follows: MultiAlias(stack_function, Seq(a, b))
the computation being performed
the names to be associated with each output of computing the child expression.
A trait that should be mixed into query operators where a single instance might appear multiple times in a logical query plan. It is invalid to have multiple copies of the same attribute produced by distinct operators in a query tree, as this breaks the guarantee that expression ids, which are used to differentiate attributes, are unique.
During analysis, an operator that includes this trait may be asked to produce a new version of itself with globally unique expression ids.
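The "new version with fresh ids" mechanism can be sketched as follows. This is a hypothetical model, not Spark's API: a global counter hands out unique expression ids, and `new_instance()` re-creates the operator's output attributes so that, say, the two sides of a self-join do not share ids:

```python
import itertools

# Global source of unique expression ids (illustrative).
_next_id = itertools.count()

class Attribute:
    def __init__(self, name, expr_id=None):
        self.name = name
        self.expr_id = expr_id if expr_id is not None else next(_next_id)

class Relation:
    def __init__(self, attributes):
        self.output = attributes

    def new_instance(self):
        # Same names, globally unique expression ids.
        return Relation([Attribute(a.name) for a in self.output])

t = Relation([Attribute("key"), Attribute("value")])
t2 = t.new_instance()  # e.g. the second occurrence in a self-join
assert [a.name for a in t2.output] == ["key", "value"]
assert all(a.expr_id != b.expr_id for a, b in zip(t.output, t2.output))
```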
Thrown by a catalog when a table cannot be found. The analyzer will rethrow the exception as an AnalysisException with the correct position information.
A trait that can be mixed in with other Catalogs, allowing specific tables to be overridden with new logical plans. This can be used to bind query results to virtual tables, or to replace tables with in-memory cached versions. Note that the set of overrides is stored in memory and is thus lost when the JVM exits.
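The override mechanism amounts to consulting an in-memory map before delegating to the underlying catalog. A hypothetical sketch (class and method names are illustrative, not Catalyst's):

```python
class SimpleCatalog:
    """Base catalog: a fixed mapping from table name to logical plan."""
    def __init__(self, tables):
        self.tables = tables

    def lookup_relation(self, name):
        return self.tables[name]

class OverridingCatalog(SimpleCatalog):
    """Mixes in an in-memory override map, lost when the process exits."""
    def __init__(self, tables):
        super().__init__(tables)
        self.overrides = {}

    def register_table(self, name, plan):
        self.overrides[name] = plan

    def lookup_relation(self, name):
        # Prefer the override (e.g. an in-memory cached version).
        if name in self.overrides:
            return self.overrides[name]
        return super().lookup_relation(name)

catalog = OverridingCatalog({"t": "scan(t)"})
catalog.register_table("t", "inMemoryScan(t)")
print(catalog.lookup_relation("t"))  # inMemoryScan(t)
```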
Represents all the resolved input attributes to a given relational operator.
Represents all the resolved input attributes to a given relational operator. This is used in the DataFrame DSL.
Expressions to expand.
A Resolver should return true if the first string refers to the same entity as the second string, for example by using case-insensitive equality.
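A Resolver is just an equality predicate on names. Two plausible implementations, mirroring case-sensitive and case-insensitive resolution:

```python
def case_sensitive_resolver(a: str, b: str) -> bool:
    # Names match only if they are byte-for-byte identical.
    return a == b

def case_insensitive_resolver(a: str, b: str) -> bool:
    # Names match regardless of case, e.g. "colA" and "cola".
    return a.lower() == b.lower()

assert case_insensitive_resolver("colA", "cola")
assert not case_sensitive_resolver("colA", "cola")
```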
Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...". A Star gets automatically expanded during analysis.
Represents the result of Expression.checkInputDataTypes.
We will throw an AnalysisException in CheckAnalysis if isFailure is true.
Holds the expression that has yet to be aliased.
Holds the name of an attribute that has yet to be resolved.
Thrown when an invalid attempt is made to access a property of a tree that has yet to be fully resolved.
Extracts a value or values from an Expression.
The expression to extract a value from; can be a Map, Array, Struct, or array of Structs.
The expression describing the extraction; can be the key of a Map, the ordinal of an Array, or the field name of a Struct.
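The three extraction shapes can be modeled with plain Python values (a dict standing in for both Map and Struct here; this is an illustrative sketch, not Catalyst's ExtractValue implementation):

```python
def extract_value(child, extraction):
    if isinstance(child, dict):
        # Map key or Struct field name.
        return child.get(extraction)
    if isinstance(child, (list, tuple)):
        if isinstance(extraction, int):
            # Array ordinal.
            return child[extraction]
        # Array of Structs: extract the field from every element.
        return [extract_value(elem, extraction) for elem in child]
    raise TypeError("cannot extract from %r" % (child,))

print(extract_value({"a": 1, "b": 2}, "a"))        # 1
print(extract_value([10, 20, 30], 1))              # 20
print(extract_value([{"a": 1}, {"a": 2}], "a"))    # [1, 2]
```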
Holds the name of a relation that has yet to be looked up in a Catalog.
Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...".
This is also used to expand structs. For example: "SELECT record.* FROM (SELECT struct(a, b, c) AS record ...)".
an optional name that should be the target of the expansion. If omitted, all targets' columns are produced. This can be either a table name or a struct name, given as a list of identifiers forming the path of the expansion.
Cleans up unnecessary Aliases inside the plan. Basically, we only need an Alias as a top-level expression in a Project (project list), Aggregate (aggregate expressions), or Window (window expressions).
Computes the current date and time to make sure we return the same result in a single query.
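The idea can be sketched as capturing now() once at analysis time and substituting that literal for every occurrence, so two CURRENT_TIMESTAMP references in one query agree (illustrative model, not the Catalyst rule itself):

```python
import datetime

def resolve_time_expressions(expressions):
    # One timestamp for the whole query, captured once at analysis time.
    now = datetime.datetime.now()
    return [now if e == "CURRENT_TIMESTAMP" else e for e in expressions]

a, b = resolve_time_expressions(["CURRENT_TIMESTAMP", "CURRENT_TIMESTAMP"])
assert a == b  # both occurrences see the same instant
```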
Removes Subquery operators from the plan. Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
A trivial catalog that returns an error when a relation is requested. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.
A trivial catalog that returns an error when a function is requested. Used for testing when all functions are already filled in and the analyzer needs only to resolve attribute references.
A collection of Rules that can be used to coerce differing types that participate in operations into compatible ones. Most of these rules are based on Hive semantics, but they do not introduce any dependencies on the Hive codebase. For this reason they remain in Catalyst until we have a more standard set of coercions.
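One representative coercion is numeric widening: given two differing numeric input types, pick the widest of the pair according to a precedence order. The sketch below is a simplified illustration of that idea (the type names and precedence list are assumptions for the example, not Catalyst's actual rule set):

```python
# Illustrative numeric precedence, narrowest to widest.
PRECEDENCE = ["byte", "short", "int", "long", "float", "double"]

def tightest_common_type(a: str, b: str):
    if a in PRECEDENCE and b in PRECEDENCE:
        # Widen to whichever type sits later in the precedence list.
        return PRECEDENCE[max(PRECEDENCE.index(a), PRECEDENCE.index(b))]
    return None  # no implicit numeric coercion between these types

assert tightest_common_type("int", "double") == "double"
assert tightest_common_type("byte", "short") == "short"
assert tightest_common_type("int", "string") is None
```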
Replaces the UpCast expression with Cast, and throws an exception if the cast may truncate.
A trivial Analyzer with an EmptyCatalog and EmptyFunctionRegistry. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.
Catches any AnalysisExceptions thrown by f and attaches t's position if any.
Provides a logical query plan Analyzer and supporting classes for performing analysis. Analysis consists of translating UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog.