Provides a logical query plan analyzer, which translates UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog and a FunctionRegistry.
An interface for looking up relations by name. Used by an Analyzer.
Throws user facing errors when passed invalid queries that fail to analyze.
This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group. The results are then combined in a second aggregate.
For example (in scala):
  val data = Seq(
      ("a", "ca1", "cb1", 10),
      ("a", "ca1", "cb2", 5),
      ("b", "ca1", "cb1", 13))
    .toDF("key", "cat1", "cat2", "value")
  data.registerTempTable("data")

  val agg = data.groupBy($"key")
    .agg(
      countDistinct($"cat1").as("cat1_cnt"),
      countDistinct($"cat2").as("cat2_cnt"),
      sum($"value").as("total"))
This translates to the following (pseudo) logical plan:
  Aggregate(
     key = ['key]
     functions = [COUNT(DISTINCT 'cat1),
                  COUNT(DISTINCT 'cat2),
                  sum('value)]
     output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
  Aggregate(
     key = ['key]
     functions = [count(if (('gid = 1)) 'cat1 else null),
                  count(if (('gid = 2)) 'cat2 else null),
                  first(if (('gid = 0)) 'total else null) ignore nulls]
     output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    Aggregate(
       key = ['key, 'cat1, 'cat2, 'gid]
       functions = [sum('value)]
       output = ['key, 'cat1, 'cat2, 'gid, 'total])
      Expand(
         projections = [('key, null, null, 0, cast('value as bigint)),
                        ('key, 'cat1, null, 1, null),
                        ('key, null, 'cat2, 2, null)]
         output = ['key, 'cat1, 'cat2, 'gid, 'value])
        LocalTableScan [...]
The rule does the following things here:
1. Expand the data. There are three aggregation groups in this query: the non-distinct group, the distinct 'cat1 group, and the distinct 'cat2 group. An Expand operator duplicates the input for each group, nulls out the columns the group does not use, and tags each copy with a group id ('gid) column.
2. Aggregate by the original grouping key, the distinct columns, and 'gid. This de-duplicates the distinct columns and aggregates 'value for the non-distinct group.
3. Aggregate by the original grouping key alone, using 'gid to route each row to the correct aggregate function, and combine the results.
This rule multiplies the input data by two or more times (the number of distinct groups plus an optional non-distinct group), which puts significant memory pressure on the aggregate and exchange operators involved. Keeping the number of distinct groups as low as possible should therefore be a priority; we could improve this rule by applying more advanced expression canonicalization techniques.
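The rewrite can be illustrated without Spark. The sketch below is a plain-Python simulation of the three steps on the sample data from the example above; the names (`expanded`, `partial`, `result`) are illustrative, not Catalyst API:

```python
from collections import defaultdict

data = [("a", "ca1", "cb1", 10), ("a", "ca1", "cb2", 5), ("b", "ca1", "cb1", 13)]

# Step 1: Expand. gid 0 keeps the value for the non-distinct sum; gids 1 and 2
# keep only the column of their distinct clause, nulling everything else out.
expanded = []
for key, cat1, cat2, value in data:
    expanded.append((key, None, None, 0, value))   # non-distinct group
    expanded.append((key, cat1, None, 1, None))    # DISTINCT cat1 group
    expanded.append((key, None, cat2, 2, None))    # DISTINCT cat2 group

# Step 2: first aggregate, grouped by (key, cat1, cat2, gid); this
# de-duplicates the distinct columns and sums the non-distinct group.
partial = defaultdict(int)
for key, cat1, cat2, gid, value in expanded:
    partial[(key, cat1, cat2, gid)] += value or 0

# Step 3: second aggregate, grouped by key alone; gid selects which rows
# feed each aggregate function.
result = defaultdict(lambda: [0, 0, 0])  # cat1_cnt, cat2_cnt, total
for (key, cat1, cat2, gid), total in partial.items():
    if gid == 1:
        result[key][0] += 1           # count(if (gid = 1) cat1)
    elif gid == 2:
        result[key][1] += 1           # count(if (gid = 2) cat2)
    else:
        result[key][2] += total       # first(if (gid = 0) total)

print(dict(result))  # {'a': [1, 2, 15], 'b': [1, 1, 13]}
```

Note how de-duplication in step 2 falls out of grouping by the distinct column together with 'gid: the two ("a", "ca1") rows from the DISTINCT cat1 group collapse into one.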
A catalog for looking up user defined functions, used by an Analyzer.
Used to assign new names to a Generator's output, such as a Hive UDTF. For example, the SQL expression "stack(2, key, value, key, value) as (a, b)" could be represented as follows: MultiAlias(stack_function, Seq(a, b))
the computation being performed
the names to be associated with each output of computing the child expression.
A trait that should be mixed into query operators where a single instance might appear multiple times in a logical query plan. It is invalid to have multiple copies of the same attribute produced by distinct operators in a query tree, as this breaks the guarantee that expression ids, which are used to differentiate attributes, are unique.
During analysis, an operator that includes this trait may be asked to produce a new version of itself with globally unique expression ids.
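The "new version with fresh ids" mechanism can be sketched as follows. This is a hypothetical model, not Spark's API: a global counter hands out unique expression ids, and `new_instance()` re-creates the operator's output attributes so that, say, the two sides of a self-join do not share ids:

```python
import itertools

# Global source of unique expression ids (illustrative).
_next_id = itertools.count()

class Attribute:
    def __init__(self, name, expr_id=None):
        self.name = name
        self.expr_id = expr_id if expr_id is not None else next(_next_id)

class Relation:
    def __init__(self, attributes):
        self.output = attributes

    def new_instance(self):
        # Same names, globally unique expression ids.
        return Relation([Attribute(a.name) for a in self.output])

t = Relation([Attribute("key"), Attribute("value")])
t2 = t.new_instance()  # e.g. the second occurrence in a self-join
assert [a.name for a in t2.output] == ["key", "value"]
assert all(a.expr_id != b.expr_id for a, b in zip(t.output, t2.output))
```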
Thrown by a catalog when a table cannot be found. The analyzer will rethrow the exception as an AnalysisException with the correct position information.
A trait that can be mixed in with other Catalogs, allowing specific tables to be overridden with new logical plans. This can be used to bind query results to virtual tables, or to replace tables with in-memory cached versions. Note that the set of overrides is stored in memory and is thus lost when the JVM exits.
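The override mechanism amounts to consulting an in-memory map before delegating to the underlying catalog. A hypothetical sketch (class and method names are illustrative, not Catalyst's):

```python
class SimpleCatalog:
    """Base catalog: a fixed mapping from table name to logical plan."""
    def __init__(self, tables):
        self.tables = tables

    def lookup_relation(self, name):
        return self.tables[name]

class OverridingCatalog(SimpleCatalog):
    """Mixes in an in-memory override map, lost when the process exits."""
    def __init__(self, tables):
        super().__init__(tables)
        self.overrides = {}

    def register_table(self, name, plan):
        self.overrides[name] = plan

    def lookup_relation(self, name):
        # Prefer the override (e.g. an in-memory cached version).
        if name in self.overrides:
            return self.overrides[name]
        return super().lookup_relation(name)

catalog = OverridingCatalog({"t": "scan(t)"})
catalog.register_table("t", "inMemoryScan(t)")
print(catalog.lookup_relation("t"))  # inMemoryScan(t)
```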
Represents all the resolved input attributes to a given relational operator.
Represents all the resolved input attributes to a given relational operator. This is used in the DataFrame DSL.
Expressions to expand.
A Resolver should return true if the first string refers to the same entity as the second string, for example by using case-insensitive equality.
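A Resolver is just an equality predicate on names. Two plausible implementations, mirroring case-sensitive and case-insensitive resolution:

```python
def case_sensitive_resolver(a: str, b: str) -> bool:
    # Names match only if they are byte-for-byte identical.
    return a == b

def case_insensitive_resolver(a: str, b: str) -> bool:
    # Names match regardless of case, e.g. "colA" and "cola".
    return a.lower() == b.lower()

assert case_insensitive_resolver("colA", "cola")
assert not case_sensitive_resolver("colA", "cola")
```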
Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...". A Star gets automatically expanded during analysis.
Represents the result of Expression.checkInputDataTypes.
We will throw an AnalysisException in CheckAnalysis if isFailure is true.
Holds the expression that has yet to be aliased.
Holds the name of an attribute that has yet to be resolved.
Thrown when an invalid attempt is made to access a property of a tree that has yet to be fully resolved.
Extracts a value or values from an Expression.
The expression to extract a value from; can be a Map, Array, Struct, or array of Structs.
The expression describing the extraction; can be the key of a Map, the ordinal of an Array, or the field name of a Struct.
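The three extraction shapes can be modeled with plain Python values (a dict standing in for both Map and Struct here; this is an illustrative sketch, not Catalyst's ExtractValue implementation):

```python
def extract_value(child, extraction):
    if isinstance(child, dict):
        # Map key or Struct field name.
        return child.get(extraction)
    if isinstance(child, (list, tuple)):
        if isinstance(extraction, int):
            # Array ordinal.
            return child[extraction]
        # Array of Structs: extract the field from every element.
        return [extract_value(elem, extraction) for elem in child]
    raise TypeError("cannot extract from %r" % (child,))

print(extract_value({"a": 1, "b": 2}, "a"))        # 1
print(extract_value([10, 20, 30], 1))              # 20
print(extract_value([{"a": 1}, {"a": 2}], "a"))    # [1, 2]
```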
Holds the name of a relation that has yet to be looked up in a Catalog.
Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...".
This is also used to expand structs. For example: "SELECT record.* FROM (SELECT struct(a, b, c) AS record ...)".
an optional name that should be the target of the expansion. If omitted, all targets' columns are produced. This can be either a table name or a struct name, given as a list of identifiers forming the path of the expansion.
Cleans up unnecessary Aliases inside the plan. Basically, we only need an Alias as a top-level expression in a Project (project list), Aggregate (aggregate expressions), or Window (window expressions).
Computes the current date and time to make sure we return the same result in a single query.
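The idea can be sketched as capturing now() once at analysis time and substituting that literal for every occurrence, so two CURRENT_TIMESTAMP references in one query agree (illustrative model, not the Catalyst rule itself):

```python
import datetime

def resolve_time_expressions(expressions):
    # One timestamp for the whole query, captured once at analysis time.
    now = datetime.datetime.now()
    return [now if e == "CURRENT_TIMESTAMP" else e for e in expressions]

a, b = resolve_time_expressions(["CURRENT_TIMESTAMP", "CURRENT_TIMESTAMP"])
assert a == b  # both occurrences see the same instant
```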
Removes Subquery operators from the plan. Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
A trivial catalog that returns an error when a relation is requested. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.
A trivial catalog that returns an error when a function is requested. Used for testing when all functions are already filled in and the analyzer needs only to resolve attribute references.
A collection of Rules that can be used to coerce differing types that participate in operations into compatible ones. Most of these rules are based on Hive semantics, but they do not introduce any dependencies on the Hive codebase. For this reason they remain in Catalyst until we have a more standard set of coercions.
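One representative coercion is numeric widening: given two differing numeric input types, pick the widest of the pair according to a precedence order. The sketch below is a simplified illustration of that idea (the type names and precedence list are assumptions for the example, not Catalyst's actual rule set):

```python
# Illustrative numeric precedence, narrowest to widest.
PRECEDENCE = ["byte", "short", "int", "long", "float", "double"]

def tightest_common_type(a: str, b: str):
    if a in PRECEDENCE and b in PRECEDENCE:
        # Widen to whichever type sits later in the precedence list.
        return PRECEDENCE[max(PRECEDENCE.index(a), PRECEDENCE.index(b))]
    return None  # no implicit numeric coercion between these types

assert tightest_common_type("int", "double") == "double"
assert tightest_common_type("byte", "short") == "short"
assert tightest_common_type("int", "string") is None
```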
Replaces the UpCast expression with Cast, and throws an exception if the cast may truncate.
A trivial Analyzer with an EmptyCatalog and EmptyFunctionRegistry. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.
Catches any AnalysisExceptions thrown by f and attaches t's position if any.
Provides a logical query plan Analyzer and supporting classes for performing analysis. Analysis consists of translating UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog.