Package

org.apache.spark.sql.catalyst

analysis

package analysis

Provides a logical query plan Analyzer and supporting classes for performing analysis. Analysis consists of translating UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog.

Linear Supertypes
AnyRef, Any

Type Members

  1. implicit class AnalysisErrorAt extends AnyRef

  2. class Analyzer extends RuleExecutor[LogicalPlan] with CheckAnalysis

    Provides a logical query plan analyzer, which translates UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog and a FunctionRegistry.

  3. trait Catalog extends AnyRef

    An interface for looking up relations by name. Used by an Analyzer.

  4. trait CheckAnalysis extends AnyRef

    Throws user-facing errors when passed invalid queries that fail to analyze.

  5. case class DistinctAggregationRewriter(conf: CatalystConf) extends Rule[LogicalPlan] with Product with Serializable

    This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause are aggregated in separate groups. The results are then combined in a second aggregate.

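    For example (in Scala):

    // Assumes a SQLContext named sqlContext is in scope, as in spark-shell.
    import sqlContext.implicits._
    import org.apache.spark.sql.functions.{countDistinct, sum}

    val data = Seq(
      ("a", "ca1", "cb1", 10),
      ("a", "ca1", "cb2", 5),
      ("b", "ca1", "cb1", 13))
      .toDF("key", "cat1", "cat2", "value")
    data.registerTempTable("data")

    // Two distinct aggregations plus one regular aggregation per key.
    val agg = data.groupBy($"key")
      .agg(
        countDistinct($"cat1").as("cat1_cnt"),
        countDistinct($"cat2").as("cat2_cnt"),
        sum($"value").as("total"))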

    This translates to the following (pseudo) logical plan:

    Aggregate(
       key = ['key]
       functions = [COUNT(DISTINCT 'cat1),
                    COUNT(DISTINCT 'cat2),
                    sum('value)]
       output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
      LocalTableScan [...]

    This rule rewrites this logical plan to the following (pseudo) logical plan:

    Aggregate(
       key = ['key]
       functions = [count(if (('gid = 1)) 'cat1 else null),
                    count(if (('gid = 2)) 'cat2 else null),
                    first(if (('gid = 0)) 'total else null) ignore nulls]
       output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
      Aggregate(
         key = ['key, 'cat1, 'cat2, 'gid]
         functions = [sum('value)]
         output = ['key, 'cat1, 'cat2, 'gid, 'total])
        Expand(
           projections = [('key, null, null, 0, cast('value as bigint)),
                          ('key, 'cat1, null, 1, null),
                          ('key, null, 'cat2, 2, null)]
           output = ['key, 'cat1, 'cat2, 'gid, 'value])
          LocalTableScan [...]

    The rule does the following things here:

    1. Expand the data. There are three aggregation groups in this query:

       i. the non-distinct group;
       ii. the distinct 'cat1 group;
       iii. the distinct 'cat2 group.

       An expand operator is inserted to expand the child data for each group. The expand will null out all unused columns for the given group; this must be done in order to ensure correctness later on. Groups can be identified by a group id (gid) column added by the expand operator.

    2. De-duplicate the distinct paths and aggregate the non-distinct path. The group by clause of this aggregate consists of the original group by clause, all the requested distinct columns and the group id. Both the de-duplication of the distinct columns and the aggregation of the non-distinct group take advantage of the fact that we group by the group id (gid) and that we have nulled out all non-relevant columns for the given group.

    3. Aggregate the distinct groups and combine this with the results of the non-distinct aggregation. In this step we use the group id to filter the inputs for the aggregate functions. The results of the non-distinct group are 'aggregated' by using the first operator; it might be more elegant to use the native UDAF merge mechanism for this in the future.

    This rule duplicates the input data two or more times (# distinct groups + an optional non-distinct group). This puts considerable memory pressure on the aggregate and exchange operators used. Keeping the number of distinct groups as low as possible should be a priority; we could improve this in the current rule by applying more advanced expression canonicalization techniques.
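
    The three steps can be traced on the example data with ordinary Scala collections. The following is only a sketch of the idea, not the rule's implementation: Row, Expanded, Result and the three functions are hypothetical names introduced here, and Option stands in for SQL null.

```scala
object DistinctRewriteSketch {
  // One input row of the example data: (key, cat1, cat2, value).
  final case class Row(key: String, cat1: String, cat2: String, value: Long)

  // A row after Expand (and after the first aggregate); None models a SQL null.
  final case class Expanded(
      key: String,
      cat1: Option[String],
      cat2: Option[String],
      gid: Int,
      value: Option[Long])

  // Final output per key: cat1_cnt, cat2_cnt, total.
  final case class Result(cat1Cnt: Long, cat2Cnt: Long, total: Option[Long])

  val data: Seq[Row] = Seq(
    Row("a", "ca1", "cb1", 10L),
    Row("a", "ca1", "cb2", 5L),
    Row("b", "ca1", "cb1", 13L))

  // Step 1: one projection per aggregation group, with unused columns nulled out.
  def expand(rows: Seq[Row]): Seq[Expanded] = rows.flatMap { r =>
    Seq(
      Expanded(r.key, None, None, 0, Some(r.value)), // gid 0: non-distinct group
      Expanded(r.key, Some(r.cat1), None, 1, None),  // gid 1: distinct cat1 group
      Expanded(r.key, None, Some(r.cat2), 2, None))  // gid 2: distinct cat2 group
  }

  // Step 2: group by (key, cat1, cat2, gid). Grouping de-duplicates the distinct
  // paths, and the sum collapses the non-distinct path to one row per key.
  def firstAggregate(rows: Seq[Expanded]): Seq[Expanded] =
    rows.groupBy(r => (r.key, r.cat1, r.cat2, r.gid)).map {
      case ((key, cat1, cat2, gid), group) =>
        val values = group.flatMap(_.value)
        // A SQL sum over nothing but nulls is null, modelled here as None.
        val total = if (values.isEmpty) None else Some(values.sum)
        Expanded(key, cat1, cat2, gid, total)
    }.toSeq

  // Step 3: group by key and use gid to select the inputs of each function.
  def secondAggregate(rows: Seq[Expanded]): Map[String, Result] =
    rows.groupBy(_.key).map { case (key, group) =>
      val cat1Cnt = group.count(r => r.gid == 1 && r.cat1.isDefined).toLong
      val cat2Cnt = group.count(r => r.gid == 2 && r.cat2.isDefined).toLong
      // Plays the role of first(...) ignore nulls over the gid 0 rows.
      val total = group.collectFirst { case Expanded(_, _, _, 0, Some(t)) => t }
      (key, Result(cat1Cnt, cat2Cnt, total))
    }

  val result: Map[String, Result] = secondAggregate(firstAggregate(expand(data)))
}
```

    In this sketch every input row becomes three expanded rows, which is the duplication cost described above.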

  6. trait FunctionRegistry extends AnyRef

    Permalink

    A catalog for looking up user defined functions, used by an Analyzer.

  7. case class MultiAlias(child: Expression, names: Seq[String]) extends UnaryExpression with NamedExpression with CodegenFallback with Product with Serializable

    Used to assign new names to a Generator's output, such as the output of a Hive UDTF. For example, the SQL expression "stack(2, key, value, key, value) as (a, b)" could be represented as follows: MultiAlias(stack_function, Seq(a, b))

    child

    the computation being performed

    names

    the names to be associated with each output of computing child.

  8. trait MultiInstanceRelation extends AnyRef

    A trait that should be mixed into query operators where a single instance might appear multiple times in a logical query plan. It is invalid to have multiple copies of the same attribute produced by distinct operators in a query tree as this breaks the guarantee that expression ids, which are used to differentiate attributes, are unique.

    During analysis, operators that include this trait may be asked to produce a new version of themselves with globally unique expression ids.
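
    A self join is the typical case in which the same relation instance appears twice. The sketch below illustrates the idea with stand-in types only: Attr, Relation, relation and dedupRight are hypothetical and are not the Catalyst classes.

```scala
import java.util.concurrent.atomic.AtomicLong

object MultiInstanceSketch {
  // Global counter standing in for the source of unique expression ids.
  private val nextId = new AtomicLong(0L)

  // Stand-in for an attribute: a name plus an id that must be unique within a plan.
  final case class Attr(name: String, exprId: Long)

  // Stand-in for a leaf relation that can produce a copy of itself with fresh ids.
  final case class Relation(table: String, output: Seq[Attr]) {
    def newInstance(): Relation =
      copy(output = output.map(a => a.copy(exprId = nextId.getAndIncrement())))
  }

  // Builds a relation whose columns receive fresh ids.
  def relation(table: String, columns: String*): Relation =
    Relation(table, columns.map(c => Attr(c, nextId.getAndIncrement())))

  // If both join sides share expression ids, ask the right side for a new instance.
  def dedupRight(left: Relation, right: Relation): Relation = {
    val leftIds = left.output.map(_.exprId).toSet
    if (right.output.exists(a => leftIds.contains(a.exprId))) right.newInstance() else right
  }
}
```

    Joining a relation with itself then yields two sides with the same column names but disjoint id sets.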

  9. class NoSuchDatabaseException extends Exception

  10. class NoSuchTableException extends Exception

    Thrown by a catalog when a table cannot be found. The analyzer will rethrow the exception as an AnalysisException with the correct position information.

  11. trait OverrideCatalog extends Catalog

    A trait that can be mixed in with other Catalogs, allowing specific tables to be overridden with new logical plans. This can be used to bind query results to virtual tables, or to replace tables with in-memory cached versions. Note that the set of overrides is stored in memory and is thus lost when the JVM exits.
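
    The mix-in pattern can be sketched with a much-simplified catalog in which a plan is just a String. MiniCatalog, BaseCatalog, Overrides and their methods are hypothetical names chosen for this illustration; the real Catalog interface is larger.

```scala
import scala.collection.concurrent.TrieMap

object OverrideCatalogSketch {
  // Hypothetical, much-simplified catalog: a "plan" is just a String here.
  trait MiniCatalog {
    def lookupRelation(name: String): Option[String]
  }

  // An underlying catalog backed by a fixed map of tables.
  class BaseCatalog(tables: Map[String, String]) extends MiniCatalog {
    def lookupRelation(name: String): Option[String] = tables.get(name)
  }

  // Stackable trait: consult the in-memory overrides first, then fall back
  // to whatever catalog this trait is mixed into.
  trait Overrides extends MiniCatalog {
    private val overrides = TrieMap.empty[String, String]

    def registerOverride(name: String, plan: String): Unit =
      overrides.update(name, plan)

    def unregisterOverride(name: String): Unit = {
      overrides.remove(name)
      ()
    }

    abstract override def lookupRelation(name: String): Option[String] =
      overrides.get(name).orElse(super.lookupRelation(name))
  }
}
```

    A catalog built as new BaseCatalog(...) with Overrides answers from the override map while an entry is registered and from the base catalog otherwise.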

  12. case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Unevaluable with Product with Serializable

    Represents all the resolved input attributes to a given relational operator. This is used in the data frame DSL.

    expressions

    Expressions to expand.

  13. type Resolver = (String, String) ⇒ Boolean

    Resolver should return true if the first string refers to the same entity as the second string, for example by using case-insensitive equality.
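
    A minimal sketch of two resolvers with this shape follows. The definitions and the findColumn helper are assumptions made for illustration; they are not copied from the package's caseSensitiveResolution and caseInsensitiveResolution values.

```scala
object ResolverSketch {
  // Same shape as the Resolver type alias documented above.
  type Resolver = (String, String) => Boolean

  // Assumed definitions: string equality with and without case folding.
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical helper: the first column whose name the resolver accepts.
  def findColumn(columns: Seq[String], name: String, resolver: Resolver): Option[String] =
    columns.find(col => resolver(col, name))
}
```

    Passing the resolver as a plain function lets the same lookup code serve both case-sensitive and case-insensitive configurations.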

  14. class SimpleCatalog extends Catalog

  15. class SimpleFunctionRegistry extends FunctionRegistry

  16. abstract class Star extends LeafExpression with NamedExpression

    Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...". A Star gets automatically expanded during analysis.

  17. trait TypeCheckResult extends AnyRef

    Represents the result of Expression.checkInputDataTypes. We will throw AnalysisException in CheckAnalysis if isFailure is true.
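
    The success-or-failure pattern described here can be sketched as follows. All names (CheckResult, Success, Failure, AnalysisError, checkAdd, failIfInvalid) are stand-ins invented for this example and only approximate the shape of the real types.

```scala
object TypeCheckSketch {
  // Stand-in for TypeCheckResult: either success or a failure with a message.
  sealed trait CheckResult {
    def isSuccess: Boolean
    def isFailure: Boolean = !isSuccess
  }
  case object Success extends CheckResult { val isSuccess = true }
  final case class Failure(message: String) extends CheckResult { val isSuccess = false }

  // Stand-in for AnalysisException.
  final class AnalysisError(message: String) extends Exception(message)

  // Hypothetical input check: an "add" requires both inputs to share one type.
  def checkAdd(left: String, right: String): CheckResult =
    if (left == right) Success
    else Failure(s"differing types in add: $left and $right")

  // Mirrors the behaviour described above: a failed check becomes a thrown error.
  def failIfInvalid(result: CheckResult): Unit = result match {
    case Failure(message) => throw new AnalysisError(message)
    case Success          => ()
  }
}
```

    Returning a value from the check, rather than throwing inside it, leaves the decision of when and how to report the error to the analysis phase.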

  18. case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression with Unevaluable with Product with Serializable

    Holds the expression that has yet to be aliased.

  19. case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable with Product with Serializable

    Holds the name of an attribute that has yet to be resolved.

  20. class UnresolvedException[TreeType <: TreeNode[_]] extends TreeNodeException[TreeType]

    Thrown when an invalid attempt is made to access a property of a tree that has yet to be fully resolved.

  21. case class UnresolvedExtractValue(child: Expression, extraction: Expression) extends UnaryExpression with Unevaluable with Product with Serializable

    Extracts a value or values from an Expression.

    child

    The expression to extract a value from; it can be a Map, an Array, a Struct, or an array of Structs.

    extraction

    The expression that describes the extraction; it can be a key of a Map, an index of an Array, or a field name of a Struct.

  22. case class UnresolvedFunction(name: String, children: Seq[Expression], isDistinct: Boolean) extends Expression with Unevaluable with Product with Serializable

  23. case class UnresolvedRelation(tableIdentifier: TableIdentifier, alias: Option[String] = None) extends LeafNode with Product with Serializable

    Holds the name of a relation that has yet to be looked up in a Catalog.

  24. case class UnresolvedStar(target: Option[Seq[String]]) extends Star with Unevaluable with Product with Serializable

    Represents all of the input attributes to a given relational operator, for example in "SELECT * FROM ...".

    This is also used to expand structs. For example: "SELECT record.* from (SELECT struct(a,b,c) as record ...)"

    target

    an optional name that should be the target of the expansion. If omitted, the columns of all targets are produced. This can be either a table name or a struct name, given as a list of identifiers that forms the path of the expansion.
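
    How the optional target narrows the expansion can be sketched for the table-name case. Column and expand are hypothetical stand-ins, and the struct case is deliberately left out of this sketch.

```scala
object StarExpansionSketch {
  // Stand-in for a resolved input attribute, qualified by the table it came from.
  final case class Column(qualifier: String, name: String)

  type Resolver = (String, String) => Boolean

  // No target: every input column is produced. A single-part target: only the
  // columns whose qualifier the resolver matches against the target name.
  def expand(target: Option[Seq[String]], input: Seq[Column], resolver: Resolver): Seq[Column] =
    target match {
      case None             => input
      case Some(Seq(table)) => input.filter(c => resolver(c.qualifier, table))
      case Some(path)       =>
        sys.error(s"expansion of ${path.mkString(".")} is not covered by this sketch")
    }
}
```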

Value Members

  1. object CleanupAliases extends Rule[LogicalPlan]

    Cleans up unnecessary Aliases inside the plan. Basically we only need Alias as a top level expression in Project(project list) or Aggregate(aggregate expressions) or Window(window expressions).

  2. object ComputeCurrentTime extends Rule[LogicalPlan]

    Computes the current date and time to make sure we return the same result in a single query.

  3. object EliminateSubQueries extends Rule[LogicalPlan]

    Removes Subquery operators from the plan. Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.

  4. object EmptyCatalog extends Catalog

    A trivial catalog that returns an error when a relation is requested. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.

  5. object EmptyFunctionRegistry extends FunctionRegistry

    A trivial catalog that returns an error when a function is requested. Used for testing when all functions are already filled in and the analyzer needs only to resolve attribute references.

  6. object FunctionRegistry

  7. object HiveTypeCoercion

    A collection of Rules that can be used to coerce differing types that participate in operations into compatible ones. Most of these rules are based on Hive semantics, but they do not introduce any dependencies on the hive codebase. For this reason they remain in Catalyst until we have a more standard set of coercions.

  8. object ResolveUpCast extends Rule[LogicalPlan]

    Replace the UpCast expression by Cast, and throw exceptions if the cast may truncate.

  9. object SimpleAnalyzer extends Analyzer

    A trivial Analyzer with an EmptyCatalog and EmptyFunctionRegistry. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.

  10. object TypeCheckResult

  11. object UnresolvedAttribute extends Serializable

  12. val caseInsensitiveResolution: (String, String) ⇒ Boolean

  13. val caseSensitiveResolution: (String, String) ⇒ Boolean

  14. def withPosition[A](t: TreeNode[_])(f: ⇒ A): A

    Catches any AnalysisExceptions thrown by f and attaches t's position if any.
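
    The catch-and-rethrow pattern can be sketched with stand-in types. Origin and PositionedError are hypothetical simplifications of a tree node's origin and of AnalysisException, and the function takes the origin directly rather than a TreeNode.

```scala
object WithPositionSketch {
  // Stand-in for the position information carried by a tree node.
  final case class Origin(line: Option[Int] = None, startPosition: Option[Int] = None)

  // Stand-in for AnalysisException with optional position fields.
  final case class PositionedError(
      message: String,
      line: Option[Int] = None,
      startPosition: Option[Int] = None)
    extends Exception(message)

  // Runs f; if it throws an analysis error, rethrows a copy carrying the position.
  def withPosition[A](origin: Origin)(f: => A): A =
    try f
    catch {
      case e: PositionedError =>
        throw e.copy(line = origin.line, startPosition = origin.startPosition)
    }
}
```

    Because f is a by-name parameter, callers can wrap an arbitrary block of resolution code without allocating an explicit function object at the call site.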
