A relation produced by applying func to each element of the child, concatenating the resulting columns at the end of the input row.
An optimized version of AppendColumns that can be executed directly on deserialized objects.
A logical plan node with a left and right child.
A relation produced by applying func to each grouping key and associated values from the left and right children.
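The co-grouping behavior described above can be illustrated with a minimal pure-Python sketch. It is not Spark code: `cogroup`, `key_fn`, and the `func(key, left_values, right_values)` signature are hypothetical names chosen for illustration, and keys are sorted only to make the output deterministic.

```python
from collections import defaultdict


def cogroup(left, right, key_fn, func):
    """Call func(key, left_values, right_values) once per key found on either side."""
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for row in left:
        left_groups[key_fn(row)].append(row)
    for row in right:
        right_groups[key_fn(row)].append(row)

    out = []
    # Sorting is an assumption of this sketch, used only for a stable order.
    for key in sorted(set(left_groups) | set(right_groups)):
        left_values = iter(left_groups.get(key, []))
        right_values = iter(right_groups.get(key, []))
        out.extend(func(key, left_values, right_values))
    return out


def count_both(key, left_values, right_values):
    # Emits one row per key with the number of rows seen on each side.
    return [(key, sum(1 for _ in left_values), sum(1 for _ in right_values))]


counts = cogroup([("a", 1), ("a", 2), ("b", 3)], [("a", 9), ("c", 7)],
                 key_fn=lambda row: row[0], func=count_both)
```

A key that appears on only one side still reaches `func`, with an empty iterator for the other side.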
Statistics collected for a column.
1. Supported data types are defined in ColumnStat.supportsType.
2. The JVM data type stored in min/max is the internal data type for the corresponding Catalyst data type. For example, the internal type of DateType is Int, and the internal type of TimestampType is Long.
3. There is no guarantee that the statistics collected are accurate. Approximation algorithms
(sketches) might have been used, and the data collected can also be stale.
number of distinct values
minimum value
maximum value
number of nulls
average length of the values. For fixed-length types, this should be a constant.
maximum length of the values. For fixed-length types, this should be a constant.
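To make the note about internal types concrete, here is a small sketch of how a date or timestamp could map to the integer stored in min/max. It assumes the usual Catalyst encoding (days since 1970-01-01 for DateType, microseconds since the Unix epoch for TimestampType); the helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TIMESTAMP = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_internal(value):
    """Days since the epoch (assumed internal Int form of DateType)."""
    return (value - EPOCH_DATE).days


def timestamp_to_internal(value):
    """Microseconds since the epoch (assumed internal Long form of TimestampType)."""
    return (value - EPOCH_TIMESTAMP) // timedelta(microseconds=1)


min_date = date_to_internal(date(1970, 1, 2))
min_timestamp = timestamp_to_internal(
    datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc))
```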
A logical node that represents a non-query command to be executed by the system. For example, commands can be used by parsers to represent DDL operations. Commands, unlike queries, are eagerly executed.
A logical plan for dropDuplicates.
Takes the input row from child and turns it into an object using the given deserializer expression.
Returns a new logical plan that deduplicates input rows.
Used to mark a user specified column as holding the event time for a row.
Applies a number of projections to every input row, so a single input row can yield multiple output rows.
to apply
of all projections.
operator.
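As a rough illustration of the row multiplication described above, the following sketch applies every projection to every row. It is a plain-Python model, not the operator's implementation; `expand` and the lambda projections are hypothetical.

```python
def expand(rows, projections):
    """Emit one output row per (input row, projection) pair."""
    out = []
    for row in rows:
        for project in projections:
            out.append(project(row))
    return out


# Two projections, each keeping one column and nulling the other.
projections = [
    lambda row: (row[0], None),
    lambda row: (None, row[1]),
]
expanded = expand([(1, "x"), (2, "y")], projections)
```

With two projections, each input row produces exactly two output rows.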
Applies func to each unique group in child, based on the evaluation of groupingAttributes, while using state data.
Func is invoked with an object representation of the grouping key and an iterator containing the object representation of all the rows with that key.
function called on each group
used to extract the key object for each group.
used to extract the items in the iterator from an input row.
used to group the data
used to read the data
used to define the output object
used to serialize/deserialize state before calling func
the output mode of func
whether it is created by the mapGroupsWithState method
used to timeout groups that have not received data in a while
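The idea of applying a function per group while carrying state between invocations might be sketched as follows. This is a simplified model under stated assumptions: state lives in a plain dict, `func` returns a `(new_state, output_rows)` pair instead of mutating a state handle, and timeouts and output modes are not modeled. All names are hypothetical.

```python
from collections import defaultdict


def flat_map_groups_with_state(batch, key_fn, func, state_store):
    """Group one batch by key and call func(key, values, previous_state)."""
    groups = defaultdict(list)
    for row in batch:
        groups[key_fn(row)].append(row)

    out = []
    for key, values in groups.items():
        # state_store.get returns None for a key seen for the first time.
        new_state, produced = func(key, iter(values), state_store.get(key))
        state_store[key] = new_state
        out.extend(produced)
    return out


def running_count(key, values, state):
    total = (state or 0) + sum(1 for _ in values)
    return total, [(key, total)]


store = {}
first = flat_map_groups_with_state([("a", 1), ("a", 2), ("b", 3)],
                                   lambda row: row[0], running_count, store)
second = flat_map_groups_with_state([("a", 9)],
                                    lambda row: row[0], running_count, store)
```

The second call sees the count left behind by the first, which is the point of keeping state across batches.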
Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming with one important additional feature, which allows the input rows to be joined with their output.
the generator expression
when true, each output row is implicitly joined with the input tuple that produced it.
when true, each input row will be output at least once, even if the output of the given generator is empty.
Qualifier for the attributes of generator (UDTF)
The output schema of the Generator.
Child logical plan node
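The two flags described above (joining each output row with its input, and emitting an input row even when the generator output is empty) can be modeled in a few lines. This is a hedged sketch, not the operator itself: `generate`, its parameter names, and the `width` argument used to pad empty output with nulls are assumptions made for illustration.

```python
def generate(rows, generator, join=False, outer=False, width=1):
    """flatMap-like step with optional join to the input row and outer semantics."""
    out = []
    for row in rows:
        produced = list(generator(row))
        if not produced and outer:
            # Outer: keep the input row, padding generator columns with None.
            produced = [(None,) * width]
        for generated in produced:
            out.append(row + generated if join else generated)
    return out


def explode_second(row):
    # Hypothetical generator: one output row per element of the list column.
    return [(item,) for item in row[1]]


rows = [("a", [1, 2]), ("b", [])]
plain = generate(rows, explode_second)
joined_outer = generate(rows, explode_second, join=True, outer=True)
```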
A GROUP BY clause with GROUPING SETS can generate a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses.
The Analyzer transforms GROUPING SETS into the logical plan Aggregate(.., Expand).
A sequence of selected GROUP BY expressions; every expression must exist in groupByExprs.
The candidate GROUP BY expressions.
Child operator
The aggregation expressions; a GROUP BY expression that is not selected is treated as a constant null wherever it appears in these expressions.
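The equivalence with a UNION ALL of simple GROUP BY clauses can be shown with a small sketch that aggregates once per grouping set and concatenates the results, replacing non-selected grouping columns with null. The function names and the dict-based row format are hypothetical; the aggregate is fixed to a sum for brevity.

```python
from collections import defaultdict


def group_by_sum(rows, group_cols, selected, value_col):
    """One simple GROUP BY; grouping columns outside `selected` become None."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[col] if col in selected else None for col in group_cols)
        totals[key] += row[value_col]
    return [key + (total,) for key, total in totals.items()]


def grouping_sets_sum(rows, group_cols, grouping_sets, value_col):
    """UNION ALL of one GROUP BY per grouping set."""
    out = []
    for selected in grouping_sets:
        out.extend(group_by_sum(rows, group_cols, selected, value_col))
    return out


rows = [
    {"a": 1, "b": "x", "v": 10},
    {"a": 1, "b": "y", "v": 5},
    {"a": 2, "b": "x", "v": 1},
]
result = grouping_sets_sum(rows, ("a", "b"), [("a",), ("b",)], "v")
```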
Insert some data into a table. Note that this plan is unresolved and has to be replaced by the concrete implementations during analysis.
the logical plan representing the table. In the future this should be an org.apache.spark.sql.catalyst.catalog.CatalogTable once we converge Hive tables and data source tables.
a map from the partition key to the partition value (optional). If a partition value is omitted (None), a dynamic partition insert will be performed. As an example, INSERT INTO tbl PARTITION (a=1, b=2) AS ... would have Map('a' -> Some('1'), 'b' -> Some('2')), and INSERT INTO tbl PARTITION (a=1, b) AS ... would have Map('a' -> Some('1'), 'b' -> None).
the logical plan representing the data to write.
overwrite existing table or partitions.
If true, only write if the partition does not exist. Only valid for static partitions.
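The partition map in the examples above can be mimicked with a short sketch that turns the text inside PARTITION (...) into a key-to-optional-value mapping. This is not the parser's logic; `partition_spec` and `is_dynamic` are hypothetical helpers, and Python's `None` stands in for a missing (dynamic) value.

```python
def partition_spec(clause):
    """Map each partition key to its value string, or None when no value is given."""
    spec = {}
    for part in clause.split(","):
        name, separator, value = part.strip().partition("=")
        spec[name.strip()] = value.strip() if separator else None
    return spec


def is_dynamic(spec):
    """True when at least one partition value is left to be filled from the data."""
    return any(value is None for value in spec.values())


static_spec = partition_spec("a=1, b=2")
dynamic_spec = partition_spec("a=1, b")
```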
A logical plan node with no children.
Internal class representing State
A relation produced by applying func to each element of the child.
Applies func to each unique group in child, based on the evaluation of groupingAttributes.
Func is invoked with an object representation of the grouping key and an iterator containing the object representation of all the rows with that key.
used to extract the key object for each group.
used to extract the items in the iterator from an input row.
A relation produced by applying func to each partition of the child.
A relation produced by applying a serialized R function func to each partition of the child.
A trait for logical operators that consume domain objects as input. The output of its child must be a single-field row containing the input object.
A trait for logical operators that produce domain objects as output. The output of this operator is a single-field safe row containing the produced object.
Returns a new RDD that has exactly numPartitions partitions. Differs from RepartitionByExpression in that this method is called directly by DataFrames, because the user asked for coalesce or repartition. RepartitionByExpression is used when the consumer of the output requires some specific ordering or distribution of the data.
This method repartitions data using Expressions into numPartitions, and receives information about the number of partitions during execution. Used when a specific ordering or distribution is expected by the consumer of the query result. Use Repartition for RDD-like coalesce and repartition.
A base interface for RepartitionByExpression and Repartition
A resolved hint node. The analyzer should convert all UnresolvedHint into ResolvedHint.
When planning take() or collect() operations, this special node is inserted at the top of the logical plan before invoking the query planner.
Rules can pattern-match on this node in order to apply transformations that only take effect at the top of the logical query plan.
Sample the dataset.
Lower-bound of the sampling probability (usually 0.0)
Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.
Whether to sample with replacement.
the random seed
the LogicalPlan
Is created from TABLESAMPLE in the parser.
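The lower-bound and upper-bound parameters above suggest a range test on a per-row random draw. The sketch below shows that idea for sampling without replacement only; it is an assumption about how such bounds can be used, not the operator's implementation, and `sample_rows` is a hypothetical name.

```python
import random


def sample_rows(rows, lower_bound, upper_bound, seed):
    """Keep a row when its uniform draw in [0, 1) falls in [lower_bound, upper_bound)."""
    rng = random.Random(seed)
    return [row for row in rows if lower_bound <= rng.random() < upper_bound]


rows = list(range(1000))
# With the same seed, adjacent ranges select disjoint, complementary subsets.
part_a = sample_rows(rows, 0.0, 0.3, seed=42)
part_b = sample_rows(rows, 0.3, 1.0, seed=42)
```

Because each row gets the same draw for a given seed, ranges such as [0.0, 0.3) and [0.3, 1.0) split the input without overlap, and the expected fraction kept by a range is ub - lb.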
Input and output properties when passing data to a script. For example, in Hive this would specify which SerDes to use.
Transforms the input by forking and running the specified script.
the set of expressions that should be passed to the script.
the command that should be executed.
the attributes that are produced by the script.
the input and output schema applied in the execution of the script.
Takes the input object from child and turns it into an unsafe row using the given serializer expression.
The ordering expressions
True means the sort applies globally across the entire data set; false means the sort applies only within each partition.
Child logical plan
Estimates of various statistics. The default estimation logic simply lazily multiplies the corresponding statistic produced by the children. To override this behavior, override statistics and assign it an overridden version of Statistics.
NOTE: concrete and/or overridden versions of statistics fields should pay attention to the performance of the implementations. The reason is that estimations might get triggered in performance-critical processes, such as query planning.
Note that we are using a BigInt here since it is easy to overflow a 64-bit integer in cardinality estimation (e.g. cartesian joins).
Physical size in bytes. For leaf operators this defaults to 1, otherwise it defaults to the product of children's sizeInBytes.
Estimated number of rows.
Statistics for Attributes.
Query hints.
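The default size estimate and the reason for using BigInt can be illustrated with a short sketch. Plans are modeled as nested dicts, which is purely an assumption of this example, and Python's arbitrary-precision integers stand in for BigInt.

```python
INT64_MAX = 2**63 - 1


def default_size_in_bytes(node):
    """Leaf defaults to 1; otherwise the product of the children's estimates."""
    if "size_in_bytes" in node:
        # An operator that overrides the default estimate.
        return node["size_in_bytes"]
    children = node.get("children", [])
    size = 1
    for child in children:
        size *= default_size_in_bytes(child)
    return size


# Two relations of 2**40 bytes each under a cartesian-style binary node.
left = {"size_in_bytes": 2**40}
right = {"size_in_bytes": 2**40}
join = {"children": [left, right]}
estimate = default_size_in_bytes(join)
```

Here the estimate is 2**80, which no longer fits in a signed 64-bit integer.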
This node is inserted at the top of a subquery when it is optimized. This makes sure we can recognize a subquery as such, and it allows us to write subquery aware transformations.
A relation produced by applying func to each element of the child and filtering them by the resulting boolean value.
This is logically equal to a normal Filter operator whose condition expression decodes the input row to an object and applies the given function to the decoded object. However, we need the encapsulation of TypedFilter to make the concept clearer and to make it easier to write optimizer rules.
A logical plan node with a single child.
A general hint for the child that is not yet resolved. This node is generated by the parser and will be eliminated post analysis.
the name of the hint
the parameters of the hint
the LogicalPlan on which this hint applies
A container for holding the view description (CatalogTable) and the output of the view. The child should be a logical plan parsed from CatalogTable.viewText; an error should be thrown if the viewText is not defined.
This operator will be removed at the end of analysis stage.
A view description (CatalogTable) that provides necessary information to resolve the view.
The output of a view operator. It is generated while planning the view, so that the output can be decoupled from the underlying structure.
The logical plan of a view operator. It should be a logical plan parsed from CatalogTable.viewText; an error should be thrown if the viewText is not defined.
A container for holding named common table expressions (CTEs) and a query plan. This operator will be removed during analysis and the relations will be substituted into child.
The final query of this CTE.
A sequence of pairs (alias, CTE definition) that this CTE defines. Each CTE can see only the base tables and the previously defined CTEs.
Factory for constructing new AppendColumn nodes.
Factory for constructing new CoGroup nodes.
Factory for constructing new FlatMapGroupsInR nodes.
Factory for constructing new MapGroupsWithState nodes.
Factory for constructing new MapGroups nodes.
Types of timeouts used in FlatMapGroupsWithState
A relation with one row. This is used in "SELECT ..." without a from clause.
Factory for constructing new Range nodes.
Factory for constructing new Union nodes.
A relation produced by applying func to each element of the child, concatenating the resulting columns at the end of the input row.
used to extract the input to func from an input row.
used to serialize the output of func.
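The three steps named above (extract the input to func from a row, apply func, serialize the output and append it to the row) can be sketched as follows. This is a plain-Python model with rows as tuples; `append_columns`, `deserialize`, and `serialize` are hypothetical stand-ins for the deserializer and serializer expressions.

```python
def append_columns(rows, deserialize, func, serialize):
    """Append the serialized result of func to the end of each input row."""
    out = []
    for row in rows:
        obj = deserialize(row)             # extract the input to func from the row
        new_columns = serialize(func(obj)) # turn func's output into columns
        out.append(row + new_columns)      # original columns stay in place
    return out


appended = append_columns(
    [("ann", 3), ("bob", 5)],
    deserialize=lambda row: row[1],
    func=lambda n: n * 2,
    serialize=lambda value: (value,),
)
```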