Packages

package python

Package Members

  1. package streaming

Type Members

  1. case class ArrowAggregatePythonExec(groupingExpressions: Seq[NamedExpression], aggExpressions: Seq[AggregateExpression], resultExpressions: Seq[NamedExpression], child: SparkPlan, evalType: Int) extends SparkPlan with UnaryExecNode with PythonSQLMetrics with Product with Serializable

    Physical node for aggregation with a group aggregate vectorized UDF. The following eval types are supported:

    • SQL_GROUPED_AGG_ARROW_UDF for Arrow UDF
    • SQL_GROUPED_AGG_PANDAS_UDF for Pandas UDF

    This plan works by sending the necessary (projected) grouped input data as Arrow record batches to the Python worker; the Python worker invokes the UDF and sends the results back to the executor; finally, the executor evaluates any post-aggregation expressions and joins the result with the grouping key.
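
    For illustration, user code along the following lines would typically be planned with this node (a minimal PySpark sketch; the SparkSession spark, the column names, and the data are assumptions for illustration only):

      # Hypothetical grouped-aggregate Pandas UDF; each group's values arrive as a pandas Series.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:
          return float(v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").agg(mean_udf(df["v"]).alias("mean_v")).show()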

  2. class ArrowEvalPythonEvaluatorFactory extends EvalPythonEvaluatorFactory
  3. case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a vectorized UDF. The following eval types are supported (see the sketch after this list):

    • SQL_ARROW_BATCHED_UDF for Arrow Optimized Python UDF
    • SQL_SCALAR_ARROW_UDF for Scalar Arrow UDF
    • SQL_SCALAR_ARROW_ITER_UDF for Scalar Iterator Arrow UDF
    • SQL_SCALAR_PANDAS_UDF for Scalar Pandas UDF
    • SQL_SCALAR_PANDAS_ITER_UDF for Scalar Iterator Pandas UDF
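
    As an illustration, a scalar Pandas UDF such as the following would typically be executed by this node (a minimal sketch; the SparkSession spark and the column names are assumptions):

      # Hypothetical scalar Pandas UDF (eval type SQL_SCALAR_PANDAS_UDF);
      # batches of rows are converted to a pandas Series via Arrow.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("long")
      def plus_one(s: pd.Series) -> pd.Series:
          return s + 1

      df = spark.range(10)
      df.select(plus_one(df["id"]).alias("id_plus_one")).show()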
  4. case class ArrowEvalPythonUDTFExec(udtf: PythonUDTF, requiredChildOutput: Seq[Attribute], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonUDTFExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDTF using Apache Arrow. This is similar to ArrowEvalPythonExec.

    udtf: the user-defined Python function.
    requiredChildOutput: the required output of the child plan, used to omit data generation that would be discarded by a subsequent projection.
    resultAttrs: the output schema of the Python UDTF.
    child: the child plan.
    evalType: the Python eval type.
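
    For illustration, an Arrow-optimized Python UDTF might be declared as below (a minimal sketch; the class, the column names, and the useArrow flag of the udtf decorator are assumptions based on recent PySpark versions):

      # Hypothetical Python UDTF evaluated via Arrow; useArrow=True requests the Arrow path.
      from pyspark.sql.functions import lit, udtf

      @udtf(returnType="num: int, squared: int", useArrow=True)
      class SquareNumbers:
          def eval(self, start: int, end: int):
              for n in range(start, end + 1):
                  yield (n, n * n)

      SquareNumbers(lit(1), lit(3)).show()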

  5. trait ArrowOutputProcessor extends AnyRef
  6. class ArrowOutputProcessorImpl extends ArrowOutputProcessor
  7. class ArrowPythonRunner extends RowInputArrowPythonRunner

    Similar to PythonUDFRunner, but exchange data with Python worker via Arrow stream.

  8. class ArrowPythonUDTFRunner extends BasePythonRunner[Iterator[InternalRow], ColumnarBatch] with BatchedPythonArrowInput with BasicPythonArrowOutput

    Similar to ArrowPythonRunner, but for PythonUDTFs.

  9. class ArrowPythonWithNamedArgumentRunner extends RowInputArrowPythonRunner

    Similar to PythonUDFWithNamedArgumentsRunner, but exchange data with Python worker via Arrow stream.

  10. class ArrowWindowPythonEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow] with WindowEvaluatorFactoryBase
  11. case class ArrowWindowPythonExec(windowExpression: Seq[NamedExpression], partitionSpec: Seq[Expression], orderSpec: Seq[SortOrder], child: SparkPlan, evalType: Int) extends SparkPlan with WindowExecBase with PythonSQLMetrics with Product with Serializable

    This class calculates and outputs windowed aggregates over the rows in a single partition. The following eval types are supported:

    • SQL_WINDOW_AGG_ARROW_UDF for Arrow UDF
    • SQL_WINDOW_AGG_PANDAS_UDF for Pandas UDF

    This is similar to WindowExec. The main difference is that this node does not compute any window aggregation values. Instead, it computes the lower and upper bound for each window (i.e. the window bounds) and passes the data and indices to the Python worker to do the actual window aggregation.

    It currently materializes all data associated with the same partition key and passes them to Python worker. This is not strictly necessary for sliding windows and can be improved (by possibly slicing data into overlapping chunks and stitching them together).

    This class groups window expressions by their window boundaries so that window expressions with the same window boundaries can share the same window bounds. The window bounds are prepended to the data passed to the python worker.

    For example, if we have:

    • avg(v) over specifiedwindowframe(RowFrame, -5, 5)
    • avg(v) over specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing)
    • avg(v) over specifiedwindowframe(RowFrame, -3, 3)
    • max(v) over specifiedwindowframe(RowFrame, -3, 3)

    the Python input will look like: (lower_bound_w1, upper_bound_w1, lower_bound_w3, upper_bound_w3, v)

    where w1 is specifiedwindowframe(RowFrame, -5, 5), w2 is specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing), and w3 is specifiedwindowframe(RowFrame, -3, 3).

    Note that w2 does not have bound indices in the Python input because it is an unbounded window, so its bound indices are always the same.

    Bounded and unbounded windows are evaluated differently in the Python worker: (1) a bounded window takes the window bound indices in addition to the input columns, while an unbounded window takes only the input columns; (2) a bounded window evaluates the UDF once per input row, while an unbounded window evaluates the UDF once per window partition. This is controlled by the Python runner conf "window_bound_types".

    The logic to compute window bounds is delegated to WindowFunctionFrame and shared with WindowExec.

    Note that this does not support partial aggregation; all aggregation is computed over the entire window.
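
    For illustration, a bounded-window aggregation with a grouped-aggregate Pandas UDF might look like this (a minimal sketch; the SparkSession spark, the data, and the frame bounds are assumptions):

      # Hypothetical windowed Pandas UDF; for the bounded row frame below, the UDF is
      # evaluated once per input row over the window slice described by the bound indices.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf
      from pyspark.sql.window import Window

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:
          return float(v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))
      w = Window.partitionBy("id").orderBy("v").rowsBetween(-1, 1)
      df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()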

  12. case class AttachDistributedSequenceExec(sequenceAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    A physical plan that adds a new long column, sequenceAttr, whose value increases one by one. This is used for the 'distributed-sequence' default index in pandas API on Spark.

  13. abstract class BaseArrowPythonRunner[IN, OUT <: AnyRef] extends BasePythonRunner[IN, OUT] with PythonArrowInput[IN] with PythonArrowOutput[OUT]
  14. abstract class BasePythonUDFRunner extends BasePythonRunner[Array[Byte], Array[Byte]]

    A helper class to run Python UDFs in Spark.

  15. abstract class BaseSliceArrowOutputProcessor extends ArrowOutputProcessorImpl
  16. class BatchEvalPythonEvaluatorFactory extends EvalPythonEvaluatorFactory
  17. case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDF.

  18. case class BatchEvalPythonUDTFExec(udtf: PythonUDTF, requiredChildOutput: Seq[Attribute], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonUDTFExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDTF. This is similar to BatchEvalPythonExec.

    udtf: the user-defined Python function.
    requiredChildOutput: the required output of the child plan, used to omit data generation that would be discarded by a subsequent projection.
    resultAttrs: the output schema of the Python UDTF.
    child: the child plan.

  19. class CoGroupedArrowPythonRunner extends BasePythonRunner[(Iterator[InternalRow], Iterator[InternalRow]), ColumnarBatch] with BasicPythonArrowOutput

    Python UDF runner for cogrouped UDFs. It sends Arrow batches from two different DataFrames, groups them in Python, and receives them back in the JVM as batches of a single DataFrame.

  20. abstract class EvalPythonEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow]
  21. trait EvalPythonExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDF, one partition of tuples at a time.

    Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the result from the Python process with the original row.

    For each row we send to Python, we also put it in a queue first. For each output row from Python, we drain the queue to find the original input row. Note that if the Python process is very slow, this could lead to the queue growing unbounded and spilling to disk when it runs out of memory.

    Here is a diagram to show how this works:

                Downstream (for parent)
                 /      \
                /     socket  (output of UDF)
               /         \
            RowQueue    Python
               \         /
                \     socket  (input of UDF)
                 \       /
              upstream (from child)

    The rows sent to and received from Python are packed into batches (of 100 rows) and serialized. There should always be some rows buffered in the socket or in the Python process, so pulling from the RowQueue ALWAYS happens after pushing into it.
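
    For illustration, a plain row-at-a-time Python UDF such as the following is evaluated through this machinery (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical Python UDF; batches of serialized rows are sent to the Python worker,
      # and the results are joined back with the original rows via the RowQueue described above.
      from pyspark.sql.functions import udf

      @udf("string")
      def greet(name):
          return "hello, " + name

      df = spark.createDataFrame([("Alice",), ("Bob",)], ("name",))
      df.select(greet(df["name"]).alias("greeting")).show()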

  22. trait EvalPythonUDTFExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDTF, one partition of tuples at a time. This is similar to EvalPythonExec.

  23. case class FlatMapCoGroupsInArrowExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with FlatMapCoGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInArrow

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. As each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pyarrow.Tables, invokes the user-defined function, and passes the resulting pyarrow.Table back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory); the memory on the Python side is used to hold the pyarrow.Table. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.

  24. trait FlatMapCoGroupsInBatchExec extends SparkPlan with BinaryExecNode with PythonSQLMetrics

    Base class for Python-based FlatMapCoGroupsIn*Exec.

  25. case class FlatMapCoGroupsInPandasExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with FlatMapCoGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. As each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pandas.DataFrames, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory); the memory on the Python side is used to hold the pandas.DataFrame. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
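
    For illustration, a cogrouped map with applyInPandas might look like this (a minimal sketch; the SparkSession spark, the data, and the merge logic are assumptions):

      # Hypothetical cogrouped applyInPandas; each cogroup arrives as two pandas DataFrames.
      import pandas as pd

      df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
      df2 = spark.createDataFrame([(1, "x"), (2, "y")], ("id", "v2"))

      def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
          return pd.merge(left, right, on="id")

      df1.groupBy("id").cogroup(df2.groupBy("id")).applyInPandas(
          merge, schema="id long, v1 double, v2 string").show()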

  26. case class FlatMapGroupsInArrowExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with FlatMapGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInArrow

    Rows in each group are passed to the Python worker as an iterator of Arrow record batches. The Python worker passes the record batches either as a materialized pyarrow.Table or as an iterator of pyarrow.RecordBatch, depending on the eval type of the user-defined function. The Python worker returns the resulting record batches, which are turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: When using the pyarrow.Table API, the entire group is materialized in memory in the Python worker, and the entire result for a group must also be fully materialized. The iterator of record batches API can be used to avoid this limitation on the Python side.

  27. trait FlatMapGroupsInBatchExec extends SparkPlan with UnaryExecNode with PythonSQLMetrics

    Base class for Python-based FlatMapGroupsIn*Exec.

  28. case class FlatMapGroupsInPandasExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with FlatMapGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas

    Rows in each group are passed to the Python worker as an Arrow record batch. The Python worker turns the record batch into a pandas.DataFrame, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest group. The memory on the Java side is used to construct the record batch (off-heap memory); the memory on the Python side is used to hold the pandas.DataFrame. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
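
    For illustration, a grouped map with applyInPandas might look like this (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical grouped applyInPandas; the whole group is materialized as one
      # Arrow record batch / pandas DataFrame, as described above.
      import pandas as pd

      def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
          return pdf.assign(v=pdf.v - pdf.v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()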

  29. abstract class HybridQueue[T, Q <: Queue[T]] extends MemoryConsumer

    A generic base class for hybrid queues that can store data either in memory or on disk. This class contains common logic for queue management, spilling, and memory management.

  30. case class HybridRowQueue(memManager: TaskMemoryManager, tempDir: File, numFields: Int, serMgr: SerializerManager) extends HybridQueue[UnsafeRow, RowQueue] with Product with Serializable

    A RowQueue that has a list of RowQueues, which could be in memory or disk.

    HybridRowQueue can be safely appended to from one thread and pulled from in another thread at the same time.

  31. case class MapInArrowExec(func: Expression, output: Seq[Attribute], child: SparkPlan, isBarrier: Boolean, profile: Option[ResourceProfile]) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of PyArrow's record batches and outputs an iterator of PyArrow's record batches.
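
    For illustration, mapInArrow applies a function over an iterator of pyarrow.RecordBatch objects per partition (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical mapInArrow usage; the function consumes and yields pyarrow.RecordBatch objects.
      import pyarrow as pa

      def filter_func(iterator):
          for batch in iterator:
              pdf = batch.to_pandas()
              yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])

      df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
      df.mapInArrow(filter_func, schema=df.schema).show()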

  32. class MapInBatchEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow]
  33. trait MapInBatchExec extends SparkPlan with UnaryExecNode with PythonSQLMetrics

    A relation produced by applying a function that takes an iterator of batches such as pandas DataFrame or PyArrow's record batches, and outputs an iterator of them.

    This is somewhat similar to FlatMapGroupsInPandasExec and org.apache.spark.sql.catalyst.plans.logical.MapPartitionsInRWithArrow.

  34. case class MapInPandasExec(func: Expression, output: Seq[Attribute], child: SparkPlan, isBarrier: Boolean, profile: Option[ResourceProfile]) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of pandas DataFrames and outputs an iterator of pandas DataFrames.
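
    For illustration, mapInPandas applies a function over an iterator of pandas DataFrames per partition (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical mapInPandas usage; the function consumes and yields pandas DataFrames.
      def filter_func(iterator):
          for pdf in iterator:
              yield pdf[pdf.id == 1]

      df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
      df.mapInPandas(filter_func, schema=df.schema).show()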

  35. abstract class PythonPlannerRunner[T] extends Logging

    A helper class to run Python functions in Spark driver.

  36. trait PythonSQLMetrics extends AnyRef
  37. class PythonUDFRunner extends BasePythonUDFRunner
  38. class PythonUDFWithNamedArgumentsRunner extends BasePythonUDFRunner
  39. class PythonUDTFRunner extends BasePythonUDFRunner
  40. trait Queue[T] extends AnyRef
  41. abstract class RowInputArrowPythonRunner extends BaseArrowPythonRunner[Iterator[InternalRow], ColumnarBatch] with BasicPythonArrowInput with BasicPythonArrowOutput
  42. trait RowQueue extends Queue[UnsafeRow]

    A RowQueue is a FIFO queue for UnsafeRow.

    This RowQueue is ONLY designed and used for Python UDF execution, which has only one writer and only one reader, and the reader ALWAYS runs behind the writer. See the doc of class BatchEvalPythonExec on how it works.

  43. class SliceBytesArrowOutputProcessorImpl extends BaseSliceArrowOutputProcessor
  44. class SliceRecordsArrowOutputProcessorImpl extends BaseSliceArrowOutputProcessor
  45. case class UserDefinedPythonFunction(name: String, func: PythonFunction, dataType: DataType, pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python function. This is used by the Python API.

  46. case class UserDefinedPythonTableFunction(name: String, func: PythonFunction, returnType: Option[StructType], pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python table function. This is used by the Python API.

  47. class UserDefinedPythonTableFunctionAnalyzeRunner extends PythonPlannerRunner[PythonUDTFAnalyzeResult]

    Runs the Python UDTF's analyze static method.

    When the Python UDTF is defined without a static return type, the analyzer will call this while resolving table-valued functions.

    This expects the Python UDTF to have an analyze static method that takes the following arguments:

    • The number and order of arguments are the same as the UDTF inputs.
    • Each argument is an AnalyzeArgument, containing:
        • dataType: DataType
        • value: Any: if the argument is foldable; otherwise None
        • isTable: bool: True if the argument is TABLE

    and that returns an AnalyzeResult.

    It serializes/deserializes the data types via JSON, and the values of foldable arguments are pickled.

    An AnalysisException with the error class "TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON" will be thrown when an exception is raised in Python.
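
    For illustration, a UDTF whose output schema is computed at analysis time might look like this (a minimal sketch; the class name, the columns, and the logic are assumptions):

      # Hypothetical Python UDTF with a static analyze method; the output schema
      # is derived from the (foldable) arguments at analysis time.
      from pyspark.sql.functions import lit, udtf
      from pyspark.sql.types import IntegerType, StructType
      from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

      @udtf
      class RepeatValue:
          @staticmethod
          def analyze(value: AnalyzeArgument, n: AnalyzeArgument) -> AnalyzeResult:
              schema = StructType().add("value", value.dataType).add("index", IntegerType())
              return AnalyzeResult(schema)

          def eval(self, value, n: int):
              for i in range(n):
                  yield (value, i)

      RepeatValue(lit("a"), lit(3)).show()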

Value Members

  1. object ArrowAggregatePythonExec extends Serializable
  2. object ArrowPythonRunner
  3. object ArrowWindowPythonExec extends Serializable
  4. object BatchEvalPythonExec extends Serializable
  5. object EvalPythonExec extends Serializable
  6. object EvaluatePython
  7. object ExtractGroupingPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts PythonUDFs in a logical aggregate that are used in grouping keys, and evaluates them before the aggregate. This must be executed after the ExtractPythonUDFFromAggregate rule and before ExtractPythonUDFs.

  8. object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts all the Python UDFs in a logical aggregate that depend on an aggregate expression or grouping key, or that do not depend on any of the above, and evaluates them after the aggregate.

  9. object ExtractPythonUDFs extends Rule[LogicalPlan] with Logging

    Extracts PythonUDFs from operators, rewriting the query plan so that the UDF can be evaluated alone in a batch.

    Only extracts the PythonUDFs that can be evaluated in Python (the single child is a PythonUDF, or all the children can be evaluated in the JVM).

    This has the limitation that the input to the Python UDF is not allowed to include attributes from multiple child operators.

  10. object ExtractPythonUDTFs extends Rule[LogicalPlan]

    Extracts PythonUDTFs from operators, rewriting the query plan so that UDTFs can be evaluated.

  11. object HybridRowQueue extends Serializable
  12. object PythonSQLMetrics
  13. object PythonUDFRunner
  14. object PythonUDTFRunner
  15. object QueueMode extends Enumeration

    Enum to represent the storage mode for hybrid queues.
