Packages

package python

Package Members

  1. package streaming

Type Members

  1. case class ArrowAggregatePythonExec(groupingExpressions: Seq[NamedExpression], aggExpressions: Seq[AggregateExpression], resultExpressions: Seq[NamedExpression], child: SparkPlan, evalType: Int) extends SparkPlan with UnaryExecNode with PythonSQLMetrics with Product with Serializable

    Physical node for aggregation with a group aggregate vectorized UDF. The following eval types are supported:

    • SQL_GROUPED_AGG_ARROW_UDF for Arrow UDF
    • SQL_GROUPED_AGG_PANDAS_UDF for Pandas UDF

    This plan works by sending the necessary (projected) grouped input data as Arrow record batches to the Python worker; the Python worker invokes the UDF and sends the results back to the executor; finally, the executor evaluates any post-aggregation expressions and joins the result with the grouping key.
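
    For illustration, user code along the following lines would typically be planned with this node (a minimal PySpark sketch; the SparkSession spark, the column names, and the data are assumptions for illustration only):

      # Hypothetical grouped-aggregate Pandas UDF; each group's values arrive as a pandas Series.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:
          return float(v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").agg(mean_udf(df["v"]).alias("mean_v")).show()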

  2. class ArrowEvalPythonEvaluatorFactory extends EvalPythonEvaluatorFactory
  3. case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a vectorized UDF. The following eval types are supported (see the sketch after this list):

    • SQL_ARROW_BATCHED_UDF for Arrow Optimized Python UDF
    • SQL_SCALAR_ARROW_UDF for Scalar Arrow UDF
    • SQL_SCALAR_ARROW_ITER_UDF for Scalar Iterator Arrow UDF
    • SQL_SCALAR_PANDAS_UDF for Scalar Pandas UDF
    • SQL_SCALAR_PANDAS_ITER_UDF for Scalar Iterator Pandas UDF
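
    As an illustration, a scalar Pandas UDF such as the following would typically be executed by this node (a minimal sketch; the SparkSession spark and the column names are assumptions):

      # Hypothetical scalar Pandas UDF (eval type SQL_SCALAR_PANDAS_UDF);
      # batches of rows are converted to a pandas Series via Arrow.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("long")
      def plus_one(s: pd.Series) -> pd.Series:
          return s + 1

      df = spark.range(10)
      df.select(plus_one(df["id"]).alias("id_plus_one")).show()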
  4. case class ArrowEvalPythonUDTFExec(udtf: PythonUDTF, requiredChildOutput: Seq[Attribute], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonUDTFExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDTF using Apache Arrow. This is similar to ArrowEvalPythonExec.

    udtf: the user-defined Python function.
    requiredChildOutput: the required output of the child plan, used to omit data generation that would be discarded by a subsequent projection.
    resultAttrs: the output schema of the Python UDTF.
    child: the child plan.
    evalType: the Python eval type.
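
    For illustration, an Arrow-optimized Python UDTF might be declared as below (a minimal sketch; the class, the column names, and the useArrow flag of the udtf decorator are assumptions based on recent PySpark versions):

      # Hypothetical Python UDTF evaluated via Arrow; useArrow=True requests the Arrow path.
      from pyspark.sql.functions import lit, udtf

      @udtf(returnType="num: int, squared: int", useArrow=True)
      class SquareNumbers:
          def eval(self, start: int, end: int):
              for n in range(start, end + 1):
                  yield (n, n * n)

      SquareNumbers(lit(1), lit(3)).show()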

  5. trait ArrowOutputProcessor extends AnyRef
  6. class ArrowOutputProcessorImpl extends ArrowOutputProcessor
  7. class ArrowPythonRunner extends RowInputArrowPythonRunner

    Similar to PythonUDFRunner, but exchange data with Python worker via Arrow stream.

  8. class ArrowPythonUDTFRunner extends BasePythonRunner[Iterator[InternalRow], ColumnarBatch] with BatchedPythonArrowInput with BasicPythonArrowOutput

    Similar to ArrowPythonRunner, but for PythonUDTFs.

  9. class ArrowPythonWithNamedArgumentRunner extends RowInputArrowPythonRunner

    Similar to PythonUDFWithNamedArgumentsRunner, but exchange data with Python worker via Arrow stream.

  10. class ArrowWindowPythonEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow] with WindowEvaluatorFactoryBase
  11. case class ArrowWindowPythonExec(windowExpression: Seq[NamedExpression], partitionSpec: Seq[Expression], orderSpec: Seq[SortOrder], child: SparkPlan, evalType: Int) extends SparkPlan with WindowExecBase with PythonSQLMetrics with Product with Serializable

    This class calculates and outputs windowed aggregates over the rows in a single partition. The following eval types are supported:

    • SQL_WINDOW_AGG_ARROW_UDF for Arrow UDF
    • SQL_WINDOW_AGG_PANDAS_UDF for Pandas UDF

    This is similar to WindowExec. The main difference is that this node does not compute any window aggregation values. Instead, it computes the lower and upper bound for each window (i.e. the window bounds) and passes the data and indices to the Python worker to do the actual window aggregation.

    It currently materializes all data associated with the same partition key and passes them to Python worker. This is not strictly necessary for sliding windows and can be improved (by possibly slicing data into overlapping chunks and stitching them together).

    This class groups window expressions by their window boundaries so that window expressions with the same window boundaries can share the same window bounds. The window bounds are prepended to the data passed to the python worker.

    For example, if we have:

    • avg(v) over specifiedwindowframe(RowFrame, -5, 5)
    • avg(v) over specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing)
    • avg(v) over specifiedwindowframe(RowFrame, -3, 3)
    • max(v) over specifiedwindowframe(RowFrame, -3, 3)

    the Python input will look like: (lower_bound_w1, upper_bound_w1, lower_bound_w3, upper_bound_w3, v)

    where w1 is specifiedwindowframe(RowFrame, -5, 5), w2 is specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing), and w3 is specifiedwindowframe(RowFrame, -3, 3).

    Note that w2 does not have bound indices in the Python input because it is an unbounded window, so its bound indices are always the same.

    Bounded and unbounded windows are evaluated differently in the Python worker: (1) a bounded window takes the window bound indices in addition to the input columns, while an unbounded window takes only the input columns; (2) a bounded window evaluates the UDF once per input row, while an unbounded window evaluates the UDF once per window partition. This is controlled by the Python runner conf "window_bound_types".

    The logic to compute window bounds is delegated to WindowFunctionFrame and shared with WindowExec.

    Note that this does not support partial aggregation; all aggregation is computed over the entire window.
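
    For illustration, a bounded-window aggregation with a grouped-aggregate Pandas UDF might look like this (a minimal sketch; the SparkSession spark, the data, and the frame bounds are assumptions):

      # Hypothetical windowed Pandas UDF; for the bounded row frame below, the UDF is
      # evaluated once per input row over the window slice described by the bound indices.
      import pandas as pd
      from pyspark.sql.functions import pandas_udf
      from pyspark.sql.window import Window

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:
          return float(v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))
      w = Window.partitionBy("id").orderBy("v").rowsBetween(-1, 1)
      df.withColumn("mean_v", mean_udf(df["v"]).over(w)).show()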

  12. case class AttachDistributedSequenceExec(sequenceAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    A physical plan that adds a new long column, sequenceAttr, whose value increases one by one. This is used for the 'distributed-sequence' default index in pandas API on Spark.

  13. abstract class BaseArrowPythonRunner[IN, OUT <: AnyRef] extends BasePythonRunner[IN, OUT] with PythonArrowInput[IN] with PythonArrowOutput[OUT]
  14. abstract class BasePythonUDFRunner extends BasePythonRunner[Array[Byte], Array[Byte]]

    A helper class to run Python UDFs in Spark.

  15. abstract class BaseSliceArrowOutputProcessor extends ArrowOutputProcessorImpl
  16. class BatchEvalPythonEvaluatorFactory extends EvalPythonEvaluatorFactory
  17. case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDF.

  18. case class BatchEvalPythonUDTFExec(udtf: PythonUDTF, requiredChildOutput: Seq[Attribute], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonUDTFExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDTF. This is similar to BatchEvalPythonExec.

    udtf: the user-defined Python function.
    requiredChildOutput: the required output of the child plan, used to omit data generation that would be discarded by a subsequent projection.
    resultAttrs: the output schema of the Python UDTF.
    child: the child plan.

  19. class CoGroupedArrowPythonRunner extends BasePythonRunner[(Iterator[InternalRow], Iterator[InternalRow]), ColumnarBatch] with BasicPythonArrowOutput

    Python UDF runner for cogrouped UDFs. It sends Arrow batches from two different DataFrames, groups them in Python, and receives them back in the JVM as batches of a single DataFrame.

  20. abstract class EvalPythonEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow]
  21. trait EvalPythonExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDF, one partition of tuples at a time.

    Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the result from the Python process with the original row.

    For each row we send to Python, we also put it in a queue first. For each output row from Python, we drain the queue to find the original input row. Note that if the Python process is very slow, this could lead to the queue growing unbounded and spilling to disk when it runs out of memory.

    Here is a diagram to show how this works:

                Downstream (for parent)
                 /      \
                /     socket  (output of UDF)
               /         \
            RowQueue    Python
               \         /
                \     socket  (input of UDF)
                 \       /
              upstream (from child)

    The rows sent to and received from Python are packed into batches (of 100 rows) and serialized. There should always be some rows buffered in the socket or in the Python process, so pulling from the RowQueue ALWAYS happens after pushing into it.
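
    For illustration, a plain row-at-a-time Python UDF such as the following is evaluated through this machinery (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical Python UDF; batches of serialized rows are sent to the Python worker,
      # and the results are joined back with the original rows via the RowQueue described above.
      from pyspark.sql.functions import udf

      @udf("string")
      def greet(name):
          return "hello, " + name

      df = spark.createDataFrame([("Alice",), ("Bob",)], ("name",))
      df.select(greet(df["name"]).alias("greeting")).show()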

  22. trait EvalPythonUDTFExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDTF, one partition of tuples at a time. This is similar to EvalPythonExec.

  23. case class FlatMapCoGroupsInArrowExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with FlatMapCoGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInArrow

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. As each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pyarrow.Tables, invokes the user-defined function, and passes the resulting pyarrow.Table back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory); the memory on the Python side is used to hold the pyarrow.Table. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.

  24. trait FlatMapCoGroupsInBatchExec extends SparkPlan with BinaryExecNode with PythonSQLMetrics

    Base class for Python-based FlatMapCoGroupsIn*Exec.

  25. case class FlatMapCoGroupsInPandasExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with FlatMapCoGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. As each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pandas.DataFrames, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory); the memory on the Python side is used to hold the pandas.DataFrame. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
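
    For illustration, a cogrouped map with applyInPandas might look like this (a minimal sketch; the SparkSession spark, the data, and the merge logic are assumptions):

      # Hypothetical cogrouped applyInPandas; each cogroup arrives as two pandas DataFrames.
      import pandas as pd

      df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
      df2 = spark.createDataFrame([(1, "x"), (2, "y")], ("id", "v2"))

      def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
          return pd.merge(left, right, on="id")

      df1.groupBy("id").cogroup(df2.groupBy("id")).applyInPandas(
          merge, schema="id long, v1 double, v2 string").show()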

  26. case class FlatMapGroupsInArrowExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with FlatMapGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInArrow

    Rows in each group are passed to the Python worker as an iterator of Arrow record batches. The Python worker passes the record batches either as a materialized pyarrow.Table or as an iterator of pyarrow.RecordBatch, depending on the eval type of the user-defined function. The Python worker returns the resulting record batches, which are turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: When using the pyarrow.Table API, the entire group is materialized in memory in the Python worker, and the entire result for a group must also be fully materialized. The iterator of record batches API can be used to avoid this limitation on the Python side.

  27. trait FlatMapGroupsInBatchExec extends SparkPlan with UnaryExecNode with PythonSQLMetrics

    Base class for Python-based FlatMapGroupsIn*Exec.

  28. case class FlatMapGroupsInPandasExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with FlatMapGroupsInBatchExec with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas

    Rows in each group are passed to the Python worker as an Arrow record batch. The Python worker turns the record batch into a pandas.DataFrame, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need enough memory to hold the largest group. The memory on the Java side is used to construct the record batch (off-heap memory); the memory on the Python side is used to hold the pandas.DataFrame. It is possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
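
    For illustration, a grouped map with applyInPandas might look like this (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical grouped applyInPandas; the whole group is materialized as one
      # Arrow record batch / pandas DataFrame, as described above.
      import pandas as pd

      def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
          return pdf.assign(v=pdf.v - pdf.v.mean())

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()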

  29. abstract class HybridQueue[T, Q <: Queue[T]] extends MemoryConsumer

    A generic base class for hybrid queues that can store data either in memory or on disk. This class contains common logic for queue management, spilling, and memory management.

  30. case class HybridRowQueue(memManager: TaskMemoryManager, tempDir: File, numFields: Int, serMgr: SerializerManager) extends HybridQueue[UnsafeRow, RowQueue] with Product with Serializable

    A RowQueue that has a list of RowQueues, which could be in memory or disk.

    HybridRowQueue can be safely appended to from one thread and pulled from in another thread at the same time.

  31. case class MapInArrowExec(func: Expression, output: Seq[Attribute], child: SparkPlan, isBarrier: Boolean, profile: Option[ResourceProfile]) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of PyArrow's record batches and outputs an iterator of PyArrow's record batches.
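
    For illustration, mapInArrow applies a function over an iterator of pyarrow.RecordBatch objects per partition (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical mapInArrow usage; the function consumes and yields pyarrow.RecordBatch objects.
      import pyarrow as pa

      def filter_func(iterator):
          for batch in iterator:
              pdf = batch.to_pandas()
              yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])

      df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
      df.mapInArrow(filter_func, schema=df.schema).show()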

  32. class MapInBatchEvaluatorFactory extends PartitionEvaluatorFactory[InternalRow, InternalRow]
  33. trait MapInBatchExec extends SparkPlan with UnaryExecNode with PythonSQLMetrics

    A relation produced by applying a function that takes an iterator of batches such as pandas DataFrame or PyArrow's record batches, and outputs an iterator of them.

    This is somewhat similar to FlatMapGroupsInPandasExec and org.apache.spark.sql.catalyst.plans.logical.MapPartitionsInRWithArrow.

  34. case class MapInPandasExec(func: Expression, output: Seq[Attribute], child: SparkPlan, isBarrier: Boolean, profile: Option[ResourceProfile]) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of pandas DataFrames and outputs an iterator of pandas DataFrames.
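
    For illustration, mapInPandas applies a function over an iterator of pandas DataFrames per partition (a minimal sketch; the SparkSession spark and the data are assumptions):

      # Hypothetical mapInPandas usage; the function consumes and yields pandas DataFrames.
      def filter_func(iterator):
          for pdf in iterator:
              yield pdf[pdf.id == 1]

      df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
      df.mapInPandas(filter_func, schema=df.schema).show()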

  35. abstract class PythonPlannerRunner[T] extends Logging

    A helper class to run Python functions in Spark driver.

  36. trait PythonSQLMetrics extends AnyRef
  37. class PythonUDFRunner extends BasePythonUDFRunner
  38. class PythonUDFWithNamedArgumentsRunner extends BasePythonUDFRunner
  39. class PythonUDTFRunner extends BasePythonUDFRunner
  40. trait Queue[T] extends AnyRef
  41. abstract class RowInputArrowPythonRunner extends BaseArrowPythonRunner[Iterator[InternalRow], ColumnarBatch] with BasicPythonArrowInput with BasicPythonArrowOutput
  42. trait RowQueue extends Queue[UnsafeRow]

    A RowQueue is a FIFO queue for UnsafeRow.

    This RowQueue is ONLY designed and used for Python UDF execution, which has only one writer and only one reader, and the reader ALWAYS runs behind the writer. See the doc of class BatchEvalPythonExec on how it works.

  43. class SliceBytesArrowOutputProcessorImpl extends BaseSliceArrowOutputProcessor
  44. class SliceRecordsArrowOutputProcessorImpl extends BaseSliceArrowOutputProcessor
  45. case class UserDefinedPythonFunction(name: String, func: PythonFunction, dataType: DataType, pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python function. This is used by the Python API.

  46. case class UserDefinedPythonTableFunction(name: String, func: PythonFunction, returnType: Option[StructType], pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python table function. This is used by the Python API.

  47. class UserDefinedPythonTableFunctionAnalyzeRunner extends PythonPlannerRunner[PythonUDTFAnalyzeResult]

    Runs the Python UDTF's analyze static method.

    When the Python UDTF is defined without a static return type, the analyzer will call this while resolving table-valued functions.

    This expects the Python UDTF to have an analyze static method that takes the following arguments:

    • The number and order of arguments are the same as the UDTF inputs.
    • Each argument is an AnalyzeArgument, containing:
        • dataType: DataType
        • value: Any: if the argument is foldable; otherwise None
        • isTable: bool: True if the argument is TABLE

    and that returns an AnalyzeResult.

    It serializes/deserializes the data types via JSON, and the values of foldable arguments are pickled.

    An AnalysisException with the error class "TABLE_VALUED_FUNCTION_FAILED_TO_ANALYZE_IN_PYTHON" will be thrown when an exception is raised in Python.
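
    For illustration, a UDTF whose output schema is computed at analysis time might look like this (a minimal sketch; the class name, the columns, and the logic are assumptions):

      # Hypothetical Python UDTF with a static analyze method; the output schema
      # is derived from the (foldable) arguments at analysis time.
      from pyspark.sql.functions import lit, udtf
      from pyspark.sql.types import IntegerType, StructType
      from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

      @udtf
      class RepeatValue:
          @staticmethod
          def analyze(value: AnalyzeArgument, n: AnalyzeArgument) -> AnalyzeResult:
              schema = StructType().add("value", value.dataType).add("index", IntegerType())
              return AnalyzeResult(schema)

          def eval(self, value, n: int):
              for i in range(n):
                  yield (value, i)

      RepeatValue(lit("a"), lit(3)).show()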

Value Members

  1. object ArrowAggregatePythonExec extends Serializable
  2. object ArrowPythonRunner
  3. object ArrowWindowPythonExec extends Serializable
  4. object BatchEvalPythonExec extends Serializable
  5. object EvalPythonExec extends Serializable
  6. object EvaluatePython
  7. object ExtractGroupingPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts PythonUDFs in a logical aggregate that are used in grouping keys, and evaluates them before the aggregate. This must be executed after the ExtractPythonUDFFromAggregate rule and before ExtractPythonUDFs.

  8. object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts all the Python UDFs in a logical aggregate that depend on an aggregate expression or grouping key, or that do not depend on any of the above, and evaluates them after the aggregate.

  9. object ExtractPythonUDFs extends Rule[LogicalPlan] with Logging

    Extracts PythonUDFs from operators, rewriting the query plan so that the UDF can be evaluated alone in a batch.

    Only extracts the PythonUDFs that can be evaluated in Python (the single child is a PythonUDF, or all the children can be evaluated in the JVM).

    This has the limitation that the input to the Python UDF is not allowed to include attributes from multiple child operators.

  10. object ExtractPythonUDTFs extends Rule[LogicalPlan]

    Extracts PythonUDTFs from operators, rewriting the query plan so that UDTFs can be evaluated.

  11. object HybridRowQueue extends Serializable
  12. object PythonSQLMetrics
  13. object PythonUDFRunner
  14. object PythonUDTFRunner
  15. object QueueMode extends Enumeration

    Enum to represent the storage mode for hybrid queues.
