Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
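The two shapes mentioned above can be sketched with a small toy model. This is only an illustration: `ToyBroadcastMode`, `ToyIdentityMode`, `ToyHashedMode` and the `Row` alias are hypothetical names, not the real trait or its implementations.

```scala
// Toy model only: these names are hypothetical and are not Spark's classes.
object BroadcastModeSketch {
  type Row = (Int, String)

  // A mode describes the shape the rows take before being sent to every node.
  trait ToyBroadcastMode[Shape] {
    def transform(rows: Seq[Row]): Shape
  }

  // Identity: the rows are broadcast exactly as they are.
  object ToyIdentityMode extends ToyBroadcastMode[Seq[Row]] {
    def transform(rows: Seq[Row]): Seq[Row] = rows
  }

  // Hashed: the rows are indexed by their first field, so a lookup by key
  // does not need to scan every row.
  object ToyHashedMode extends ToyBroadcastMode[Map[Int, Seq[Row]]] {
    def transform(rows: Seq[Row]): Map[Int, Seq[Row]] = rows.groupBy(_._1)
  }
}
```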
Represents a partitioning where rows are collected, transformed and broadcasted to each node in the cluster.
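Specifies how tuples that share common expressions will be distributed when a query is executed in parallel on many machines. Distribution can be used to refer to two distinct physical properties: inter-node partitioning of data (how tuples are partitioned across the machines in a cluster) and intra-partition ordering of data (how tuples are arranged within a single partition).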
Represents data where tuples have been ordered according to the ordering Expressions. This is a strictly stronger guarantee than ClusteredDistribution, as an ordering will ensure that tuples that share the same value for the ordering expressions are contiguous and will never be split across partitions.
Describes how an operator's output is split across partitions. The compatibleWith, guarantees, and satisfies methods describe relationships between child partitionings, target partitionings, and Distributions. These relations are described more precisely in their individual method docs, but at a high level:
- satisfies is a relationship between partitionings and distributions.
- compatibleWith is a relationship between an operator's child output partitionings.
- guarantees is a relationship between a child's existing output partitioning and a target output partitioning.

Diagrammatically:

           +--------------+
           | Distribution |
           +--------------+
                   ^
                   |
              satisfies
                   |
           +--------------+                  +--------------+
           |    Child     |                  |    Target    |
      +----| Partitioning |----guarantees--->| Partitioning |
      |    +--------------+                  +--------------+
      |            ^
      |            |
      |     compatibleWith
      |            |
      +------------+
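As a rough illustration of the satisfies relationship only, the toy model below pairs a few distributions with a few partitionings. Every `Toy...` name is hypothetical, column names stand in for expressions, and the subset rule for hash partitioning is an assumption made for this sketch; compatibleWith and guarantees are not modelled.

```scala
// Toy model only: these names are hypothetical and are not Spark's classes.
object PartitioningSketch {
  sealed trait ToyDistribution
  case object ToyUnspecified extends ToyDistribution // no co-location promise
  case object ToyAllTuples extends ToyDistribution   // everything in one partition
  final case class ToyClustered(keys: Set[String]) extends ToyDistribution

  sealed trait ToyPartitioning {
    def numPartitions: Int
    def satisfies(required: ToyDistribution): Boolean
  }

  // A single partition trivially co-locates every tuple.
  case object ToySingle extends ToyPartitioning {
    val numPartitions = 1
    def satisfies(required: ToyDistribution): Boolean = true
  }

  final case class ToyHash(keys: Set[String], numPartitions: Int)
      extends ToyPartitioning {
    def satisfies(required: ToyDistribution): Boolean = required match {
      case ToyUnspecified => true
      case ToyAllTuples   => numPartitions == 1
      // Rows that agree on every clustering key also agree on any subset of
      // those keys, so hashing on a subset keeps them in the same partition.
      case ToyClustered(clusterKeys) => keys.subsetOf(clusterKeys)
    }
  }
}
```

In this sketch, hashing on `a` satisfies a clustering requirement on `a, b`, while hashing on `a, c` does not, because two rows with equal `a` and `b` may differ in `c` and land in different partitions.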
A collection of Partitionings that can be used to describe the partitioning scheme of the output of a physical operator. It is usually used for an operator that has multiple children. In this case, a Partitioning in this collection describes how this operator's output is partitioned based on expressions from a child. For example, for a Join operator on two tables A and B with a join condition A.key1 = B.key2, assuming we use the HashPartitioning scheme, there are two Partitionings that can be used to describe how the output of this Join operator is partitioned: HashPartitioning(A.key1) and HashPartitioning(B.key2). It is also worth noting that the partitionings in this collection do not need to be equivalent, which is useful for Outer Join operators.
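The join example above can be checked with a small simulation. This is a sketch under stated assumptions: `JoinedRow`, `partitionOf` and `innerJoin` are hypothetical helpers, keys are plain integers, and the partitioner is a simple non-negative modulus rather than any real hash function.

```scala
// Toy simulation only: these names are hypothetical and are not Spark's classes.
object JoinPartitioningSketch {
  final case class JoinedRow(aKey1: Int, bKey2: Int)

  // Hypothetical hash partitioner: non-negative modulus of the key.
  def partitionOf(key: Int, numPartitions: Int): Int =
    java.lang.Math.floorMod(key, numPartitions)

  // Inner join on A.key1 = B.key2 over two lists of keys.
  def innerJoin(a: Seq[Int], b: Seq[Int]): Seq[JoinedRow] =
    for (x <- a; y <- b if x == y) yield JoinedRow(x, y)

  // True when partitioning by A.key1 and by B.key2 place every output row
  // in the same partition, so either one describes the output.
  def describedByBothKeys(rows: Seq[JoinedRow], numPartitions: Int): Boolean =
    rows.forall { r =>
      partitionOf(r.aKey1, numPartitions) == partitionOf(r.bKey2, numPartitions)
    }
}
```

Because every inner-join output row has equal key values on both sides, both partitionings agree on where each row lives. For an outer join, unmatched rows carry no value for one side's key, which is presumably why the members of the collection are not required to be equivalent.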
Represents a partitioning where rows are split across partitions based on some total ordering of the expressions specified in ordering. When data is partitioned in this manner the following two conditions are guaranteed to hold:
- All rows where the expressions in ordering evaluate to the same values will be in the same partition.
- Each partition will have a min and max row, relative to the given ordering. All rows that are in between min and max in this ordering will reside in this partition.

This class extends expression primarily so that transformations over expression will descend into its child.
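The two guarantees can be illustrated with a minimal range partitioner over integer keys. This is a sketch, not the real implementation: `partitionOf` and the `upperBounds` vector are hypothetical, and the bounds are assumed to be sorted in ascending order.

```scala
// Toy sketch only: these names are hypothetical and are not Spark's classes.
object RangePartitioningSketch {
  // Partition i holds the keys k with upperBounds(i - 1) < k <= upperBounds(i);
  // keys above the last bound go to one extra, final partition.
  // Assumes upperBounds is sorted in ascending order.
  def partitionOf(key: Int, upperBounds: Vector[Int]): Int =
    upperBounds.indexWhere(key <= _) match {
      case -1 => upperBounds.length
      case i  => i
    }

  // Groups keys by the partition they are assigned to.
  def split(keys: Seq[Int], upperBounds: Vector[Int]): Map[Int, Seq[Int]] =
    keys.groupBy(partitionOf(_, upperBounds))
}
```

Since the partition depends only on the key, equal keys always share a partition, and since the bounds are sorted, each partition covers one contiguous range of the ordering.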
Represents a partitioning where rows are distributed evenly across output partitions by starting from a random target partition number and distributing rows in a round-robin fashion. This partitioning is used when implementing the DataFrame.repartition() operator.
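A minimal sketch of the round-robin idea follows. The `assign` helper is hypothetical and works on an in-memory sequence; it only illustrates the random starting partition and the cyclic assignment described above.

```scala
import scala.util.Random

// Toy sketch only: these names are hypothetical and are not Spark's classes.
object RoundRobinSketch {
  // Picks a random starting partition, then hands out rows one at a time to
  // consecutive partitions, wrapping around after the last one.
  // Partitions that receive no rows are absent from the result.
  def assign[T](rows: Seq[T], numPartitions: Int, rng: Random): Map[Int, Seq[T]] = {
    val start = rng.nextInt(numPartitions)
    rows.zipWithIndex
      .groupBy { case (_, i) => (start + i) % numPartitions }
      .map { case (p, indexed) => p -> indexed.map(_._1) }
  }
}
```

Whatever the starting partition is, the sizes of the resulting partitions differ by at most one row, which is what makes the distribution even.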
Represents a distribution that only has a single partition, in which all tuples of the dataset are co-located.
IdentityBroadcastMode requires that rows are broadcasted in their original form.
Represents a distribution where no promises are made about co-location of data.
Represents data where tuples are broadcasted to every node. It is quite common that the entire set of tuples is transformed into a different data structure.