SparkContext
refinement predicate
filter predicate
RDD of elements on the left side of the Cartesian join
RDD of elements on the right side of the Cartesian join
RDD of elements on the left side of the Cartesian join
RDD of elements on the right side of the Cartesian join
Performs a Cartesian join of two RDDs using the filter-and-refine pattern.
During RDD declaration, n*m partitions are generated, one for each possible Cartesian pairing of partitions. During RDD execution, summary functions are applied in a map-side reduce to rdd1 and rdd2. These results are collected and filtered using metapred to find partition pairings with potential matches. Partition pairings with possible matches are then checked using pred in a refinement step.

The filter step performs no shuffle of rdd1 or rdd2, but the records of the metardds, produced using the summary functions, are shuffled (as they must be). The metardds contain one item per partition (e.g. a "bounding box" of the records in the parent RDD), so this shuffle is assumed to be low cost.

For efficient execution, it is assumed that potential matches exist for only a limited number of Cartesian pairings. If no filtering is possible, the worst case is a full Cartesian product.
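
The filter-and-refine pattern described above can be illustrated with a minimal sketch in plain Python, without Spark. This is not the class's implementation. Lists of lists stand in for partitioned RDDs, and the names summarize, filter_refine_join and EPS are hypothetical, introduced only for this example. It assumes every partition is non-empty, uses a (min, max) interval as the "bounding box" summary, and matches two numbers when they differ by at most EPS.

```python
from itertools import product

EPS = 1  # hypothetical match tolerance used by both predicates


def summarize(partition):
    """Summary function: one item per partition, here a (min, max) interval."""
    return (min(partition), max(partition))


def metapred(box1, box2):
    """Filter predicate: could any element pair from these partitions match?"""
    lo1, hi1 = box1
    lo2, hi2 = box2
    return lo1 - EPS <= hi2 and lo2 - EPS <= hi1


def pred(a, b):
    """Refinement predicate: exact test on a pair of records."""
    return abs(a - b) <= EPS


def filter_refine_join(parts1, parts2):
    # Summaries are tiny (one per partition), so comparing all n*m of them is cheap.
    meta1 = [summarize(p) for p in parts1]
    meta2 = [summarize(p) for p in parts2]

    # Filter step: keep only partition pairings whose summaries may match.
    candidates = [
        (i, j)
        for i, j in product(range(len(parts1)), range(len(parts2)))
        if metapred(meta1[i], meta2[j])
    ]

    # Refinement step: test actual records, but only inside candidate pairings.
    matches = [
        (a, b)
        for i, j in candidates
        for a in parts1[i]
        for b in parts2[j]
        if pred(a, b)
    ]
    return candidates, matches


rdd1_parts = [[1, 2, 3], [10, 11], [20, 22]]
rdd2_parts = [[3, 4], [12, 13], [30, 31]]
candidates, matches = filter_refine_join(rdd1_parts, rdd2_parts)
print(candidates)  # [(0, 0), (1, 1)]
print(matches)     # [(2, 3), (3, 3), (3, 4), (11, 12)]
```

In this example only 2 of the 9 partition pairings survive the filter step, so the refinement step compares records in those two pairings alone. If metapred returned true for every pairing, the work would degrade to the full Cartesian product.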