Performs a cartesian join of two RDDs using a filter-and-refine pattern.
During RDD declaration, n*m partitions will be generated, one for each possible cartesian pairing of partitions.
During RDD execution, summary functions will be applied in a map-side reduce to rdd1 and rdd2.
These results will be collected and filtered using metapred for partitions with potential matches.
Partition pairings with possible matches will be checked using pred in a refinement step.
No shuffle of rdd1 or rdd2 will be performed by the filter step, but the records of the metardds, produced using the summary functions, will be shuffled (as they must be).
The metardds contain one item per partition (e.g., a "bounding box" of the records in the parent RDD), so it is assumed that this shuffle will be low cost.
For efficient execution, it is assumed that potential matches exist for only a limited number of cartesian pairings.
If no filtering is possible, the worst case is a full cartesian product.
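
The filter-and-refine flow described above can be sketched without Spark, using plain Python lists as stand-ins for partitions. This is only an illustration of the pattern, not this method's implementation: the function filter_refine_cartesian, the helpers bounds, may_match and close, and the sample data are all hypothetical, and the sketch assumes every partition is non-empty.

```python
from itertools import product


def filter_refine_cartesian(parts1, parts2, summarize1, summarize2, metapred, pred):
    """Hypothetical stand-in: each inner list plays the role of one partition."""
    # Summary step: one summary item per partition (the "metardd" records).
    meta1 = [summarize1(p) for p in parts1]
    meta2 = [summarize2(p) for p in parts2]
    # Filter step: keep only the partition pairings whose summaries may match.
    candidates = [
        (i, j)
        for (i, m1), (j, m2) in product(enumerate(meta1), enumerate(meta2))
        if metapred(m1, m2)
    ]
    # Refine step: apply the exact predicate only inside surviving pairings.
    matches = [
        (a, b)
        for i, j in candidates
        for a in parts1[i]
        for b in parts2[j]
        if pred(a, b)
    ]
    return candidates, matches


def bounds(partition):
    # One-dimensional "bounding box"; assumes the partition is non-empty.
    return (min(partition), max(partition))


def may_match(b1, b2, tol=1):
    # True when the two boxes, widened by tol, overlap.
    return b1[0] - tol <= b2[1] and b2[0] - tol <= b1[1]


def close(a, b, tol=1):
    # Exact record-level predicate: values within tol of each other.
    return abs(a - b) <= tol


parts1 = [[1, 2, 3], [50, 51]]
parts2 = [[3, 4], [100, 101], [52, 60]]
candidates, matches = filter_refine_cartesian(
    parts1, parts2, bounds, bounds, may_match, close
)
print(candidates)  # [(0, 0), (1, 2)]
print(matches)     # [(2, 3), (3, 3), (3, 4), (51, 52)]
```

In this sample only 2 of the 6 partition pairings survive the filter, so the exact predicate runs on a fraction of the full cartesian product. If may_match returned True for every pairing, the refine step would degrade to the full product, as noted above.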