Performs a region join between two RDDs (shuffle join).
This implementation is shuffle-based, so it does not require collecting one side into memory as BroadcastRegionJoin does. It performs a global sort of each RDD by genome position and then does a sort-merge join, similar to the chromsweep implementation in bedtools. More specifically, it first defines a set of bins across the genome, then assigns each object in the RDDs to every bin that it overlaps (replicating if necessary), performs the shuffle, and sorts the objects in each bin. Finally, each bin independently performs a chromsweep sort-merge join.
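The binning and per-bin sweep described above can be illustrated with a minimal single-machine sketch in plain Python. This is not the Spark implementation: regions are assumed to be half-open (contig, start, end) tuples with end > start, and the names BIN_SIZE, bins_for, assign_to_bins, sweep_bin and shuffle_region_join are hypothetical. Because replication can place the same pair in several bins, the sketch keeps a pair only in the bin where the overlap starts; that de-duplication rule is an assumption of this example, not something the text above specifies.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width, in bases


def overlaps(a, b):
    """True if two half-open (contig, start, end) regions overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bins_for(region, bin_size=BIN_SIZE):
    """Keys of every bin the region touches (this is the replication step)."""
    contig, start, end = region
    return [(contig, i) for i in range(start // bin_size, (end - 1) // bin_size + 1)]


def assign_to_bins(regions, bin_size=BIN_SIZE):
    """Stand-in for the shuffle: group regions by bin, then sort each bin."""
    binned = defaultdict(list)
    for region in regions:
        for key in bins_for(region, bin_size):
            binned[key].append(region)
    for key in binned:
        binned[key].sort(key=lambda r: (r[1], r[2]))
    return binned


def sweep_bin(left, right):
    """Chromsweep-style sort-merge join of two start-sorted lists in one bin."""
    pairs = []
    cache = []  # right regions that may still overlap the current or a later left region
    j = 0
    for l in left:
        # Right regions ending at or before l's start cannot overlap l or any later left.
        cache = [r for r in cache if r[2] > l[1]]
        # Advance through right regions that start before l ends.
        while j < len(right) and right[j][1] < l[2]:
            if right[j][2] > l[1]:
                cache.append(right[j])
            j += 1
        # A cached region may start after l ends, so re-check the overlap.
        pairs.extend((l, r) for r in cache if overlaps(l, r))
    return pairs


def shuffle_region_join(left, right, bin_size=BIN_SIZE):
    """Join each bin independently, keeping a pair only where its overlap starts."""
    left_bins = assign_to_bins(left, bin_size)
    right_bins = assign_to_bins(right, bin_size)
    out = []
    for key in sorted(left_bins.keys() & right_bins.keys()):
        bin_index = key[1]
        for l, r in sweep_bin(left_bins[key], right_bins[key]):
            if max(l[1], r[1]) // bin_size == bin_index:
                out.append((l, r))
    return out
```

In a distributed setting each bin would be processed by a separate task after the shuffle; here the loop over bin keys plays that role.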
The 'left' side of the join
The 'right' side of the join
An RDD of pairs (x, y), where x is from leftRDD, y is from rightRDD, and the region corresponding to x overlaps the region corresponding to y.
A trait describing join implementations that are based on a sort-merge join.
The type of the left RDD.
The type of the right RDD.
The type of data yielded by the left RDD at the output of the join. This may not match T, for example when the join is an outer join.
The type of data yielded by the right RDD at the output of the join.
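To illustrate why the output types can differ from the input types, here is a small hedged sketch in plain Python. It assumes the per-record matches have already been computed and uses hypothetical function names (inner_pairs, left_outer_pairs); None stands in for a missing partner, so in the outer variant the right-hand output type becomes Optional[U] rather than U. A right outer join would do the same to the left-hand type.

```python
from typing import List, Optional, Tuple, TypeVar

T = TypeVar("T")  # type of the left input
U = TypeVar("U")  # type of the right input


def inner_pairs(matches: List[Tuple[T, List[U]]]) -> List[Tuple[T, U]]:
    """Inner join: output types equal input types; unmatched left values are dropped."""
    return [(l, r) for l, rs in matches for r in rs]


def left_outer_pairs(matches: List[Tuple[T, List[U]]]) -> List[Tuple[T, Optional[U]]]:
    """Left outer join: unmatched left values are kept and paired with None."""
    out: List[Tuple[T, Optional[U]]] = []
    for l, rs in matches:
        if rs:
            out.extend((l, r) for r in rs)
        else:
            out.append((l, None))
    return out
```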