Calculates the coverage regions for an input set. Note that the input is an Iterable, not an RDD: this is the method we call on each individual partition of the RDD to calculate an initial set of disjoint-but-possibly-adjacent regions within that partition.
The input set of ReferenceRegion objects
The 'coverage regions' of the input set
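The per-partition step described above can be sketched with plain Scala collections. This is a minimal illustration, not the actual implementation: `Region` is a hypothetical stand-in for ReferenceRegion, intervals are assumed to be half-open `[start, end)`, and `coverageOfPartition` is an invented name. It merges only overlapping regions, so adjacent regions stay separate.

```scala
object PartitionCoverageSketch {
  // Hypothetical stand-in for ReferenceRegion; assumes half-open [start, end).
  final case class Region(referenceName: String, start: Long, end: Long)

  // Sorts the regions, then merges each region into its predecessor when the
  // two strictly overlap. Regions that merely touch (end == start) are kept
  // separate, giving a disjoint-but-possibly-adjacent result.
  def coverageOfPartition(regions: Iterable[Region]): List[Region] =
    regions.toList
      .sortBy(r => (r.referenceName, r.start))
      .foldLeft(List.empty[Region]) {
        case (last :: rest, r)
            if last.referenceName == r.referenceName && r.start < last.end =>
          last.copy(end = math.max(last.end, r.end)) :: rest
        case (acc, r) => r :: acc
      }
      .reverse
}
```

For example, `chr1:[0,10)`, `chr1:[5,15)` and `chr1:[15,20)` would yield `chr1:[0,15)` and `chr1:[15,20)`; the remaining adjacency is left for a later collapsing step.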
This is a helper function for findCoverageRegions. It takes a set of input ReferenceRegions, finds all pairs of regions that are adjacent to each other (i.e. pairs (r1, r2) where r1.end == r2.start and r1.referenceName == r2.referenceName), and collapses all such adjacent regions into single contiguous regions.
The input regions set; we assume that this input set is non-overlapping (that no two regions in the input set overlap each other)
The collapsed set of regions. No two regions in the returned RDD should be adjacent; all should be at least one base pair apart (or on separate chromosomes).
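The collapsing of adjacent regions can be sketched on a local collection as follows. This is only an illustration under stated assumptions: the real helper operates on an RDD, `Region` is a hypothetical stand-in for ReferenceRegion, and `collapseAdjacent` is an invented name. As in the description, the input is assumed to be non-overlapping.

```scala
object CollapseAdjacentSketch {
  // Hypothetical stand-in for ReferenceRegion; assumes half-open [start, end).
  final case class Region(referenceName: String, start: Long, end: Long)

  // Assumes the input regions do not overlap. After sorting, a region is
  // folded into its predecessor exactly when
  // r1.end == r2.start && r1.referenceName == r2.referenceName.
  def collapseAdjacent(regions: Iterable[Region]): List[Region] =
    regions.toList
      .sortBy(r => (r.referenceName, r.start))
      .foldLeft(List.empty[Region]) {
        case (last :: rest, r)
            if last.referenceName == r.referenceName && last.end == r.start =>
          last.copy(end = r.end) :: rest
        case (acc, r) => r :: acc
      }
      .reverse
}
```

Because the fold extends the most recent region each time, chains of adjacent regions such as `[0,10)`, `[10,20)`, `[20,25)` collapse into a single `[0,25)` in one pass.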
Calling findCoverageRegions calculates (as an RDD) the coverage regions for a given RDD of input regions.
The primary method.
The input regions whose coverage regions are to be calculated
An RDD containing the ReferenceRegions that make up the coverage regions of the input set 'coveringRegions'
Uses the fixed window width to key each Region by the window Region(s) to which it belongs (through overlap). Since a Region can overlap several windows, the resulting Seq may contain more than one value.
An input Region which is to be keyed to 1 or more windows.
A Seq of Region pairs, where the first element of each pair is one of the windows (of fixed-width) and the second element is the input Region
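One way the window keying could work is sketched below. The names `Region` and `windowsFor` are hypothetical, and the sketch assumes half-open `[start, end)` intervals with windows aligned at multiples of the window size; the actual method may align or represent windows differently.

```scala
object WindowKeySketch {
  // Hypothetical stand-in for ReferenceRegion; assumes half-open [start, end).
  final case class Region(referenceName: String, start: Long, end: Long)

  // Returns one (window, region) pair for every fixed-width window that the
  // region overlaps. Windows are assumed to start at multiples of windowSize.
  def windowsFor(region: Region, windowSize: Long): Seq[(Region, Region)] = {
    require(windowSize > 0, "windowSize must be positive")
    require(region.start >= 0 && region.start < region.end, "region must be non-empty")
    val firstWindow = region.start / windowSize
    val lastWindow = (region.end - 1) / windowSize // end is exclusive
    (firstWindow to lastWindow).map { k =>
      val window = Region(region.referenceName, k * windowSize, (k + 1) * windowSize)
      (window, region)
    }
  }
}
```

With a window size of 1000, a region `chr1:[950,2100)` would be keyed to three windows starting at 0, 1000 and 2000, while `chr1:[0,1000)` would be keyed to the first window only.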
A parameter (which should be a positive number) that determines the parallelism which Coverage uses to calculate the coverage regions. Larger window sizes mean less parallelism, but also fewer subsequent passes.
A base is 'covered' by a region set if any region in the set contains the base itself.
The 'coverage regions' of a region set are the unique, disjoint, non-adjacent, minimal set of regions which contain every covered base, and no bases which are not covered.
The Coverage class calculates the coverage regions for a given region set.
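The definitions above can be checked on small inputs with a brute-force sketch that enumerates every covered base and then groups consecutive bases into maximal runs. This is purely illustrative and is not how the Coverage class computes its result; `Region`, `coveredBases` and `coverageRegions` are hypothetical names, intervals are assumed to be half-open `[start, end)`, and the approach is only practical for tiny regions.

```scala
object CoverageDefinitionSketch {
  // Hypothetical stand-in for ReferenceRegion; assumes half-open [start, end).
  final case class Region(referenceName: String, start: Long, end: Long)

  // A base is covered if any region in the set contains it.
  def coveredBases(regions: Iterable[Region]): Set[(String, Long)] =
    regions.flatMap(r => (r.start until r.end).map(pos => (r.referenceName, pos))).toSet

  // Groups the sorted covered bases into maximal runs of consecutive
  // positions, which are disjoint, non-adjacent, and minimal by construction.
  def coverageRegions(regions: Iterable[Region]): List[Region] =
    coveredBases(regions).toList.sorted
      .foldLeft(List.empty[Region]) {
        case (last :: rest, (name, pos))
            if last.referenceName == name && last.end == pos =>
          last.copy(end = pos + 1) :: rest
        case (acc, (name, pos)) => Region(name, pos, pos + 1) :: acc
      }
      .reverse
}
```

For the input `chr1:[0,10)`, `chr1:[5,15)`, `chr1:[15,20)`, `chr1:[30,40)`, the covered bases are 0 through 19 and 30 through 39, so the coverage regions are `chr1:[0,20)` and `chr1:[30,40)`.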