Searches a path recursively, returning the names of all directories in the tree whose name matches the given regex.
The path to begin the search at
A regular expression
A sequence of Path objects corresponding to the identified directories.
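The recursive search described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it uses `java.io.File` as a stand-in for the real file system API, the names `DirectorySearch` and `findDirectories` are hypothetical, and it assumes the regex must match the whole directory name rather than a substring.

```scala
import java.io.File
import scala.util.matching.Regex

object DirectorySearch {
  // Walks the tree rooted at `dir` and collects every directory whose
  // name (last path component, not the full path) matches `pattern`.
  // Assumption: the match is against the entire name.
  def findDirectories(dir: File, pattern: Regex): Seq[File] = {
    if (!dir.isDirectory) {
      Seq.empty
    } else {
      val self =
        if (pattern.pattern.matcher(dir.getName).matches()) Seq(dir)
        else Seq.empty
      // listFiles returns null on I/O errors, so guard with Option
      val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      self ++ children.flatMap(findDirectories(_, pattern))
    }
  }
}
```

For example, `DirectorySearch.findDirectories(new File("/data"), ".*\\.adam".r)` would return every directory under `/data` whose name ends in `.adam`.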
Loads alignments from a given path, and infers the input type.
This method can load:
* AlignmentRecords via Parquet (default)
* SAM/BAM (.sam, .bam)
* FASTQ (interleaved, single end, paired end) (.ifq, .fq/.fastq)
* FASTA (.fa, .fasta)
* NucleotideContigFragments via Parquet (.contig.adam)
As hinted above, the input type is inferred from the file path extension.
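The extension-based inference can be sketched as a chain of suffix checks. The `InputType` hierarchy and the `infer` function are hypothetical names used only for illustration; the extension table is taken from the list above, and case-sensitive matching is an assumption.

```scala
object InputTypeInference {
  // Hypothetical labels for the formats listed in the documentation
  sealed trait InputType
  case object SamOrBam extends InputType
  case object InterleavedFastq extends InputType
  case object Fastq extends InputType
  case object Fasta extends InputType
  case object ContigParquet extends InputType
  case object AlignmentParquet extends InputType

  // Picks a format from the path suffix. Anything that does not match
  // a known extension falls through to the Parquet default.
  def infer(path: String): InputType = {
    if (path.endsWith(".sam") || path.endsWith(".bam")) SamOrBam
    else if (path.endsWith(".ifq")) InterleavedFastq
    else if (path.endsWith(".fq") || path.endsWith(".fastq")) Fastq
    else if (path.endsWith(".fa") || path.endsWith(".fasta")) Fasta
    else if (path.endsWith(".contig.adam")) ContigParquet
    else AlignmentParquet
  }
}
```

Falling through to Parquet mirrors the "(default)" note in the list: a path with an unrecognized extension is treated as Parquet-encoded AlignmentRecords.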
Path to load data from.
The fields to project; ignored if not Parquet.
The path to load a second end of FASTQ data from. Ignored if not FASTQ.
Optional record group name to set if loading FASTQ.
Validation stringency used on FASTQ import/merging.
Returns an AlignmentRecordRDD which wraps the RDD of reads, sequence dictionary representing the contigs these reads are aligned to if the reads are aligned, and the record group dictionary for the reads if one is available.
loadFasta
loadFastq
loadInterleavedFastq
loadParquetAlignments
loadBam
Takes a sequence of Path objects and loads alignments from each of those paths.
This infers the type of each path, and thus can be used to load a mixture of different file types from disk. For example, to load 2 BAM files and 3 Parquet files in one call, use this method.
The RDDs obtained from loading each file are simply unioned together, while the record group dictionaries are naively merged. The sequence dictionaries are merged in a way that dedupes the sequence records in each dictionary.
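The merge behavior described above can be sketched with plain collections standing in for RDDs. All types here (`SequenceRecord`, `RecordGroup`, `Loaded`) are simplified, hypothetical stand-ins, and "naively merged" is interpreted as plain concatenation, which is an assumption.

```scala
object MergeSketch {
  // Simplified stand-ins for the real record and dictionary types
  final case class SequenceRecord(name: String, length: Long)
  final case class RecordGroup(name: String, sample: String)
  final case class Loaded(
      reads: Seq[String],
      sequences: Seq[SequenceRecord],
      recordGroups: Seq[RecordGroup])

  // Reads are unioned, sequence records are deduplicated, and record
  // groups are concatenated without any conflict checking (assumed).
  def merge(a: Loaded, b: Loaded): Loaded =
    Loaded(
      reads = a.reads ++ b.reads,
      sequences = (a.sequences ++ b.sequences).distinct,
      recordGroups = a.recordGroups ++ b.recordGroups)

  // Folds the pairwise merge over the result of loading each path
  def mergeAll(inputs: Seq[Loaded]): Loaded =
    inputs.reduce(merge)
}
```

Note that `mergeAll` assumes at least one input; `reduce` throws on an empty sequence.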
The locations of the files to load.
Returns an AlignmentRecordRDD which wraps the RDD of reads, sequence dictionary representing the contigs these reads are aligned to if the reads are aligned, and the record group dictionary for the reads if one is available.
loadAlignments
Loads a SAM/BAM file.
This reads the sequence and record group dictionaries from the SAM/BAM file header. SAMRecords are read from the file and converted to the AlignmentRecord schema.
Path to the file on disk.
Returns an AlignmentRecordRDD which wraps the RDD of reads, sequence dictionary representing the contigs these reads are aligned to if the reads are aligned, and the record group dictionary for the reads if one is available.
loadAlignments
Loads a Parquet file of Features into a CoverageRDD. Coverage is stored in the score attribute of Feature.
File path to load coverage from
CoverageRDD containing an RDD of Coverage
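The relationship between Feature and Coverage can be sketched as a simple mapping over the score field. The case classes below are hypothetical, reduced versions of the real schemas, and dropping features that carry no score is an assumption made for the sketch.

```scala
object CoverageSketch {
  // Reduced, hypothetical versions of the Feature and Coverage schemas
  final case class Feature(contigName: String, start: Long, end: Long, score: Option[Double])
  final case class Coverage(contigName: String, start: Long, end: Long, count: Double)

  // Reads the coverage value out of the feature's score attribute.
  // Features without a score yield None (assumed behavior).
  def toCoverage(f: Feature): Option[Coverage] =
    f.score.map(s => Coverage(f.contigName, f.start, f.end, s))

  def toCoverages(features: Seq[Feature]): Seq[Coverage] =
    features.flatMap(toCoverage)
}
```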
This method should create a new SequenceDictionary from any parquet file which contains records that have the requisite reference{Name,Id,Length,Url} fields.
(If the path is a BAM or SAM file, and the implicit type is a Read, then it defaults to reading the SequenceDictionary out of the BAM header in the normal way.)
The type of records to return
The path to the input data
A SequenceDictionary containing the names and indices of all the sequences to which the records in the corresponding file are aligned.
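Building a dictionary from per-record reference fields can be sketched as a projection followed by deduplication. The `Record` and `SequenceEntry` types are hypothetical; only the four field names (referenceName, referenceId, referenceLength, referenceUrl) come from the description above, and sorting by id is an assumption for a stable result.

```scala
object SequenceDictionarySketch {
  // Hypothetical record carrying the reference{Name,Id,Length,Url} fields
  final case class Record(
      referenceName: String,
      referenceId: Int,
      referenceLength: Long,
      referenceUrl: Option[String])

  final case class SequenceEntry(name: String, id: Int, length: Long, url: Option[String])

  // Projects each record down to its reference fields, removes repeats,
  // and orders the entries by index.
  def fromRecords(records: Seq[Record]): Seq[SequenceEntry] =
    records
      .map(r => SequenceEntry(r.referenceName, r.referenceId, r.referenceLength, r.referenceUrl))
      .distinct
      .sortBy(_.id)
}
```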
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within the specified ReferenceRegions. A BAM index file is required.
The path to the input data. Currently this path must correspond to a single BAM file. The associated BAM index file must have the same name.
Iterable of ReferenceRegions we are filtering on
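The filtering semantics (not the index lookup itself) can be sketched as an overlap test against each requested region. `Region` and `Read` are hypothetical stand-ins; the sketch assumes half-open coordinates and treats "within" as "overlaps", both of which are assumptions.

```scala
object RegionFilterSketch {
  // Simplified stand-in for a reference region, half-open [start, end)
  final case class Region(referenceName: String, start: Long, end: Long) {
    def overlaps(other: Region): Boolean =
      referenceName == other.referenceName &&
        start < other.end && other.start < end
  }

  final case class Read(name: String, region: Region)

  // Keeps a read if it overlaps at least one of the requested regions
  def filterByRegions(reads: Seq[Read], regions: Iterable[Region]): Seq[Read] =
    reads.filter(r => regions.exists(_.overlaps(r.region)))
}
```

An indexed load avoids scanning the whole file by using the index to skip blocks that cannot contain overlapping records; this sketch only shows which records would survive the filter.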
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within a specified ReferenceRegion. A BAM index file is required.
The path to the input data. Currently this path must correspond to a single BAM file. The associated BAM index file must have the same name.
The ReferenceRegion we are filtering on
Loads a VCF file indexed by a tabix (tbi) file into an RDD.
The file to load.
Iterator of ReferenceRegions we are filtering on.
Returns a VariantContextRDD.
Loads a VCF file indexed by a tabix (tbi) file into an RDD.
The file to load.
ReferenceRegions we are filtering on.
Returns a VariantContextRDD.
This method will create a new RDD.
The type of records to return
The path to the input data
An optional pushdown predicate to use when reading the data
An optional projection schema to use when reading the data
An RDD with records of the specified type
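The effect of the optional predicate and projection can be sketched on in-memory rows. This is only an illustration of the semantics: `ParquetLoadSketch`, `Row`, and `load` are hypothetical names, and a real Parquet reader applies the predicate and projection while reading so that unneeded data is never materialized.

```scala
object ParquetLoadSketch {
  // A row is modeled as field name -> value, purely for illustration
  type Row = Map[String, Any]

  // With no predicate, every row is kept; with no projection, every
  // field is kept. When present, each narrows the result.
  def load(
      rows: Seq[Row],
      predicate: Option[Row => Boolean] = None,
      projection: Option[Set[String]] = None): Seq[Row] = {
    val filtered = predicate.fold(rows)(p => rows.filter(p))
    projection.fold(filtered) { fields =>
      filtered.map(_.filter { case (name, _) => fields.contains(name) })
    }
  }
}
```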
Loads alignment data from a Parquet file.
The path of the file to load.
An optional predicate to push down into the file.
An optional schema designating the fields to project.
Returns an AlignmentRecordRDD which wraps the RDD of reads, sequence dictionary representing the contigs these reads are aligned to if the reads are aligned, and the record group dictionary for the reads if one is available.
The sequence dictionary is read from an Avro file stored at filePath/_seqdict.avro and the record group dictionary is read from an Avro file stored at filePath/_rgdict.avro. These files are pure Avro, not Parquet.
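The sidecar file layout can be sketched as simple path construction. `SidecarPaths` and its helpers are hypothetical names; only the two file names, `_seqdict.avro` and `_rgdict.avro`, come from the description above.

```scala
object SidecarPaths {
  // Location of the Avro-encoded sequence dictionary for a Parquet dataset
  def sequenceDictionaryPath(filePath: String): String =
    join(filePath, "_seqdict.avro")

  // Location of the Avro-encoded record group dictionary
  def recordGroupDictionaryPath(filePath: String): String =
    join(filePath, "_rgdict.avro")

  // Joins with a single separator, tolerating a trailing slash on the base
  private def join(base: String, child: String): String =
    base.stripSuffix("/") + "/" + child
}
```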
loadAlignments
Loads a VCF file into an RDD.
The file to load.
Returns a VariantContextRDD.