Elaborates out a directory/glob/plain path.
Path to elaborate.
The underlying file system that this path is on.
Returns an array of Paths to load.
if the path does not match any files.
getFsAndFiles
Elaborates out a directory/glob/plain path.
Path to elaborate.
Returns an array of Paths to load.
if the path does not match any files.
getFiles
Elaborates out a directory/glob/plain path name.
Path name to elaborate.
Filter to discard paths.
Returns an array of Paths to load.
if the path does not match any files.
getFiles
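The elaboration step above can be sketched in plain Scala. This is an illustrative sketch only: the real implementation resolves globs through the Hadoop FileSystem API, and `elaboratePaths` is a hypothetical name, not ADAM API.

```scala
import java.nio.file.{FileSystems, Paths}

// Hypothetical sketch of glob elaboration: expand a glob against candidate
// paths, apply a caller-supplied filter, and fail loudly when nothing
// matches, mirroring the documented "does not match any files" behavior.
def elaboratePaths(pattern: String,
                   candidates: Seq[String],
                   filter: String => Boolean = _ => true): Seq[String] = {
  val matcher = FileSystems.getDefault.getPathMatcher("glob:" + pattern)
  val matched = candidates.filter(c => matcher.matches(Paths.get(c)) && filter(c))
  if (matched.isEmpty) {
    throw new java.io.FileNotFoundException(s"Couldn't find any files matching $pattern")
  }
  matched
}
```

A plain path or directory falls out of the same machinery, since a glob with no wildcards matches only itself.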
Load alignment records into an AlignmentRecordRDD.
Loads path names ending in:
* .bam/.cram/.sam as BAM/CRAM/SAM format,
* .fa/.fasta as FASTA format,
* .fq/.fastq as FASTQ format, and
* .ifq as interleaved FASTQ format.
If none of these match, fall back to Parquet + Avro.
For FASTA, FASTQ, and interleaved FASTQ formats, compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load alignment records from. Globs/directories are supported, although file extension must be present for BAM/CRAM/SAM, FASTA, and FASTQ formats.
The optional path name to load the second set of alignment records from, if loading paired FASTQ format. Globs/directories are supported, although file extension must be present. Defaults to None.
The optional record group name to associate to the alignment records. Defaults to None.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating BAM/CRAM/SAM or FASTQ formats. Defaults to ValidationStringency.STRICT.
Returns an AlignmentRecordRDD which wraps the RDD of alignment records, sequence dictionary representing contigs the alignment records may be aligned to, and the record group dictionary for the alignment records if one is available.
loadParquetAlignments
loadInterleavedFastq
loadFasta
loadFastq
loadBam
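The format dispatch above can be sketched as a pure function. This is a simplification: `pickAlignmentLoader` is an illustrative name, not ADAM API, and compression suffixes such as .gz/.bz2 are ignored here for brevity.

```scala
// Simplified sketch of loadAlignments format dispatch by file extension.
def pickAlignmentLoader(pathName: String): String = {
  val p = pathName.toLowerCase
  if (p.endsWith(".bam") || p.endsWith(".cram") || p.endsWith(".sam")) "loadBam"
  else if (p.endsWith(".fa") || p.endsWith(".fasta")) "loadFasta"
  else if (p.endsWith(".fq") || p.endsWith(".fastq")) "loadFastq"
  else if (p.endsWith(".ifq")) "loadInterleavedFastq"
  else "loadParquetAlignments" // fall back to Parquet + Avro
}
```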
Load alignment records from BAM/CRAM/SAM into an AlignmentRecordRDD.
This reads the sequence and record group dictionaries from the BAM/CRAM/SAM file header. SAMRecords are read from the file and converted to the AlignmentRecord schema.
The path name to load BAM/CRAM/SAM formatted alignment records from. Globs/directories are supported.
The validation stringency to use when validating the BAM/CRAM/SAM format header. Defaults to ValidationStringency.STRICT.
Returns an AlignmentRecordRDD which wraps the RDD of alignment records, sequence dictionary representing contigs the alignment records may be aligned to, and the record group dictionary for the alignment records if one is available.
Load a path name in BED6/12 format into a FeatureRDD.
The path name to load features in BED6/12 format from. Globs/directories are supported.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism. Defaults to None.
The validation stringency to use when validating BED6/12 format. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
Load nucleotide contig fragments into a NucleotideContigFragmentRDD.
If the path name has a .fa/.fasta extension, load as FASTA format. Else, fall back to Parquet + Avro.
For FASTA format, compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load nucleotide contig fragments from. Globs/directories are supported, although file extension must be present for FASTA format.
Maximum fragment length. Defaults to 10000L. Values greater than 1e9 should be avoided.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a NucleotideContigFragmentRDD.
loadParquetContigFragments
loadFasta
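The effect of maximum fragment length can be sketched as slicing a contig sequence into fixed-size pieces. This is illustrative only; ADAM's NucleotideContigFragment records also carry contig metadata and fragment indices.

```scala
// Sketch: chop a contig sequence into fragments of at most
// maxFragmentLength bases; the final fragment may be shorter.
def fragmentSequence(sequence: String, maxFragmentLength: Int = 10000): Seq[String] =
  sequence.grouped(maxFragmentLength).toSeq
```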
Load features into a FeatureRDD and convert to a CoverageRDD. Coverage is stored in the score field of Feature.
Loads path names ending in:
* .bed as BED6/12 format,
* .gff3 as GFF3 format,
* .gtf/.gff as GTF/GFF2 format,
* .narrow[pP]eak as NarrowPeak format, and
* .interval_list as IntervalList format.
If none of these match, fall back to Parquet + Avro.
For BED6/12, GFF3, GTF/GFF2, NarrowPeak, and IntervalList formats, compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load features from. Globs/directories are supported, although file extension must be present for BED6/12, GFF3, GTF/GFF2, NarrowPeak, or IntervalList formats.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to use. For textual formats, if this is None, fall back to the Spark default parallelism. Defaults to None.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating BED6/12, GFF3, GTF/GFF2, NarrowPeak, or IntervalList formats. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD converted to a CoverageRDD.
loadParquetFeatures
loadIntervalList
loadNarrowPeak
loadGff3
loadGtf
loadBed
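The "coverage is stored in the score field" conversion can be sketched with simplified case classes. The field and type names here are illustrative stand-ins, not ADAM's Avro schema.

```scala
// Simplified stand-ins for ADAM's Feature and Coverage records.
case class SimpleFeature(contigName: String, start: Long, end: Long, score: Double)
case class SimpleCoverage(contigName: String, start: Long, end: Long, count: Double)

// Sketch of the Feature-to-Coverage conversion: the feature's score
// becomes the coverage count over the same interval.
def toCoverage(f: SimpleFeature): SimpleCoverage =
  SimpleCoverage(f.contigName, f.start, f.end, f.score)
```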
Load nucleotide contig fragments from FASTA into a NucleotideContigFragmentRDD.
The path name to load nucleotide contig fragments from. Globs/directories are supported.
Maximum fragment length. Defaults to 10000L. Values greater than 1e9 should be avoided.
Returns a NucleotideContigFragmentRDD.
Load unaligned alignment records from (possibly paired) FASTQ into an AlignmentRecordRDD.
The path name to load the first set of unaligned alignment records from. Globs/directories are supported.
The path name to load the second set of unaligned alignment records from, if provided. Globs/directories are supported.
The optional record group name to associate to the unaligned alignment records. Defaults to None.
The validation stringency to use when validating (possibly paired) FASTQ format. Defaults to ValidationStringency.STRICT.
Returns an unaligned AlignmentRecordRDD.
loadUnpairedFastq
loadPairedFastq
Load features into a FeatureRDD.
Loads path names ending in:
* .bed as BED6/12 format,
* .gff3 as GFF3 format,
* .gtf/.gff as GTF/GFF2 format,
* .narrow[pP]eak as NarrowPeak format, and
* .interval_list as IntervalList format.
If none of these match, fall back to Parquet + Avro.
For BED6/12, GFF3, GTF/GFF2, NarrowPeak, and IntervalList formats, compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load features from. Globs/directories are supported, although file extension must be present for BED6/12, GFF3, GTF/GFF2, NarrowPeak, or IntervalList formats.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to use. For textual formats, if this is None, fall back to the Spark default parallelism. Defaults to None.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating BED6/12, GFF3, GTF/GFF2, NarrowPeak, or IntervalList formats. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
loadParquetFeatures
loadIntervalList
loadNarrowPeak
loadGff3
loadGtf
loadBed
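The feature-format dispatch can be sketched the same way as the alignment case. Again, `pickFeatureLoader` is an illustrative name, not ADAM API, and compression suffixes are omitted for brevity.

```scala
// Simplified sketch of loadFeatures format dispatch by file extension.
def pickFeatureLoader(pathName: String): String = {
  val p = pathName.toLowerCase // also folds .narrowPeak to .narrowpeak
  if (p.endsWith(".bed")) "loadBed"
  else if (p.endsWith(".gff3")) "loadGff3"
  else if (p.endsWith(".gtf") || p.endsWith(".gff")) "loadGtf"
  else if (p.endsWith(".narrowpeak")) "loadNarrowPeak"
  else if (p.endsWith(".interval_list")) "loadIntervalList"
  else "loadParquetFeatures" // fall back to Parquet + Avro
}
```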
Load fragments into a FragmentRDD.
Loads path names ending in:
* .bam/.cram/.sam as BAM/CRAM/SAM format and
* .ifq as interleaved FASTQ format.
If none of these match, fall back to Parquet + Avro.
For interleaved FASTQ format, compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load fragments from. Globs/directories are supported, although file extension must be present for BAM/CRAM/SAM and FASTQ formats.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating BAM/CRAM/SAM or FASTQ formats. Defaults to ValidationStringency.STRICT.
Returns a FragmentRDD.
loadParquetFragments
loadInterleavedFastqAsFragments
loadAlignments
loadBam
Load genotypes into a GenotypeRDD.
If the path name has a .vcf/.vcf.gz/.vcf.bgz extension, load as VCF format. Else, fall back to Parquet + Avro.
The path name to load genotypes from. Globs/directories are supported, although file extension must be present for VCF format.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating VCF format. Defaults to ValidationStringency.STRICT.
Returns a GenotypeRDD.
loadParquetGenotypes
loadVcf
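The VCF-or-Parquet decision can be sketched as below. Note that, unlike most formats here, the compressed .vcf.gz/.vcf.bgz extensions are matched explicitly rather than handled by a codec. The function names are illustrative, not ADAM API.

```scala
// Sketch: loadGenotypes (and loadVariants) pick VCF when the path carries
// a recognized VCF extension, and fall back to Parquet + Avro otherwise.
def isVcfPath(pathName: String): Boolean =
  pathName.endsWith(".vcf") || pathName.endsWith(".vcf.gz") || pathName.endsWith(".vcf.bgz")

def pickGenotypeLoader(pathName: String): String =
  if (isVcfPath(pathName)) "loadVcf" else "loadParquetGenotypes"
```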
Load a path name in GFF3 format into a FeatureRDD.
The path name to load features in GFF3 format from. Globs/directories are supported.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism. Defaults to None.
The validation stringency to use when validating GFF3 format. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
Load a path name in GTF/GFF2 format into a FeatureRDD.
The path name to load features in GTF/GFF2 format from. Globs/directories are supported.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism. Defaults to None.
The validation stringency to use when validating GTF/GFF2 format. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within the specified ReferenceRegions. BAM index file required.
The path name to load indexed BAM formatted alignment records from. Globs/directories are supported.
Iterable of ReferenceRegion we are filtering on.
The validation stringency to use when validating the BAM/CRAM/SAM format header. Defaults to ValidationStringency.STRICT.
Returns an AlignmentRecordRDD which wraps the RDD of alignment records, sequence dictionary representing contigs the alignment records may be aligned to, and the record group dictionary for the alignment records if one is available.
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within a specified ReferenceRegion. BAM index file required.
The path name to load indexed BAM formatted alignment records from. Globs/directories are supported.
The ReferenceRegion we are filtering on.
Returns an AlignmentRecordRDD which wraps the RDD of alignment records, sequence dictionary representing contigs the alignment records may be aligned to, and the record group dictionary for the alignment records if one is available.
Load variant context records from VCF indexed by tabix (tbi) into a VariantContextRDD.
The path name to load VCF variant context records from. Globs/directories are supported.
Iterator of ReferenceRegions we are filtering on.
The validation stringency to use when validating VCF format. Defaults to ValidationStringency.STRICT.
Returns a VariantContextRDD.
Load variant context records from VCF indexed by tabix (tbi) into a VariantContextRDD.
The path name to load VCF variant context records from. Globs/directories are supported.
ReferenceRegion we are filtering on.
Returns a VariantContextRDD.
Load unaligned alignment records from interleaved FASTQ into an AlignmentRecordRDD.
In interleaved FASTQ, the two reads from a paired sequencing protocol are interleaved in a single file. This is a zipped representation of the typical paired FASTQ.
The path name to load unaligned alignment records from. Globs/directories are supported.
Returns an unaligned AlignmentRecordRDD.
Load paired unaligned alignment records grouped by sequencing fragment from interleaved FASTQ into a FragmentRDD.
In interleaved FASTQ, the two reads from a paired sequencing protocol are interleaved in a single file. This is a zipped representation of the typical paired FASTQ.
Fragments represent all of the reads from a single sequenced fragment as a single object, which is a useful representation for some tasks.
The path name to load unaligned alignment records from. Globs/directories are supported.
Returns a FragmentRDD containing the paired reads grouped by sequencing fragment.
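The interleaved layout can be illustrated with a small sketch: consecutive records alternate first/second of pair, so grouping them two at a time recovers one fragment per sequenced insert. This is pure Scala for illustration, not ADAM API.

```scala
// Sketch: pair up consecutive reads from an interleaved stream; each pair
// corresponds to one sequenced fragment. A dangling unpaired read is
// silently dropped here; a real loader would raise a validation error.
def pairInterleaved[T](reads: Seq[T]): Seq[(T, T)] =
  reads.grouped(2).collect { case Seq(r1, r2) => (r1, r2) }.toSeq
```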
Load a path name in IntervalList format into a FeatureRDD.
The path name to load features in IntervalList format from. Globs/directories are supported.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism. Defaults to None.
The validation stringency to use when validating IntervalList format. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
Load a path name in NarrowPeak format into a FeatureRDD.
The path name to load features in NarrowPeak format from. Globs/directories are supported.
Optional sequence dictionary. Defaults to None.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism. Defaults to None.
The validation stringency to use when validating NarrowPeak format. Defaults to ValidationStringency.STRICT.
Returns a FeatureRDD.
Load unaligned alignment records from paired FASTQ into an AlignmentRecordRDD.
The path name to load the first set of unaligned alignment records from. Globs/directories are supported.
The path name to load the second set of unaligned alignment records from. Globs/directories are supported.
The optional record group name to associate to the unaligned alignment records. Defaults to None.
The validation stringency to use when validating paired FASTQ format. Defaults to ValidationStringency.STRICT.
Returns an unaligned AlignmentRecordRDD.
Load a path name in Parquet + Avro format into an RDD.
The type of records to return.
The path name to load Parquet + Avro formatted data from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
An RDD with records of the specified type.
Load a path name in Parquet + Avro format into an AlignmentRecordRDD.
The path name to load alignment records from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns an AlignmentRecordRDD which wraps the RDD of alignment records, sequence dictionary representing contigs the alignment records may be aligned to, and the record group dictionary for the alignment records if one is available.
The sequence dictionary is read from an Avro file stored at pathName/_seqdict.avro and the record group dictionary is read from an Avro file stored at pathName/_rgdict.avro. These files are pure Avro, not Parquet + Avro.
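The sidecar layout described above amounts to fixed file names under the Parquet directory, which can be sketched trivially (helper names are illustrative):

```scala
// Sketch: metadata sidecar locations for a Parquet + Avro alignment
// directory; these files are pure Avro, not Parquet + Avro.
def seqDictPath(pathName: String): String = s"$pathName/_seqdict.avro"
def rgDictPath(pathName: String): String = s"$pathName/_rgdict.avro"
```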
Load a path name in Parquet + Avro format into a NucleotideContigFragmentRDD.
The path name to load nucleotide contig fragments from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a NucleotideContigFragmentRDD.
Load a path name in Parquet + Avro format into a FeatureRDD and convert to a CoverageRDD. Coverage is stored in the score field of Feature.
The path name to load features from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
Forces loading the RDD.
Returns a FeatureRDD converted to a CoverageRDD.
Load a path name in Parquet + Avro format into a FeatureRDD.
The path name to load features from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a FeatureRDD.
Load a path name in Parquet + Avro format into a FragmentRDD.
The path name to load fragments from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a FragmentRDD.
Load a path name in Parquet + Avro format into a GenotypeRDD.
The path name to load genotypes from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a GenotypeRDD.
Load a path name in Parquet + Avro format into a VariantRDD.
The path name to load variants from. Globs/directories are supported.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
Returns a VariantRDD.
Load reference sequences into a broadcastable ReferenceFile.
If the path name has a .2bit extension, loads a 2bit file. Else, uses loadContigFragments to load the reference as an RDD, which is then collected to the driver.
The path name to load reference sequences from. Globs/directories for 2bit format are not supported.
Maximum fragment length. Defaults to 10000L. Values greater than 1e9 should be avoided.
Returns a broadcastable ReferenceFile.
loadContigFragments
Load a sequence dictionary.
Loads path names ending in:
* .dict as HTSJDK sequence dictionary format,
* .genome as Bedtools genome file format, and
* .txt as UCSC Genome Browser chromInfo files.
Compressed files are supported through compression codecs configured in Hadoop, which by default include .gz and .bz2, but can include more.
The path name to load a sequence dictionary from.
Returns a sequence dictionary.
if the pathName file extension is not one of .dict, .genome, or .txt.
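The extension handling, including the documented failure mode, can be sketched as follows. `pickDictionaryLoader` and its result strings are illustrative, not ADAM API.

```scala
// Sketch of loadSequenceDictionary dispatch; unlike the loaders that fall
// back to Parquet + Avro, an unknown extension is rejected outright.
def pickDictionaryLoader(pathName: String): String =
  if (pathName.endsWith(".dict")) "HTSJDK sequence dictionary"
  else if (pathName.endsWith(".genome")) "Bedtools genome file"
  else if (pathName.endsWith(".txt")) "UCSC chromInfo"
  else throw new IllegalArgumentException(
    s"File extension must be one of .dict, .genome, or .txt: $pathName")
```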
Load unaligned alignment records from unpaired FASTQ into an AlignmentRecordRDD.
The path name to load unaligned alignment records from. Globs/directories are supported.
If true, sets the unaligned alignment record as the first read of the fragment. Defaults to false.
If true, sets the unaligned alignment record as the second read of the fragment. Defaults to false.
The optional record group name to associate to the unaligned alignment records. Defaults to None.
The validation stringency to use when validating unpaired FASTQ format. Defaults to ValidationStringency.STRICT.
Returns an unaligned AlignmentRecordRDD.
Load variants into a VariantRDD.
If the path name has a .vcf/.vcf.gz/.vcf.bgz extension, load as VCF format. Else, fall back to Parquet + Avro.
The path name to load variants from. Globs/directories are supported, although file extension must be present for VCF format.
An optional pushdown predicate to use when reading Parquet + Avro. Defaults to None.
An optional projection schema to use when reading Parquet + Avro. Defaults to None.
The validation stringency to use when validating VCF format. Defaults to ValidationStringency.STRICT.
Returns a VariantRDD.
loadParquetVariants
loadVcf
Load variant context records from VCF into a VariantContextRDD.
The path name to load VCF variant context records from. Globs/directories are supported.
The validation stringency to use when validating VCF format. Defaults to ValidationStringency.STRICT.
Returns a VariantContextRDD.
Load variant context records from VCF into a VariantContextRDD.
Only converts the core Genotype/Variant fields, and the fields set in the requested projection. Core variant fields include:
* Names (ID)
* Filters (FILTER)
Core genotype fields include:
* Allelic depth (AD)
* Read depth (DP)
* Min read depth (MIN_DP)
* Genotype quality (GQ)
* Genotype likelihoods (GL/PL)
* Strand bias components (SB)
* Phase info (PS, PQ)
The path name to load VCF variant context records from. Globs/directories are supported.
The info fields to include, in addition to the ID and FILTER attributes.
The format fields to include, in addition to the core fields listed above.
The validation stringency to use when validating VCF format. Defaults to ValidationStringency.STRICT.
Returns a VariantContextRDD.
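The projection semantics can be sketched as a set union of the core fields listed above with the requested extras. The field names come from the list above; the value and function names are illustrative, not ADAM API.

```scala
// Core FORMAT fields always converted (per the list above); the requested
// projection is added on top.
val coreFormatFields: Set[String] =
  Set("AD", "DP", "MIN_DP", "GQ", "GL", "PL", "SB", "PS", "PQ")

def formatFieldsToConvert(requested: Set[String]): Set[String] =
  coreFormatFields ++ requested
```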
The SparkContext to wrap.
The ADAMContext provides functions on top of a SparkContext for loading genomic data.
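The wrapping pattern can be sketched in miniature: an implicit conversion decorates a context object with load methods, so callers invoke them directly on the wrapped context. The class and method bodies below are simplified stand-ins, not ADAM's actual classes.

```scala
import scala.language.implicitConversions

// Miniature stand-in for SparkContext.
class MiniContext(val appName: String)

// Decorator in the style of ADAMContext: load methods live here,
// not on MiniContext itself.
class MiniAdamContext(val ctx: MiniContext) {
  def loadAlignments(pathName: String): String =
    s"${ctx.appName}: alignments from $pathName"
}

object MiniAdamContext {
  // The implicit conversion that makes ctx.loadAlignments(...) compile.
  implicit def contextToAdamContext(ctx: MiniContext): MiniAdamContext =
    new MiniAdamContext(ctx)
}
```

With `import MiniAdamContext._` in scope, `new MiniContext("demo").loadAlignments("sample.bam")` resolves through the implicit conversion, which mirrors how load calls are made against a plain SparkContext.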