Class VariantContext

java.lang.Object
htsjdk.variant.variantcontext.VariantContext
All Implemented Interfaces:
HtsRecord, Locatable, Feature, Serializable

public class VariantContext extends Object implements HtsRecord, Feature, Serializable

High-level overview

The VariantContext object is a single general class system for representing genetic variation data composed of:
  • Allele: representing single genetic haplotypes (A, T, ATC, -) (note that null alleles are used here for illustration; see the Allele class for how to represent indels)
  • Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus
  • VariantContext: a class holding all segregating alleles at a locus as well as genotypes for multiple individuals containing alleles at that locus

The class system works by defining segregating alleles, creating a variant context representing the segregating information at a locus, and potentially creating and associating genotypes with individuals in the context.

All of the classes are highly validating -- call validate() if you modify them -- so you can rely on the self-consistency of the data once you have a VariantContext in hand. The system has a rich set of accessor and manipulator routines, as well as more complex static support routines in VariantContextUtils.

The VariantContext (and Genotype) objects are attributed (supporting addition of arbitrary key/value pairs) and filtered (can represent a variation that is viewed as suspect).

VariantContexts are dynamically typed, so whether a VariantContext is a SNP, Indel, or NoVariant depends on the properties of the alleles in the context. See the detailed documentation on the Type parameter below.

It's also easy to create subcontexts based on selected genotypes.

Working with Variant Contexts

By default, VariantContexts are immutable. In the rare circumstances where you need setter routines, you must create MutableVariantContexts and MutableGenotypes.

Some example data

 Allele A, Aref, T, Tref;
 Allele del, delRef, ATC, ATCref;

A [ref] / T at 10

 
 GenomeLoc snpLoc = GenomeLocParser.createGenomeLoc("chr1", 10, 10);

A / ATC [ref] from 20-22

 GenomeLoc delLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 22);

A [ref] / ATC immediately after 20

 GenomeLoc insLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 20);

Alleles

See the documentation in the Allele class itself

What are they?

Alleles can be either reference or non-reference

Examples of alleles used here:

   A = new Allele("A");
   Aref = new Allele("A", true);
   T = new Allele("T");
   ATC = new Allele("ATC");

Creating variant contexts

By hand

Here's an example of an A/T polymorphism with A being the reference:
 VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref, T));
 
If you want to create a non-variant site, just put in a single reference allele:
 VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref));
 
A deletion is just as easy:
 VariantContext vc = new VariantContext(name, delLoc, Arrays.asList(ATCref, del));
 
The only thing that distinguishes an insertion from a deletion is which allele is the reference. An insertion has a reference allele that is shorter than the non-reference allele, and vice versa for deletions.
 VariantContext vc = new VariantContext("name", insLoc, Arrays.asList(delRef, ATC));
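The length rule above can be sketched without htsjdk. The class and method names below (IndelKindSketch, classify) are hypothetical and not part of the library; alleles are modeled as plain base strings, with the empty string standing in for the null allele used in this overview.

```java
/** Hypothetical helper, not part of htsjdk: classifies a biallelic site by allele length. */
final class IndelKindSketch {
    enum Kind { NO_LENGTH_CHANGE, INSERTION, DELETION }

    /**
     * @param refBases bases of the reference allele ("" stands in for the null allele)
     * @param altBases bases of the single non-reference allele
     */
    static Kind classify(String refBases, String altBases) {
        if (refBases.length() < altBases.length()) {
            return Kind.INSERTION;    // reference is shorter: bases were added
        }
        if (refBases.length() > altBases.length()) {
            return Kind.DELETION;     // reference is longer: bases were removed
        }
        return Kind.NO_LENGTH_CHANGE; // same length, e.g. the A/T SNP above
    }

    public static void main(String[] args) {
        System.out.println(classify("", "ATC")); // delRef / ATC
        System.out.println(classify("ATC", "")); // ATCref / del
        System.out.println(classify("A", "T"));  // Aref / T
    }
}
```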
 

Converting rods and other data structures to VariantContexts

You can convert many common types into VariantContexts using the general function:
 VariantContextAdaptors.convertToVariantContext(name, myObject)
 
dbSNP and VCFs, for example, can be passed in as myObject and a VariantContext corresponding to that object will be returned. A null return value indicates that the type isn't yet supported. This is the best and easiest way to create contexts using RODs.

Working with genotypes

 List<Allele> alleles = Arrays.asList(Aref, T);
 Genotype g1 = new Genotype(Arrays.asList(Aref, Aref), "g1", 10);
 Genotype g2 = new Genotype(Arrays.asList(Aref, T), "g2", 10);
 Genotype g3 = new Genotype(Arrays.asList(T, T), "g3", 10);
 VariantContext vc = new VariantContext(snpLoc, alleles, Arrays.asList(g1, g2, g3));
 
At this point we have 3 genotypes in our context, g1-g3. You can access a good deal of information about the genotypes through the VariantContext:
 vc.hasGenotypes()
 vc.isMonomorphicInSamples()
 vc.isPolymorphicInSamples()
 vc.getSamples().size()

 vc.getGenotypes()
 vc.getGenotypes().get("g1")
 vc.hasGenotype("g1")

 vc.getCalledChrCount()
 vc.getCalledChrCount(Aref)
 vc.getCalledChrCount(T)
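
To make the chromosome-count queries concrete, here is a minimal stand-alone sketch that models the three example genotypes as lists of base strings. It does not use htsjdk; the class name, the "." no-call marker, and the map-based genotype model are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the counting queries above; not the htsjdk implementation. */
final class CalledChrCountSketch {
    /** Assumed marker for an undetermined (no-call) allele. */
    static final String NO_CALL = ".";

    /** Counts chromosomes carrying any called allele, skipping no-calls. */
    static int calledChrCount(Map<String, List<String>> genotypes) {
        int count = 0;
        for (List<String> alleles : genotypes.values()) {
            for (String allele : alleles) {
                if (!NO_CALL.equals(allele)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Counts chromosomes carrying one specific allele. */
    static int calledChrCount(Map<String, List<String>> genotypes, String target) {
        int count = 0;
        for (List<String> alleles : genotypes.values()) {
            for (String allele : alleles) {
                if (target.equals(allele)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** The three samples from the example: g1 = A/A, g2 = A/T, g3 = T/T. */
    static Map<String, List<String>> exampleGenotypes() {
        Map<String, List<String>> genotypes = new LinkedHashMap<>();
        genotypes.put("g1", Arrays.asList("A", "A"));
        genotypes.put("g2", Arrays.asList("A", "T"));
        genotypes.put("g3", Arrays.asList("T", "T"));
        return genotypes;
    }

    public static void main(String[] args) {
        Map<String, List<String>> genotypes = exampleGenotypes();
        System.out.println(calledChrCount(genotypes));      // all called chromosomes
        System.out.println(calledChrCount(genotypes, "A")); // chromosomes carrying A
        System.out.println(calledChrCount(genotypes, "T")); // chromosomes carrying T
    }
}
```

With three diploid samples and no no-calls, the total covers six chromosomes, split evenly between A and T.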
 

NO_CALL alleles

The system allows one to create Genotypes carrying special NO_CALL alleles that aren't present in the set of context alleles and that represent undetermined alleles in a genotype:
 Genotype g4 = new Genotype(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL), "NO_DATA_FOR_SAMPLE", 10);

Subcontexts

It's also very easy to get a subcontext based only on the data in a subset of the genotypes:
 VariantContext vc12 = vc.subContextFromGenotypes(Arrays.asList(g1,g2));
 VariantContext vc1 = vc.subContextFromGenotypes(Arrays.asList(g1));
 

Fully decoding

Currently, VariantContexts allow some fields, particularly those stored as generic attributes, to be of any type. For example, a field AB might naturally be a floating-point number, 0.51, but when it's read into a VC it's not decoded into its Java representation but left as the string "0.51". A fully decoded VariantContext is one where all values have been converted to their corresponding Java object types, based on the types declared in a VCFHeader. The fullyDecode(...) method takes a header object and creates a new fully decoded VariantContext where all fields are converted to their true Java representation. The VCBuilder can be told that all fields are fully decoded, in which case no work is done when asking for a fully decoded version of the VC.
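
The idea of decoding string-valued attributes against declared types can be sketched as follows. This is not how htsjdk implements fullyDecode; the DeclaredType enum, the map standing in for a VCFHeader, and all method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of "fully decoding": strings become typed Java objects. */
final class FullyDecodeSketch {
    /** Assumed stand-in for the type information a header would declare. */
    enum DeclaredType { INTEGER, FLOAT, STRING }

    /** Returns a new map; the input attributes are left untouched. */
    static Map<String, Object> fullyDecode(Map<String, Object> attributes,
                                           Map<String, DeclaredType> header) {
        Map<String, Object> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : attributes.entrySet()) {
            Object value = entry.getValue();
            DeclaredType type = header.get(entry.getKey());
            // Only undecoded (String) values with a declared type need work.
            if (value instanceof String && type != null) {
                String raw = (String) value;
                switch (type) {
                    case INTEGER: value = Integer.valueOf(raw); break;
                    case FLOAT:   value = Double.valueOf(raw);  break;
                    default:      break; // STRING stays as it is
                }
            }
            decoded.put(entry.getKey(), value);
        }
        return decoded;
    }

    static Map<String, Object> exampleAttributes() {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("AB", "0.51"); // read from file, still a string
        attributes.put("DP", "12");
        return attributes;
    }

    static Map<String, DeclaredType> exampleHeader() {
        Map<String, DeclaredType> header = new LinkedHashMap<>();
        header.put("AB", DeclaredType.FLOAT);
        header.put("DP", DeclaredType.INTEGER);
        return header;
    }

    public static void main(String[] args) {
        Map<String, Object> decoded = fullyDecode(exampleAttributes(), exampleHeader());
        for (Map.Entry<String, Object> entry : decoded.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().getClass().getSimpleName());
        }
    }
}
```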
  • Field Details

    • serialVersionUID

      public static final long serialVersionUID
    • commonInfo

      protected CommonInfo commonInfo
    • NO_LOG10_PERROR

      public static final double NO_LOG10_PERROR
    • PASSES_FILTERS

      public static final Set<String> PASSES_FILTERS
    • contig

      protected final String contig
      The location of this VariantContext
    • start

      protected final long start
    • stop

      protected final long stop
    • type

      protected VariantContext.Type type
      The type (cached for performance reasons) of this context
    • typeIgnoringNonRef

      protected VariantContext.Type typeIgnoringNonRef
      The type of this context, cached separately if ignoreNonRef is true
    • alleles

      protected final List<Allele> alleles
      A set of the alleles segregating in this context
    • genotypes

      protected GenotypesContext genotypes
      A mapping from sampleName -> genotype objects for all genotypes associated with this context
    • genotypeCounts

      protected int[] genotypeCounts
      Counts for each of the possible Genotype types in this context
    • NO_GENOTYPES

      public static final GenotypesContext NO_GENOTYPES
    • VALID_FILTER

      public static final Pattern VALID_FILTER
  • Constructor Details

    • VariantContext

      protected VariantContext(VariantContext other)
      Copy constructor
      Parameters:
      other - the VariantContext to copy
    • VariantContext

      protected VariantContext(String source, String ID, String contig, long start, long stop, Collection<Allele> alleles, GenotypesContext genotypes, double log10PError, Set<String> filters, Map<String,Object> attributes, boolean fullyDecoded, EnumSet<VariantContext.Validation> validationToPerform)
      The actual constructor. Protected access only
      Parameters:
      source - source
      contig - the contig
      start - the start base (one based)
      stop - the stop reference base (one based)
      alleles - alleles
      genotypes - genotypes map
      log10PError - qual
      filters - filters: use null for unfiltered and empty set for passes filters
      attributes - attributes
      validationToPerform - set of validation steps to take
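
      The filters convention above (null for unfiltered, empty set for passing) gives three distinct states. A minimal sketch of how they can be told apart follows; the method names mirror accessors listed on this page, but the bodies are one reading of the convention and are an assumption, not code copied from the library.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch of the three filter states; not the htsjdk implementation. */
final class FilterStateSketch {
    /** null means no filtering was ever run on the record. */
    static boolean filtersWereApplied(Set<String> filters) {
        return filters != null;
    }

    /** A record counts as filtered only if at least one filter name is attached. */
    static boolean isFiltered(Set<String> filters) {
        return filters != null && !filters.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> unfiltered = null;                         // filters never applied
        Set<String> passes = Collections.emptySet();           // applied, none failed
        Set<String> failed = Collections.singleton("LowQual"); // hypothetical filter name

        System.out.println(filtersWereApplied(unfiltered) + " " + isFiltered(unfiltered));
        System.out.println(filtersWereApplied(passes) + " " + isFiltered(passes));
        System.out.println(filtersWereApplied(failed) + " " + isFiltered(failed));
    }
}
```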
  • Method Details

    • calcVCFGenotypeKeys

      public List<String> calcVCFGenotypeKeys(VCFHeader header)
    • subContextFromSamples

      public VariantContext subContextFromSamples(Set<String> sampleNames, boolean rederiveAllelesFromGenotypes)
      This method subsets down to a set of samples. If rederiveAllelesFromGenotypes is true, it also reduces the alleles to just those in use by the samples; otherwise the full set of alleles in this VC is returned as the set of alleles in the subContext, even if some of those alleles aren't in the samples. WARNING: BE CAREFUL WITH rederiveAllelesFromGenotypes UNLESS YOU KNOW WHAT YOU ARE DOING
      Parameters:
      sampleNames - the sample names
      rederiveAllelesFromGenotypes - if true, reduces the alleles to just those in use by the samples; true should be the default
      Returns:
      new VariantContext subsetting to just the given samples
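
      The effect of rederiveAllelesFromGenotypes on the allele list can be illustrated with plain collections. Everything in this sketch is hypothetical: alleles are base strings, index 0 is assumed to hold the reference, and the reference is assumed to be kept even when no retained sample carries it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of allele rederivation when subsetting samples. */
final class SubContextAllelesSketch {
    static final List<String> ALLELES = Arrays.asList("A", "T", "C"); // "A" is the reference

    static List<String> allelesAfterSubsetting(List<String> alleles,
                                               Map<String, List<String>> genotypes,
                                               Set<String> sampleNames,
                                               boolean rederiveAllelesFromGenotypes) {
        if (!rederiveAllelesFromGenotypes) {
            return new ArrayList<>(alleles); // keep the full allele list
        }
        // Collect every allele carried by a retained sample.
        Set<String> used = new HashSet<>();
        for (String sample : sampleNames) {
            List<String> genotype = genotypes.get(sample);
            if (genotype != null) {
                used.addAll(genotype);
            }
        }
        // Preserve original order; index 0 (the reference) is always kept (assumption).
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < alleles.size(); i++) {
            if (i == 0 || used.contains(alleles.get(i))) {
                kept.add(alleles.get(i));
            }
        }
        return kept;
    }

    static Map<String, List<String>> exampleGenotypes() {
        Map<String, List<String>> genotypes = new LinkedHashMap<>();
        genotypes.put("g1", Arrays.asList("A", "A"));
        genotypes.put("g2", Arrays.asList("A", "T"));
        genotypes.put("g3", Arrays.asList("C", "C"));
        return genotypes;
    }

    static Set<String> samples(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Map<String, List<String>> genotypes = exampleGenotypes();
        System.out.println(allelesAfterSubsetting(ALLELES, genotypes, samples("g1", "g2"), true));
        System.out.println(allelesAfterSubsetting(ALLELES, genotypes, samples("g1", "g2"), false));
    }
}
```

      In this model, dropping g3 with rederivation enabled also drops C, the allele only g3 carried, which is why the flag changes what downstream code sees.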
    • subContextFromSamples

      public VariantContext subContextFromSamples(Set<String> sampleNames)
      Parameters:
      sampleNames -
    • subContextFromSample

      public VariantContext subContextFromSample(String sampleName)
    • getType

      public VariantContext.Type getType()
      Determines (if necessary) and returns the type of this variation by examining the alleles it contains.
      Returns:
      the type of this VariantContext
    • getType

      public VariantContext.Type getType(boolean ignoreNonRef)
      Determines (if necessary) and returns the type of this variation by examining the alleles it contains.
      Parameters:
      ignoreNonRef - If set to true, symbolic NON_REF alleles will not be considered for the type determination, which is required for handling GVCF files.
      Returns:
      the type of this VariantContext
    • isSNP

      public boolean isSNP()
      convenience method for SNPs
      Returns:
      true if this is a SNP, false otherwise
    • isVariant

      public boolean isVariant()
      convenience method for variants
      Returns:
      true if this is a variant allele, false if it's reference
    • isPointEvent

      public boolean isPointEvent()
      convenience method for point events
      Returns:
      true if this is a SNP or ref site, false if it's an indel or mixed event
    • isIndel

      public boolean isIndel()
      convenience method for indels
      Returns:
      true if this is an indel, false otherwise
    • isSimpleInsertion

      public boolean isSimpleInsertion()
      Returns:
      true if the alleles indicate a simple insertion (i.e., the reference allele is Null)
    • isSimpleDeletion

      public boolean isSimpleDeletion()
      Returns:
      true if the alleles indicate a simple deletion (i.e., a single alt allele that is Null)
    • isSimpleIndel

      public boolean isSimpleIndel()
      Returns:
      true if the alleles indicate a simple indel, false otherwise.
    • isComplexIndel

      public boolean isComplexIndel()
      Returns:
      true if the alleles indicate neither a simple deletion nor a simple insertion
    • isSymbolic

      public boolean isSymbolic()
    • isStructuralIndel

      public boolean isStructuralIndel()
    • isSymbolicOrSV

      public boolean isSymbolicOrSV()
      Returns:
      true if the variant is symbolic or a large indel
    • isMNP

      public boolean isMNP()
    • isMixed

      public boolean isMixed()
      convenience method for mixed events
      Returns:
      true if this is a mixed variation, false otherwise
    • hasID

      public boolean hasID()
    • emptyID

      public boolean emptyID()
    • getID

      public String getID()
    • getSource

      public String getSource()
    • getFiltersMaybeNull

      public Set<String> getFiltersMaybeNull()
    • getFilters

      public Set<String> getFilters()
    • isFiltered

      public boolean isFiltered()
    • isNotFiltered

      public boolean isNotFiltered()
    • filtersWereApplied

      public boolean filtersWereApplied()
    • hasLog10PError

      public boolean hasLog10PError()
    • getLog10PError

      public double getLog10PError()
    • getPhredScaledQual

      public double getPhredScaledQual()
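
      The relationship between the two quality accessors can be sketched under the usual Phred convention, Q = -10 * log10(P(error)). This page does not spell out the formula, so treat it as an assumption; the class and method names are hypothetical.

```java
/** Hypothetical sketch of the Phred convention; not copied from htsjdk. */
final class PhredQualSketch {
    /** Converts a log10 error probability (zero or negative) to a Phred-scaled quality. */
    static double phredScaledQual(double log10PError) {
        // Adding 0.0 turns a negative zero result into positive zero.
        return -10.0 * log10PError + 0.0;
    }

    /** Inverse conversion: Phred-scaled quality back to log10 error probability. */
    static double log10PError(double phredScaledQual) {
        return phredScaledQual / -10.0;
    }

    public static void main(String[] args) {
        // An error probability of 1 in 1000 is log10 = -3.
        System.out.println(phredScaledQual(-3.0));
        System.out.println(log10PError(30.0));
    }
}
```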
    • getAttributes

      public Map<String,Object> getAttributes()
    • hasAttribute

      public boolean hasAttribute(String key)
    • getAttribute

      public Object getAttribute(String key)
    • getAttribute

      public Object getAttribute(String key, Object defaultValue)
    • getAttributeAsString

      public String getAttributeAsString(String key, String defaultValue)
    • getAttributeAsInt

      public int getAttributeAsInt(String key, int defaultValue)
    • getAttributeAsDouble

      public double getAttributeAsDouble(String key, double defaultValue)
    • getAttributeAsBoolean

      public boolean getAttributeAsBoolean(String key, boolean defaultValue)
    • getAttributeAsList

      public List<Object> getAttributeAsList(String key)
      returns the value as an empty list if the key was not found, as a java.util.List if the value is a List or an Array, as a Collections.singletonList if there is only one value
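
      The three cases described for getAttributeAsList (missing key, list or array value, single value) can be sketched with a plain map. The class name and example keys are hypothetical, and the real method may differ in details such as whether it copies the list.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of normalizing an attribute value to a list. */
final class AttributeAsListSketch {
    @SuppressWarnings("unchecked")
    static List<Object> asList(Map<String, Object> attributes, String key) {
        Object value = attributes.get(key);
        if (value == null) {
            return Collections.emptyList();       // key not found
        }
        if (value instanceof List) {
            return (List<Object>) value;          // already a list
        }
        if (value.getClass().isArray()) {
            // Works for primitive and object arrays alike.
            int length = Array.getLength(value);
            List<Object> elements = new ArrayList<>(length);
            for (int i = 0; i < length; i++) {
                elements.add(Array.get(value, i));
            }
            return elements;
        }
        return Collections.singletonList(value);  // a single value
    }

    static Map<String, Object> exampleAttributes() {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("AC", new int[] {1, 2});
        attributes.put("AF", Arrays.asList(0.25, 0.5));
        attributes.put("DP", "12");
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = exampleAttributes();
        System.out.println(asList(attributes, "AC"));
        System.out.println(asList(attributes, "AF"));
        System.out.println(asList(attributes, "DP"));
        System.out.println(asList(attributes, "missing"));
    }
}
```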
    • getAttributeAsStringList

      public List<String> getAttributeAsStringList(String key, String defaultValue)
    • getAttributeAsIntList

      public List<Integer> getAttributeAsIntList(String key, int defaultValue)
    • getAttributeAsDoubleList

      public List<Double> getAttributeAsDoubleList(String key, double defaultValue)
    • getCommonInfo

      public CommonInfo getCommonInfo()
    • getReference

      public Allele getReference()
      Returns:
      the reference allele for this context
    • isBiallelic

      public boolean isBiallelic()
      Returns:
      true if the context is strictly bi-allelic
    • getNAlleles

      public int getNAlleles()
      Returns:
      The number of segregating alleles in this context
    • getMaxPloidy

      public int getMaxPloidy(int defaultPloidy)
      Returns the maximum ploidy of all samples in this VC, or the default if there are no genotypes. This function is caching, so it's only expensive on the first call.
      Parameters:
      defaultPloidy - the default ploidy, if all samples are no-called
      Returns:
      default, or the max ploidy
    • getAllele

      public Allele getAllele(String allele)
      Returns:
      The allele sharing the same bases as this String. A convenience method; better to use byte[]
    • getAllele

      public Allele getAllele(byte[] allele)
      Returns:
      The allele sharing the same bases as this byte[], or null if no such allele is present.
    • hasAllele

      public boolean hasAllele(Allele allele)
      Returns:
      True if this context contains Allele allele, or false otherwise
    • hasAllele

      public boolean hasAllele(Allele allele, boolean ignoreRefState)
    • hasAlternateAllele

      public boolean hasAlternateAllele(Allele allele)
    • hasAlternateAllele

      public boolean hasAlternateAllele(Allele allele, boolean ignoreRefState)
    • getAlleles

      public List<Allele> getAlleles()
      Gets the alleles. This method should return all of the alleles present at the location, including the reference allele. There are no constraints imposed on the ordering of alleles in the set. If the reference is not an allele in this context it will not be included.
      Returns:
      the set of alleles
    • getAlternateAlleles

      public List<Allele> getAlternateAlleles()
      Gets the alternate alleles. This method should return all the alleles present at the location, NOT including the reference allele. There are no constraints imposed on the ordering of alleles in the set.
      Returns:
      the set of alternate alleles
    • getIndelLengths

      public List<Integer> getIndelLengths()
      Gets the sizes of the alternate alleles if they are insertion/deletion events, and returns a list of their sizes
      Returns:
      a list of indel lengths (null if not of type indel or mixed)
    • getAlternateAllele

      public Allele getAlternateAllele(int i)
      Parameters:
      i - the ith allele (from 0 to n - 2 for a context with n alleles including a reference allele)
      Returns:
      the ith non-reference allele in this context
      Throws:
      IllegalArgumentException - if i is invalid
    • hasSameAllelesAs

      public boolean hasSameAllelesAs(VariantContext other)
      Parameters:
      other - VariantContext whose alleles to compare against
      Returns:
      true if this VariantContext has the same alleles (both ref and alts) as other, regardless of ordering. Otherwise returns false.
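
      An order-insensitive allele comparison of the kind described here can be sketched with sets. The model is an assumption: alleles are base strings and index 0 of each list holds the reference. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch of order-insensitive allele comparison. */
final class SameAllelesSketch {
    /** Alternate alleles are everything after index 0 (assumed to hold the reference). */
    private static List<String> alternates(List<String> alleles) {
        return alleles.subList(Math.min(1, alleles.size()), alleles.size());
    }

    static boolean hasSameAlternateAlleles(List<String> mine, List<String> theirs) {
        List<String> a = alternates(mine);
        List<String> b = alternates(theirs);
        // Same count and same members, regardless of order.
        return a.size() == b.size() && new HashSet<>(a).equals(new HashSet<>(b));
    }

    static boolean hasSameAlleles(List<String> mine, List<String> theirs) {
        if (mine.isEmpty() || theirs.isEmpty()) {
            return mine.isEmpty() && theirs.isEmpty();
        }
        // The reference must match, and so must the set of alternates.
        return mine.get(0).equals(theirs.get(0)) && hasSameAlternateAlleles(mine, theirs);
    }

    public static void main(String[] args) {
        System.out.println(hasSameAlleles(Arrays.asList("A", "T", "C"), Arrays.asList("A", "C", "T")));
        System.out.println(hasSameAlleles(Arrays.asList("A", "T"), Arrays.asList("T", "A")));
        System.out.println(hasSameAlternateAlleles(Arrays.asList("A", "T"), Arrays.asList("G", "T")));
    }
}
```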
    • hasSameAlternateAllelesAs

      public boolean hasSameAlternateAllelesAs(VariantContext other)
      Parameters:
      other - VariantContext whose alternate alleles to compare against
      Returns:
      true if this VariantContext has the same alternate alleles as other, regardless of ordering. Otherwise returns false.
    • getNSamples

      public int getNSamples()
      Returns:
      the number of samples in the context
    • hasGenotypes

      public boolean hasGenotypes()
      Returns:
      true if the context has associated genotypes
    • hasGenotypes

      public boolean hasGenotypes(Collection<String> sampleNames)
    • getGenotypes

      public GenotypesContext getGenotypes()
      Returns:
      set of all Genotypes associated with this context
    • getGenotypesOrderedByName

      public Iterable<Genotype> getGenotypesOrderedByName()
    • getGenotypesOrderedBy

      public Iterable<Genotype> getGenotypesOrderedBy(Iterable<String> sampleOrdering)
    • getGenotypes

      public GenotypesContext getGenotypes(String sampleName)
      Returns a map from sampleName -> Genotype for the genotype associated with sampleName. Returns a map for consistency with the multi-get function.
      Parameters:
      sampleName - the sample name
      Returns:
      mapping from sample name to genotype
      Throws:
      IllegalArgumentException - if sampleName isn't bound to a genotype
    • getGenotypes

      protected GenotypesContext getGenotypes(Collection<String> sampleNames)
      Returns a map from sampleName -> Genotype for each sampleName in sampleNames. Returns a map for consistency with the multi-get function. For testing convenience only
      Parameters:
      sampleNames - a unique list of sample names
      Returns:
      subsetting genotypes context
      Throws:
      IllegalArgumentException - if sampleName isn't bound to a genotype
    • getGenotypes

      public GenotypesContext getGenotypes(Set<String> sampleNames)
    • getSampleNames

      public Set<String> getSampleNames()
      Returns:
      the set of all sample names in this context, not ordered
    • getSampleNamesOrderedByName

      public List<String> getSampleNamesOrderedByName()
    • getGenotype

      public Genotype getGenotype(String sample)
      Parameters:
      sample - the sample name
      Returns:
      the Genotype associated with the given sample in this context or null if the sample is not in this context
    • hasGenotype

      public boolean hasGenotype(String sample)
    • getGenotype

      public Genotype getGenotype(int ith)
      Parameters:
      ith - the sample index
      Returns:
      the ith genotype in this context or null if there aren't that many genotypes
    • getCalledChrCount

      public int getCalledChrCount()
      Returns the number of chromosomes carrying any allele in the genotypes (i.e., excluding NO_CALLS)
      Returns:
      chromosome count
    • getCalledChrCount

      public int getCalledChrCount(Set<String> sampleIds)
      Returns the number of chromosomes carrying any allele in the genotypes (i.e., excluding NO_CALLS)
      Parameters:
      sampleIds - IDs of samples to take into account. If empty then all samples are included.
      Returns:
      chromosome count
    • getCalledChrCount

      public int getCalledChrCount(Allele a)
      Returns the number of chromosomes carrying allele a in the genotypes
      Parameters:
      a - allele
      Returns:
      chromosome count
    • getCalledChrCount

      public int getCalledChrCount(Allele a, Set<String> sampleIds)
      Returns the number of chromosomes carrying allele a in the genotypes
      Parameters:
      a - allele
      sampleIds - IDs of samples to take into account. If empty then all samples are included.
      Returns:
      chromosome count
    • isMonomorphicInSamples

      public boolean isMonomorphicInSamples()
      Genotype-specific functions -- are the genotypes monomorphic w.r.t. the alleles segregating at this site? That is, is the number of alternate alleles among all of the genotypes == 0?
      Returns:
      true if it's monomorphic
    • isPolymorphicInSamples

      public boolean isPolymorphicInSamples()
      Genotype-specific functions -- are the genotypes polymorphic w.r.t. the alleles segregating at this site? That is, is the number of alternate alleles among all of the genotypes > 0?
      Returns:
      true if it's polymorphic
    • getNoCallCount

      public int getNoCallCount()
      Genotype-specific functions -- how many no-calls are there in the genotypes?
      Returns:
      number of no calls
    • getHomRefCount

      public int getHomRefCount()
      Genotype-specific functions -- how many hom ref calls are there in the genotypes?
      Returns:
      number of hom ref calls
    • getHetCount

      public int getHetCount()
      Genotype-specific functions -- how many het calls are there in the genotypes?
      Returns:
      number of het calls
    • getHomVarCount

      public int getHomVarCount()
      Genotype-specific functions -- how many hom var calls are there in the genotypes?
      Returns:
      number of hom var calls
    • getMixedCount

      public int getMixedCount()
      Genotype-specific functions -- how many mixed calls are there in the genotypes?
      Returns:
      number of mixed calls
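
      The per-type counts returned by the methods above can be pictured with a small classifier. This sketch assumes that a "mixed" call is one where some alleles are called and others are no-calls; the enum, the "." no-call marker, and the string-based allele model are all hypothetical and not the htsjdk implementation.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tallying genotypes by call type. */
final class GenotypeTypeCountSketch {
    enum Type { NO_CALL, HOM_REF, HET, HOM_VAR, MIXED }

    /** Assumed marker for an undetermined allele. */
    static final String NO_CALL = ".";

    static Type typeOf(List<String> alleles, String ref) {
        int noCalls = 0;
        String first = null;
        boolean allSame = true;
        for (String allele : alleles) {
            if (NO_CALL.equals(allele)) {
                noCalls++;
                continue;
            }
            if (first == null) {
                first = allele;
            } else if (!first.equals(allele)) {
                allSame = false;
            }
        }
        if (noCalls == alleles.size()) return Type.NO_CALL; // nothing was called
        if (noCalls > 0) return Type.MIXED;                 // partially called (assumption)
        if (!allSame) return Type.HET;
        return first.equals(ref) ? Type.HOM_REF : Type.HOM_VAR;
    }

    static Map<Type, Integer> countTypes(List<List<String>> genotypes, String ref) {
        Map<Type, Integer> counts = new EnumMap<>(Type.class);
        for (Type type : Type.values()) {
            counts.put(type, 0);
        }
        for (List<String> genotype : genotypes) {
            counts.merge(typeOf(genotype, ref), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> genotypes = Arrays.asList(
                Arrays.asList("A", "A"),
                Arrays.asList("A", "T"),
                Arrays.asList("T", "T"),
                Arrays.asList(".", "."),
                Arrays.asList("A", "."));
        System.out.println(countTypes(genotypes, "A"));
    }
}
```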
    • extraStrictValidation

      public void extraStrictValidation(Allele reportedReference, Allele observedReference, Set<String> rsIDs)
      Run all extra-strict validation tests on a Variant Context object
      Parameters:
      reportedReference - the reported reference allele
      observedReference - the observed reference allele
      rsIDs - the true dbSNP IDs
    • validateReferenceBases

      public void validateReferenceBases(Allele reportedReference, Allele observedReference)
    • validateRSIDs

      public void validateRSIDs(Set<String> rsIDs)
    • validateAlternateAlleles

      public void validateAlternateAlleles()
    • validateChromosomeCounts

      public void validateChromosomeCounts()
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • toStringDecodeGenotypes

      public String toStringDecodeGenotypes()
    • toStringWithoutGenotypes

      public String toStringWithoutGenotypes()
    • fullyDecode

      public VariantContext fullyDecode(VCFHeader header, boolean lenientDecoding)
      Return a VC equivalent to this one but where all fields are fully decoded. See VariantContext document about fully decoded
      Parameters:
      header - containing types about all fields in this VC
      Returns:
      a fully decoded version of this VC
    • isFullyDecoded

      public boolean isFullyDecoded()
      See VariantContext document about fully decoded
      Returns:
      true if this is a fully decoded VC
    • getContig

      public String getContig()
      Description copied from interface: Locatable
      Gets the contig name for the contig this is mapped to. May return null if there is no unique mapping.
      Specified by:
      getContig in interface Locatable
      Returns:
      name of the contig this is mapped to, potentially null
    • getStart

      public int getStart()
      Returns 1-based inclusive start position of the variant.

      INDEL events usually start on the first unaltered reference base before the INDEL.

      Warning: be aware that the start position of the VariantContext is defined in terms of the start position specified in the underlying VCF file. VariantContexts representing the same biological event may have different start positions depending on the specifics of the VCF file they are derived from.

      Warning: Note also that the VCF spec allows 0 and N + 1 for the POS field for telomeric events, where N is the length of the chromosome. The "0" value returned should be interpreted as telomere, and does not violate the above "1-based" comment. Code consuming the returned start should be prepared for such out-of-the-ordinary values.

      Specified by:
      getStart in interface Locatable
      Returns:
      0 or greater.
    • getEnd

      public int getEnd()
      Specified by:
      getEnd in interface Locatable
      Returns:
      1-based closed end position of the Variant. If the END info field is specified that value is returned, otherwise the end is the start + reference allele length - 1. For VariantContexts with a single alternate allele, if that allele is an insertion, the end position will be on the reference base before the insertion event. If the single alt allele is a deletion, the end will be on the final deleted reference base.
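
      The end-position rule stated above (END if present, otherwise start + reference allele length - 1) works out as follows. The helper below is a hypothetical stand-alone sketch, with the reference allele as a base string and the END info field as a nullable Integer.

```java
/** Hypothetical sketch of the end-position rule; not the htsjdk implementation. */
final class EndPositionSketch {
    /**
     * @param start          1-based inclusive start
     * @param referenceBases bases of the reference allele
     * @param endInfoField   value of the END info field, or null if absent
     */
    static int end(int start, String referenceBases, Integer endInfoField) {
        if (endInfoField != null) {
            return endInfoField; // an explicit END wins
        }
        return start + referenceBases.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(end(10, "A", null));    // SNP: end equals start
        System.out.println(end(20, "TATC", null)); // deletion of ATC: end is the last deleted base
        System.out.println(end(20, "T", null));    // insertion after T: end is the base before the insertion
        System.out.println(end(100, "N", 500));    // END info field present
    }
}
```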
    • isReferenceBlock

      public boolean isReferenceBlock()
      Returns:
      true if the variant context is a reference block
    • hasSymbolicAlleles

      public boolean hasSymbolicAlleles()
    • hasSymbolicAlleles

      public static boolean hasSymbolicAlleles(List<Allele> alleles)
    • getAltAlleleWithHighestAlleleCount

      public Allele getAltAlleleWithHighestAlleleCount()
    • getAlleleIndex

      public int getAlleleIndex(Allele allele)
      Lookup the index of allele in this variant context
      Parameters:
      allele - the allele whose index we want to get
      Returns:
      the index of the allele into getAlleles(), or -1 if it cannot be found
    • getAlleleIndices

      public List<Integer> getAlleleIndices(Collection<Allele> alleles)
      Return the allele index (see getAlleleIndex) for each allele in alleles
      Parameters:
      alleles - the alleles we want to look up
      Returns:
      a list of indices for each allele, in order
    • getGLIndecesOfAlternateAllele

      @Deprecated public int[] getGLIndecesOfAlternateAllele(Allele targetAllele)
      Deprecated.
    • getGLIndicesOfAlternateAllele

      public int[] getGLIndicesOfAlternateAllele(Allele targetAllele)
    • getStructuralVariantType

      public StructuralVariantType getStructuralVariantType()
      Search for the INFO=SVTYPE and return the type of Structural Variant
      Returns:
      the StructuralVariantType, or null if there is no property SVTYPE