An expression that allows users to aggregate over all array elements at a specific index in an array column. For example, this expression can be used to compute per-sample summary statistics from a genotypes column.
The user must provide the following arguments:
- The array for aggregation
- The initialValue for each element in the per-index buffer
- An update function to update the buffer with a new element
- A merge function to combine two buffers
The user may optionally provide an evaluate function. If it's not provided, the identity function is used.
Example usage to calculate average depth across all sites for a sample:

aggregate_by_index(
  genotypes,
  named_struct('sum', 0l, 'count', 0l),
  (buf, genotype) -> named_struct('sum', buf.sum + genotype.depth, 'count', buf.count + 1),
  (buf1, buf2) -> named_struct('sum', buf1.sum + buf2.sum, 'count', buf1.count + buf2.count),
  buf -> buf.sum / buf.count)
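The fold/merge/evaluate semantics can be modeled in plain Python. The function names and the simplified genotype representation below are illustrative sketches of the behavior described above, not Glow's actual API:

```python
# Pure-Python model of aggregate-by-index semantics (illustrative only).

def fold_partition(rows, initial_value, update):
    """Fold each row's array element-wise into a per-index buffer."""
    buffers = []
    for row in rows:
        for i, element in enumerate(row):
            if i >= len(buffers):
                buffers.append(initial_value())
            buffers[i] = update(buffers[i], element)
    return buffers

def aggregate_by_index(partitions, initial_value, update, merge,
                       evaluate=lambda buf: buf):  # identity if not provided
    partials = [fold_partition(rows, initial_value, update) for rows in partitions]
    merged = partials[0]
    for other in partials[1:]:
        merged = [merge(a, b) for a, b in zip(merged, other)]
    return [evaluate(buf) for buf in merged]

# Average depth per sample across sites, mirroring the SQL example.
partitions = [
    [[{"depth": 10}, {"depth": 20}]],  # partition 1: one site, two samples
    [[{"depth": 30}, {"depth": 40}]],  # partition 2: one site, two samples
]
mean_depths = aggregate_by_index(
    partitions,
    lambda: {"sum": 0, "count": 0},
    lambda buf, g: {"sum": buf["sum"] + g["depth"], "count": buf["count"] + 1},
    lambda b1, b2: {"sum": b1["sum"] + b2["sum"], "count": b1["count"] + b2["count"]},
    lambda buf: buf["sum"] / buf["count"])
print(mean_depths)  # [20.0, 30.0]
```

The merge function is what allows the per-index buffers from different partitions to be combined, which is why it is required alongside the update function.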
Computes summary statistics per-sample in a genomic cohort. These statistics include the call rate and the number of different types of variants.
The return type is an array of summary statistics. If sample ids are included in the input schema, they'll be propagated to the results.
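As an illustration of one such statistic, the per-sample call rate can be sketched in plain Python. The genotype encoding here (a per-site list of per-sample calls arrays, with -1 marking a missing call) is an assumption for illustration:

```python
# Illustrative sketch: call rate = fraction of sites with no missing calls.

def call_rates(sites):
    """Per-sample call rate across all sites; -1 encodes a missing call."""
    n_samples = len(sites[0])
    called = [0] * n_samples
    for site in sites:
        for i, calls in enumerate(site):
            if all(c != -1 for c in calls):
                called[i] += 1
    return [c / len(sites) for c in called]

sites = [
    [[0, 1], [1, 1]],    # site 1: both samples fully called
    [[-1, 1], [0, 0]],   # site 2: sample 0 has a missing call
]
print(call_rates(sites))  # [0.5, 1.0]
```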
Context that can be computed once for all variant sites for a linear regression GWAS analysis.
Expands all the fields of a potentially unnamed struct.
Explodes a matrix by row. Each row of the input matrix will be output as an array of doubles.
If the input expression is null or has 0 rows, the output will be empty.
The matrix to explode. May be dense or sparse.
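A minimal sketch of this row-wise explosion, modeling a dense matrix as a list of rows; the null and empty-input behavior mirrors the description above:

```python
# Sketch only: a dense matrix is modeled as a list of numeric rows.

def explode_matrix(matrix):
    # A null input or a matrix with 0 rows produces no output rows.
    if matrix is None:
        return []
    return [[float(x) for x in row] for row in matrix]

rows = explode_matrix([[1, 2], [3, 4]])
print(rows)  # [[1.0, 2.0], [3.0, 4.0]]
```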
Converts a complex genotype array into an array of ints. Each element is the sum of the calls array for the sample at that position, or -1 if any of that sample's calls are missing.
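The conversion can be sketched as follows; the dict-based genotype representation and function names are assumptions for illustration:

```python
# Illustrative sketch of the calls-to-int conversion described above.

def genotype_to_state(calls):
    """Sum of the calls, or -1 if any call is missing (encoded as -1)."""
    return -1 if any(c == -1 for c in calls) else sum(calls)

def genotype_states(genotypes):
    return [genotype_to_state(g["calls"]) for g in genotypes]

states = genotype_states([{"calls": [0, 1]}, {"calls": [1, 1]}, {"calls": [-1, 1]}])
print(states)  # [1, 2, -1]
```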
Converts an array of probabilities (most likely the genotype probabilities from a BGEN file) into hard calls. The input probabilities are assumed to be diploid.
If the input probabilities are phased, each haplotype is called separately by finding the maximum probability greater than the threshold (0.9 by default, matching plink). If no probability is greater than the threshold, the call is -1 (missing).
If the input probabilities are unphased, the probabilities refer to the complete genotype. In this case, we find the maximum probability greater than the threshold and then convert that value to a genotype call.
If any of the required parameters (probabilities, numAlts, phased) are null, the expression returns null.
The probabilities to convert to hard calls. The algorithm does not check that they sum to 1. If the probabilities are unphased, they are assumed to correspond to the genotypes in colex order, which is standard for both BGEN and VCF files.
The number of alternate alleles at this site.
Whether the probabilities are phased (per haplotype) or unphased (whole genotype).
Calls are only generated if at least one probability is above this threshold.
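A hedged sketch of this logic, limited to a diploid, biallelic site (numAlts = 1). The per-haplotype probability layout for phased data and the colex genotype order [0/0, 0/1, 1/1] for unphased data are assumptions drawn from the description above:

```python
# Simplified hard-call sketch for numAlts = 1 (biallelic, diploid).

def hard_calls(probabilities, phased, threshold=0.9):
    if phased:
        # Assumed layout: per-haplotype probabilities over [ref, alt],
        # i.e. [hap1_ref, hap1_alt, hap2_ref, hap2_alt]. Each haplotype
        # is called independently.
        calls = []
        for hap in (probabilities[:2], probabilities[2:]):
            best = max(range(len(hap)), key=lambda i: hap[i])
            calls.append(best if hap[best] > threshold else -1)
        return calls
    # Unphased: probabilities cover whole genotypes in colex order.
    genotypes = [[0, 0], [0, 1], [1, 1]]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return genotypes[best] if probabilities[best] > threshold else [-1, -1]

print(hard_calls([0.05, 0.95, 0.0], phased=False))  # [0, 1]
print(hard_calls([0.2, 0.5, 0.3], phased=False))    # [-1, -1]
```

Note that the sketch does not check that the probabilities sum to 1, matching the behavior described for the probabilities parameter.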
Performs lift over from the specified 0-start, half-open interval (contigName, start, end) on the reference sequence to a query sequence, using the specified chain file and minimum fraction of bases that must remap.
We assume the chain file is a constant value so that the LiftOver object can be reused between rows.
If any of the required parameters (contigName, start, end) are null, the expression returns null. If minMatchRatioOpt contains null, the expression returns null; if it is empty, we use 0.95 to match LiftOver.DEFAULT_LIFTOVER_MINMATCH.
Chromosome name on the reference sequence.
Start position (0-start) on the reference sequence.
End position on the reference sequence.
UCSC chain format file mapping blocks from the reference sequence to the query sequence.
The minimum fraction of bases that must remap to lift over successfully.
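The minimum-match criterion can be illustrated with a small sketch. The (start, end) blocks below are a hypothetical stand-in for the remappable alignment blocks of a parsed chain file, which is a simplification of the real UCSC chain format:

```python
# Illustrative sketch of the minimum-match check for lift over.

def passes_min_match(start, end, blocks, min_match_ratio=0.95):
    """0-start, half-open interval; blocks are (start, end) spans that remap."""
    total = end - start  # assumes a non-empty interval
    remapped = sum(max(0, min(end, b_end) - max(start, b_start))
                   for b_start, b_end in blocks)
    return remapped / total >= min_match_ratio

# 95 of 100 bases fall inside a remappable block: exactly at the 0.95 default.
print(passes_min_match(100, 200, [(0, 195)]))  # True
print(passes_min_match(100, 200, [(0, 150)]))  # False
```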
Base trait for logistic regression tests
Statistics returned upon performing a logit test.
Log-odds associated with the genotype, NaN if the null/full model fit failed
Odds ratio associated with the genotype, NaN if the null/full model fit failed
Wald 95% confidence interval of the odds ratio, NaN if the null/full model fit failed
P-value for the specified test, NaN if the null/full model fit failed. Determined using the profile likelihood method.
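For context, the odds ratio and its Wald 95% confidence interval follow directly from the fitted log-odds (beta) and its standard error. This arithmetic sketch assumes those two values come from a fitted model; it is not the library's implementation:

```python
import math

def odds_ratio_and_wald_ci(beta, standard_error, z=1.96):
    # Odds ratio is exp(log-odds); the Wald CI exponentiates beta +/- z * SE.
    odds_ratio = math.exp(beta)
    ci = (math.exp(beta - z * standard_error),
          math.exp(beta + z * standard_error))
    return odds_ratio, ci

# beta = 0 corresponds to an odds ratio of 1 (no association).
print(odds_ratio_and_wald_ci(0.0, 0.5)[0])  # 1.0
```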
Substitutes the missing values of an array using the mean of the non-missing values. Values that are NaN, null, or equal to the missing value parameter are considered missing; they are excluded from the mean computation and replaced with that mean. If all values are missing, they are substituted with the missing value.
If the missing value is not provided, the parameter defaults to -1.
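The described behavior can be sketched in plain Python (the function name is illustrative):

```python
import math

# Illustrative sketch of mean substitution as described above.

def mean_substitute(values, missing_value=-1):
    def is_missing(v):
        return (v is None
                or (isinstance(v, float) and math.isnan(v))
                or v == missing_value)
    present = [v for v in values if not is_missing(v)]
    if not present:
        # All values missing: substitute with the missing value itself.
        return [missing_value] * len(values)
    mean = sum(present) / len(present)
    return [mean if is_missing(v) else v for v in values]

print(mean_substitute([1.0, None, 3.0, -1]))  # [1.0, 2.0, 3.0, 2.0]
```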
The state necessary for maintaining moment-based aggregations, currently supported only up to m2.
This functionality is based on the org.apache.spark.sql.catalyst.expressions.aggregate.CentralMomentAgg implementation in Spark and is used to compute summary statistics both on arrays and across many rows for sample-based aggregations.
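A simplified model of such a moment buffer, with the element-wise update and the buffer merge used when combining partial aggregations. This mirrors the standard streaming-moments recurrences rather than Spark's actual code:

```python
# Simplified count/mean/m2 moment buffer (Welford update, pairwise merge).

class MomentState:
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Incorporate one new element into the running moments.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial buffers (e.g. from different partitions).
        if other.count == 0:
            return self
        delta = other.mean - self.mean
        total = self.count + other.count
        self.mean += delta * other.count / total
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.count = total
        return self

    def variance(self):
        # Sample variance from the second central moment.
        return self.m2 / (self.count - 1)
```

Because merge is associative, partial buffers can be combined in any order, which is what makes the aggregation usable in a distributed setting.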
Computes summary statistics (count, min, max, mean, stdev) for a numeric genotype field for each sample in a cohort. The field is determined by the provided StructField. If the field does not exist in the genotype struct, an analysis error will be thrown.
The return type is an array of summary statistics. If sample ids are included in the input, they'll be propagated to the results.
A hack to make Spark SQL recognize AggregateByIndex as an aggregate expression.
See io.projectglow.sql.optimizer.ResolveAggregateFunctionsRule for details.
Some of the logic used for logistic regression is from the Hail project. The Hail project can be found on Github: https://github.com/hail-is/hail. The Hail project is under an MIT license: https://github.com/hail-is/hail/blob/master/LICENSE.
Contains implementations of QC functions. These implementations are called during both whole-stage codegen and interpreted execution.
The functions are exposed to the user as Catalyst expressions.
Implementations of utility functions for transforming variant representations. These implementations are called during both whole-stage codegen and interpreted execution.
The functions are exposed to the user as Catalyst expressions.
Expression that adds fields to an existing struct.
At optimization time, this expression is rewritten as the creation of a new struct with all the fields of the existing struct as well as the new fields. See io.projectglow.sql.optimizer.ReplaceExpressionsRule for more details.