Helper class to compute approximate quantile summary. This implementation is based on the algorithm proposed in the paper "Space-efficient Online Computation of Quantile Summaries" by Michael Greenwald and Sanjeev Khanna (http://dx.doi.org/10.1145/375663.375670).
To optimize for speed, it keeps an internal buffer of the most recently seen samples and inserts them into the summary only once the buffer crosses a certain size threshold. This keeps the cost of inserting a sample near constant, in contrast to the original algorithm.
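The buffering idea can be illustrated with a minimal sketch in plain Python. It is not the Greenwald-Khanna summary itself: the compression step is omitted, so every sample is kept and the answers are exact. The class name `BufferedSummary`, the `buffer_size` parameter and its default are hypothetical and chosen only for illustration.

```python
import math
from heapq import merge


class BufferedSummary:
    """Hypothetical illustration of buffered insertion (no compression step)."""

    def __init__(self, buffer_size=4):
        # Assumed threshold for illustration; not the library's actual value.
        self.buffer_size = buffer_size
        self.buffer = []          # recently seen samples, unsorted
        self.sorted_samples = []  # samples already merged, kept sorted

    def insert(self, x):
        # Appending to the buffer is cheap; the merge happens only at the threshold.
        self.buffer.append(x)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Sort the small buffer once and merge it into the sorted samples.
        self.sorted_samples = list(merge(self.sorted_samples, sorted(self.buffer)))
        self.buffer = []

    def query(self, p):
        # Flush pending samples, then return the element of rank ceil(p * N).
        self.flush()
        n = len(self.sorted_samples)
        if n == 0:
            raise ValueError("no samples inserted")
        rank = max(1, math.ceil(p * n))
        return self.sorted_samples[rank - 1]
```

Most calls to `insert` only append to a list; the sort and merge are paid once per full buffer, which is the effect the buffer is meant to achieve.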
Calculate the covariance of two numerical columns of a DataFrame.
The DataFrame
the column names
the covariance of the two columns.
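As a rough illustration of the quantity being returned, the following plain-Python sketch computes the covariance of two equal-length lists. It assumes the sample covariance (denominator N - 1), which the text above does not state explicitly; the function name `sample_covariance` is hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length numeric sequences (assumes N - 1 denominator)."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, divided by N - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
```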
Generate a table of frequencies for the elements of two columns.
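A frequency table of this kind can be sketched in plain Python by counting how often each pair of values occurs. The function name `crosstab` and the nested-dictionary output layout are assumptions made for this example, not the library's return type.

```python
from collections import Counter


def crosstab(col1, col2):
    """Count co-occurrences of values from two equal-length columns."""
    counts = Counter(zip(col1, col2))
    row_keys = sorted(set(col1))
    col_keys = sorted(set(col2))
    # One row per distinct value of col1, one entry per distinct value of col2.
    # Pairs that never occur are reported as 0.
    return {r: {c: counts[(r, c)] for c in col_keys} for r in row_keys}
```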
Calculates the approximate quantiles of multiple numerical columns of a DataFrame in one pass.
The result of this algorithm has the following deterministic bound:
If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
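The bound can be checked directly for a candidate answer. The sketch below is plain Python and assumes that rank(x) means the number of elements less than or equal to x; the function name `within_rank_bound` is hypothetical.

```python
import math


def within_rank_bound(data, x, p, err):
    """Check floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)."""
    n = len(data)
    # Assumption: rank(x) is the count of elements <= x.
    rank = sum(1 for v in data if v <= x)
    lower = math.floor((p - err) * n)
    upper = math.ceil((p + err) * n)
    return lower <= rank <= upper
```

For example, with N = 64, p = 0.5 and err = 0.125, any sample whose rank lies between 24 and 40 inclusive is an acceptable answer for the median.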
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
the DataFrame
numerical columns of the DataFrame
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
for each column, returns the requested approximations
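To make the shape of the result concrete, here is a plain-Python sketch of the exact case (the behaviour described for a precision of zero). It sorts each column instead of using the one-pass approximate summary, so it shows only the inputs and outputs, not the algorithm. The function name `exact_quantiles` and the choice of the element at rank ceil(p * N) are assumptions for illustration.

```python
import math


def exact_quantiles(columns, probabilities):
    """Return, for each column, the exact quantile for each probability."""
    results = []
    for values in columns:
        ordered = sorted(values)
        n = len(ordered)
        if n == 0:
            results.append([])  # nothing to report for an empty column
            continue
        column_result = []
        for p in probabilities:
            if not 0.0 <= p <= 1.0:
                raise ValueError("each probability must belong to [0, 1]")
            # Element of rank ceil(p * N); p = 0 maps to the minimum.
            rank = max(1, math.ceil(p * n))
            column_result.append(ordered[rank - 1])
        results.append(column_result)
    return results
```

The outer list has one entry per column and each inner list has one entry per requested probability, in the order given.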
Calculate the Pearson correlation coefficient for the given columns.
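For reference, the Pearson correlation coefficient of two columns can be sketched in plain Python as the sum of products of deviations divided by the square root of the product of the summed squared deviations. The function name `pearson_correlation` is hypothetical, and returning NaN for a constant column is an assumption of this sketch.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("columns must not be empty")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # Assumption: a constant column has undefined correlation, reported as NaN.
        return float("nan")
    return sxy / math.sqrt(sxx * syy)
```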