com.spotify.featran.transformers

Transform numerical features to binary features.

Feature values greater than threshold are binarized to 1.0; values equal to or less than
threshold are binarized to 0.0.

Missing values are binarized to 0.0.
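
The rule above can be sketched in plain Scala. This is a minimal illustration and not featran's
API: the object BinarizerSketch, the helper binarize and the default threshold of 0.0 are
assumptions made for the example.

```scala
object BinarizerSketch {
  // Hypothetical helper: values strictly greater than the threshold map to 1.0,
  // everything else (including a missing value) maps to 0.0.
  def binarize(value: Option[Double], threshold: Double = 0.0): Double =
    value match {
      case Some(x) if x > threshold => 1.0
      case _                        => 0.0
    }
}
```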

Transform a column of continuous features to n columns of feature buckets.

With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range
[x,y) except the last bucket, which also includes y. Splits should be strictly increasing.
Values at -inf, inf must be explicitly provided to cover all double values; otherwise, a
FeatureRejection.OutOfBound rejection will be reported for values outside the splits
specified. Two examples of splits are
Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).

Note that if you have no idea of the upper and lower bounds of the targeted column, you should
add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to
prevent a potential FeatureRejection.OutOfBound rejection.

Note also that the splits you provide must be in strictly increasing order, i.e.
s0 < s1 < s2 < ... < sn.

Missing values are transformed to zero vectors.
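
As a rough sketch of the bucket rule described above, the following plain Scala maps a value to
a one-hot bucket vector. The names BucketizerSketch and bucketize are hypothetical, and returning
None stands in for a FeatureRejection.OutOfBound rejection; featran's own implementation may
differ.

```scala
object BucketizerSketch {
  // Hypothetical helper: one-hot bucket vector, or None when x is outside the splits.
  def bucketize(x: Double, splits: Array[Double]): Option[Array[Double]] = {
    require(splits.length >= 2, "need at least two splits")
    require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")
    val n = splits.length - 1 // n + 1 splits give n buckets
    val idx =
      if (x == splits.last) n - 1 // the last bucket also includes its upper bound
      else splits.indices.dropRight(1).indexWhere(i => splits(i) <= x && x < splits(i + 1))
    if (idx < 0) None
    else Some(Array.tabulate(n)(i => if (i == idx) 1.0 else 0.0))
  }
}
```

With splits Array(0.0, 1.0, 2.0), the value 2.0 lands in the last bucket, while 2.5 and -1.0
fall outside the splits.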

Transform a collection of categorical features to binary columns, with at most N one-values.
Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU
and memory overhead.

Missing values are transformed to zero vectors.

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.

Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:

{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2         17.9934%
                4         10.5686%
                8          5.7236%
               16          3.0019%
               32          1.5313%
               64          0.7864%
              128          0.3920%
              256          0.1998%
              512          0.0975%
             1024          0.0478%
             2048          0.0236%
             4096          0.0071%
}}}
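
The hashing idea can be illustrated with the Scala standard library. This is a sketch under
assumptions: it uses scala.util.hashing.MurmurHash3.stringHash and a non-negative remainder to
pick the bucket, which may not match the hash variant or seed that featran uses, and the names
HashNHotSketch, bucket and encode are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotSketch {
  // Assumption: bucket index is the non-negative remainder of a 32-bit MurmurHash3 string hash.
  def bucket(label: String, hashBucketSize: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(label), hashBucketSize)

  // Sets the bucket of every label to 1.0; colliding labels share a column.
  def encode(labels: Seq[String], hashBucketSize: Int): Array[Double] = {
    val out = Array.fill(hashBucketSize)(0.0)
    labels.foreach(l => out(bucket(l, hashBucketSize)) = 1.0)
    out
  }
}
```

No vocabulary has to be collected or stored, which is where the CPU and memory savings come
from; the cost is that distinct labels can collide in the same bucket.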

Transform a collection of weighted categorical features to columns of weight sums, with at
most N values. Similar to NHotWeightedEncoder but uses MurmurHash3 to hash features into
buckets to reduce CPU and memory overhead.

Weights of the same labels in a row are summed, rather than set to 1.0 as in the normal
NHotEncoder.

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.

Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:

{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2         17.9934%
                4         10.5686%
                8          5.7236%
               16          3.0019%
               32          1.5313%
               64          0.7864%
              128          0.3920%
              256          0.1998%
              512          0.0975%
             1024          0.0478%
             2048          0.0236%
             4096          0.0071%
}}}

Transform a collection of categorical features to binary columns, with at most a single
one-value. Similar to OneHotEncoder but uses MurmurHash3 to hash features into buckets to
reduce CPU and memory overhead.

Missing values are transformed to zero vectors.

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.

Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:

{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2         17.9934%
                4         10.5686%
                8          5.7236%
               16          3.0019%
               32          1.5313%
               64          0.7864%
              128          0.3920%
              256          0.1998%
              512          0.0975%
             1024          0.0478%
             2048          0.0236%
             4096          0.0071%
}}}

Transform a collection of categorical features to 2 columns, one for rank and one for count.
Only the top heavyHittersCount items are tracked, with 1.0 being the most frequent rank, 2.0
the second most, etc. All other items are transformed to [0.0, 0.0].

Ranks and frequencies are estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, this estimate is within eps * N of the true frequency (i.e.,
true frequency <= estimate <= true frequency + eps * N), where N is the total size of the
input collection.

Missing values are transformed to [0.0, 0.0].
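
The rank and count output can be sketched as follows. Exact counting is used here in place of
Algebird's SketchMap, so the eps and delta guarantees above do not apply to this example; the
names HeavyHittersSketch, fit and transform, and the tie-break by label, are assumptions.

```scala
object HeavyHittersSketch {
  // Exact counts stand in for SketchMap estimates (assumption for illustration).
  // Returns label -> (rank, count) for the top heavyHittersCount labels.
  def fit(items: Seq[String], heavyHittersCount: Int): Map[String, (Double, Double)] =
    items
      .groupBy(identity)
      .toSeq
      .map { case (k, v) => (k, v.size) }
      .sortBy { case (k, c) => (-c, k) } // most frequent first, ties broken by label
      .take(heavyHittersCount)
      .zipWithIndex
      .map { case ((k, c), i) => k -> ((i + 1).toDouble, c.toDouble) }
      .toMap

  // Untracked or missing items become [0.0, 0.0].
  def transform(item: Option[String], summary: Map[String, (Double, Double)]): Array[Double] =
    item.flatMap(summary.get) match {
      case Some((rank, count)) => Array(rank, count)
      case None                => Array(0.0, 0.0)
    }
}
```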

Reject values if they fall outside of either factor * IQR below the first quartile or
factor * IQR above the third quartile.

IQR, or interquartile range, is the range between the first and the third quartiles.

The bin ranges are chosen using Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.

All values are transformed to zeros.

Values factor * IQR below the first quartile or factor * IQR above the third quartile are
rejected as FeatureRejection.Outlier.

When using aggregated feature summary from a previous session, values outside of previously
seen [min, max] are also reported as FeatureRejection.Outlier rejections.
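
A small sketch of the IQR rule, assuming exact quartiles computed by linear interpolation rather
than the QTree approximation. The names IQRSketch, quantile and isOutlier, and the default factor
of 1.5, are assumptions for the example.

```scala
object IQRSketch {
  // Linear-interpolation quantile over sorted data; an exact stand-in for QTree.
  def quantile(sorted: IndexedSeq[Double], q: Double): Double = {
    val pos = q * (sorted.length - 1)
    val lo = math.floor(pos).toInt
    val hi = math.ceil(pos).toInt
    sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
  }

  // True when x lies more than factor * IQR below Q1 or above Q3.
  def isOutlier(x: Double, data: Seq[Double], factor: Double = 1.5): Boolean = {
    val s = data.sortWith(_ < _).toIndexedSeq
    val q1 = quantile(s, 0.25)
    val q3 = quantile(s, 0.75)
    val iqr = q3 - q1
    x < q1 - factor * iqr || x > q3 + factor * iqr
  }
}
```

For the data 1.0 to 9.0 this gives Q1 = 3.0, Q3 = 7.0 and IQR = 4.0, so with factor 1.5 the
accepted range is [-3.0, 13.0].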

Transform features by passing them through.

Missing values are transformed to 0.0.

Transform an optional 1D feature to an indicator variable indicating presence.

Missing values are mapped to 0.0. Present values are mapped to 1.0.

Transform a column of continuous labelled features to n columns of binned categorical features.
The optimum number of bins is computed using Minimum Description Length (MDL), which is an
entropy measurement between the values and the targets.

The transformer expects an MDLRecord where the first field is a label and the second field
is the scalar that will be transformed into buckets.

MDL is an iterative algorithm so all of the data needed to compute the buckets will be pulled
into memory. If you run into memory issues the sampleRate parameter should be lowered.

References:

Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of continuous-valued attributes
for classification learning."
https://github.com/sramirez/spark-MDLP-discretization

Labelled feature for MDL.

Transform features by rescaling each feature to range [-1, 1] by dividing by the maximum
absolute value in each feature.

Missing values are transformed to 0.0.

When using aggregated feature summary from a previous session, out of bound values are
truncated to -1.0 or 1.0 and FeatureRejection.OutOfBound rejections are reported.
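
A minimal sketch of this scaling, with hypothetical names MaxAbsScalerSketch, fit and transform.
Rejection reporting is left out, and the maximum absolute value is assumed to be non-zero.

```scala
object MaxAbsScalerSketch {
  // Summary: the maximum absolute value seen in the data.
  def fit(data: Seq[Double]): Double =
    data.foldLeft(0.0)((m, v) => math.max(m, math.abs(v)))

  // Missing values become 0.0; values beyond the fitted range are truncated to -1.0 or 1.0.
  def transform(x: Option[Double], maxAbs: Double): Double =
    x match {
      case None    => 0.0
      case Some(v) => math.max(-1.0, math.min(1.0, v / maxAbs))
    }
}
```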

Transform features by rescaling each feature to a specific range [`min`, `max`] (default
[0, 1]).

Missing values are transformed to min.

When using aggregated feature summary from a previous session, out of bound values are
truncated to min or max and FeatureRejection.OutOfBound rejections are reported.
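
The rescaling can be sketched as below. MinMaxScalerSketch, transform, dataMin and dataMax are
hypothetical names; dataMin and dataMax stand for the bounds in the aggregated summary and are
assumed to differ. Rejection reporting is omitted.

```scala
object MinMaxScalerSketch {
  // Maps [dataMin, dataMax] linearly onto [min, max]; missing values become min.
  def transform(
    x: Option[Double],
    dataMin: Double,
    dataMax: Double,
    min: Double = 0.0,
    max: Double = 1.0
  ): Double =
    x match {
      case None => min
      case Some(v) =>
        // out of bound values are truncated to the fitted range first
        val clipped = math.max(dataMin, math.min(dataMax, v))
        min + (clipped - dataMin) / (dataMax - dataMin) * (max - min)
    }
}
```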

Transform a collection of sentences, where each row is a Seq[String] of the words / tokens,
into a collection containing all the n-grams that can be constructed from each row. The feature
representation is an n-hot encoding (see NHotEncoder) constructed from an expanded
vocabulary of all of the generated n-grams.

N-grams are generated based on a specified range of low to high (inclusive) and are joined
by the given sep (default is " "). For example, with low = 2, high = 3 and sep = "", row
["a", "b", "c", "d", "e"] would produce ["ab", "bc", "cd", "de", "abc", "bcd", "cde"].

As with NHotEncoder, missing values are transformed to [0.0, 0.0, ...].
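
The n-gram generation step can be sketched in plain Scala. NGramsSketch and ngrams are
hypothetical names; only the expansion of a row is shown, not the subsequent n-hot encoding.

```scala
object NGramsSketch {
  // All n-grams for n in [low, high], shorter n first, joined by sep.
  def ngrams(row: Seq[String], low: Int, high: Int, sep: String = " "): Seq[String] =
    (low to high).flatMap { n =>
      // sliding yields one short window when the row has fewer than n tokens; drop it
      row.sliding(n).filter(_.length == n).map(_.mkString(sep))
    }
}
```

Calling NGramsSketch.ngrams(Seq("a", "b", "c", "d", "e"), 2, 3, "") reproduces the example
above: "ab", "bc", "cd", "de", "abc", "bcd", "cde".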

Transform a collection of categorical features to binary columns, with at most N one-values.

Missing values are either transformed to zero vectors or encoded as a missing value.

When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.
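
A simplified sketch of n-hot encoding with a fitted vocabulary. The names NHotSketch, fit and
transform are hypothetical, the sorted column order is an assumption, and the encodeMissingValue
option is left out; unseen labels are returned alongside the vector in place of rejections.

```scala
object NHotSketch {
  // Assumption: columns follow the sorted order of labels seen while fitting.
  def fit(rows: Seq[Seq[String]]): Map[String, Int] =
    rows.flatten.distinct.sorted.zipWithIndex.toMap

  // Returns the encoded row plus the labels that were not seen during fitting.
  def transform(row: Seq[String], vocab: Map[String, Int]): (Array[Double], Seq[String]) = {
    val out = Array.fill(vocab.size)(0.0)
    val (seen, unseen) = row.partition(vocab.contains)
    seen.foreach(l => out(vocab(l)) = 1.0)
    (out, unseen)
  }
}
```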

Transform a collection of weighted categorical features to columns of weight sums, with at most
N values.

Weights of the same labels in a row are summed, rather than set to 1.0 as in the normal
NHotEncoder.

Missing values are either transformed to zero vectors or encoded as a missing value.

When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.
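
The weight-summing behaviour can be sketched as follows. The local WeightedLabel case class is a
stand-in defined for this example, and NHotWeightedSketch and transform are hypothetical names;
unseen labels are simply skipped here.

```scala
object NHotWeightedSketch {
  // Local stand-in for a weighted label: a name and its weight.
  final case class WeightedLabel(name: String, value: Double)

  // Each column holds the sum of the weights of its label within the row.
  def transform(row: Seq[WeightedLabel], vocab: Map[String, Int]): Array[Double] = {
    val out = Array.fill(vocab.size)(0.0)
    row.foreach(l => vocab.get(l.name).foreach(i => out(i) += l.value))
    out
  }
}
```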

Transform vector features by normalizing each vector to have unit norm. Parameter p specifies
the p-norm used for normalization (default 2).

Missing values are transformed to zero vectors.

When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
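
A minimal sketch of p-norm normalization, with hypothetical names NormalizerSketch and
normalize. Returning a zero vector for zero-norm input is a choice made for this example.

```scala
object NormalizerSketch {
  // Divides every component by the p-norm of the vector.
  def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
    val norm = math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
    if (norm == 0.0) v.map(_ => 0.0) // assumption: leave an all-zero vector as zeros
    else v.map(_ / norm)
  }
}
```

With the default p = 2, the vector (3.0, 4.0) has norm 5.0 and normalizes to roughly (0.6, 0.8).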

Transform a collection of categorical features to binary columns, with at most a single
one-value.

Missing values are either transformed to zero vectors or encoded as a missing value.

When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.

Transform vector features by expanding them into a polynomial space, which is formulated by an
n-degree combination of original dimensions.

Missing values are transformed to zero vectors.

When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
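
One way to picture the expansion is as the list of all monomials of total degree 1 to n over the
input dimensions. The sketch below enumerates them in plain Scala; PolynomialExpansionSketch and
expand are hypothetical names, and the output ordering is an assumption that may differ from the
ordering featran produces.

```scala
object PolynomialExpansionSketch {
  // All monomials of total degree 1..degree (ordering is an assumption).
  def expand(v: Array[Double], degree: Int): Array[Double] = {
    // products of d coordinates whose indices are non-decreasing and start at `from`
    def monomials(d: Int, from: Int): Seq[Double] =
      if (d == 0) Seq(1.0)
      else (from until v.length).flatMap(i => monomials(d - 1, i).map(_ * v(i)))
    (1 to degree).flatMap(d => monomials(d, 0)).toArray
  }
}
```

For the vector (x, y) = (2.0, 3.0) and degree 2 this yields x, y, x*x, x*y, y*y, that is
2.0, 3.0, 4.0, 6.0, 9.0.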

Transform a collection of categorical features to a single value that is the position
of that feature within the complete set of categories.

Missing values are transformed to zero, so they may collide with the first position.
Rejections can be used to remove this case.

When using aggregated feature summary from a previous session, unseen labels are ignored and
FeatureRejection.Unseen rejections are reported.
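
The collision mentioned above is easy to see in a small sketch. PositionEncoderSketch, fit and
transform are hypothetical names, and assigning positions by sorted label order is an
assumption.

```scala
object PositionEncoderSketch {
  // Assumption: positions follow the sorted order of the distinct labels.
  def fit(labels: Seq[String]): Map[String, Int] =
    labels.distinct.sorted.zipWithIndex.toMap

  // Missing or unseen labels become 0.0, the same value as the first position.
  def transform(label: Option[String], positions: Map[String, Int]): Double =
    label.flatMap(positions.get).map(_.toDouble).getOrElse(0.0)
}
```

Both the first category and a missing value encode to 0.0 here, which is why rejections are
needed to tell them apart.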

Transform a column of continuous features to n columns of binned categorical features. The
number of bins is set by the numBuckets parameter.

The bin ranges are chosen using Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.

Missing values are transformed to zero vectors.

When using aggregated feature summary from a previous session, values outside of previously seen
[min, max] are binned into the first or last bucket and FeatureRejection.OutOfBound
rejections are reported.

Reject values in the first and/or last quantiles defined by the number of buckets in the
numBuckets parameter.

The bin ranges are chosen using Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.

All values are transformed to zeros.

Values in the first and/or last quantiles are rejected as FeatureRejection.Outlier.

When using aggregated feature summary from a previous session, values outside of previously
seen [min, max] are also reported as FeatureRejection.Outlier rejections.

Transform features by normalizing each feature to have unit standard deviation and/or zero
mean. When withStd is true, it scales the data to unit standard deviation. When withMean is
true, it centers the data with mean before scaling.

Missing values are transformed to 0.0 if withMean is true or to the population mean otherwise.
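
The sketch below shows one reading of these flags that is consistent with the missing-value
rule: when withMean is false, the scaled value keeps the original mean as its centre, so a
missing value maps to the mean. This interpretation, the use of population variance, the flag
defaults, and the names StandardScalerSketch, fit and transform are all assumptions.

```scala
object StandardScalerSketch {
  // Summary: (mean, population standard deviation).
  def fit(data: Seq[Double]): (Double, Double) = {
    val mean = data.sum / data.length
    val variance = data.map(x => (x - mean) * (x - mean)).sum / data.length
    (mean, math.sqrt(variance))
  }

  def transform(
    x: Option[Double],
    mean: Double,
    stddev: Double,
    withStd: Boolean = true,
    withMean: Boolean = false
  ): Double =
    x match {
      case None => if (withMean) 0.0 else mean
      case Some(v) =>
        (withStd, withMean) match {
          case (true, true)   => (v - mean) / stddev        // centred and scaled
          case (true, false)  => (v - mean) / stddev + mean // scaled around the mean
          case (false, true)  => v - mean                   // centred only
          case (false, false) => v                          // unchanged
        }
    }
}
```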

Transform a collection of categorical features to binary columns, with at most a single
one-value. Only the top N items are tracked.

The list of top N is estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, this estimate is within eps * N of the true frequency (i.e.,
true frequency <= estimate <= true frequency + eps * N), where N is the total size of the
input collection.

Missing values are either transformed to zero vectors or encoded as __unknown__.

Base class for feature transformers.

Input values are converted into intermediate type B, aggregated, and converted to summary type
C. The summary type C is then used to transform input values into features.

Type Params

A: input type
B: aggregator intermediate type
C: aggregator summary type

Value Params

name: feature name
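
The flow from A through B to C can be pictured with a much simplified trait. MiniTransformer,
its method names and the MeanCenter example are hypothetical and written for this illustration;
the real base class has a different and richer interface.

```scala
object TransformerSketch {
  // Hypothetical, simplified shape of the A -> B -> C pipeline.
  trait MiniTransformer[A, B, C] {
    def name: String
    def prepare(a: A): B              // input -> intermediate
    def combine(x: B, y: B): B        // aggregate intermediates
    def present(b: B): C              // intermediate -> summary
    def transform(a: A, c: C): Double // input + summary -> feature

    // Requires non-empty data, since reduce has no zero value.
    def fitTransform(data: Seq[A]): Seq[Double] = {
      val summary = present(data.map(prepare).reduce(combine))
      data.map(transform(_, summary))
    }
  }

  // A = Double, B = (sum, count), C = mean
  object MeanCenter extends MiniTransformer[Double, (Double, Long), Double] {
    val name = "mean_center"
    def prepare(a: Double): (Double, Long) = (a, 1L)
    def combine(x: (Double, Long), y: (Double, Long)): (Double, Long) =
      (x._1 + y._1, x._2 + y._2)
    def present(b: (Double, Long)): Double = b._1 / b._2
    def transform(a: Double, c: Double): Double = a - c
  }
}
```

Keeping the intermediate type B separate from the summary type C lets the aggregation carry
extra state (here a running sum and count) that the final transformation does not need.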

Transform fixed-length vectors by passing them through.

Similar to Identity but for a sequence of doubles.

Missing values are transformed to zero vectors.

When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.

Transform a column of continuous features that represent the mean of a von Mises distribution
to n columns of continuous features. The number n represents the number of points at which to
evaluate the von Mises distribution. The von Mises pdf is given by

f(x | mu, kappa, scale) = exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))

and is only valid for x, mu in the interval [0, 2*pi/scale]. Here I0 is the modified Bessel
function of the first kind of order 0.
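
The density can be evaluated with a short power-series approximation of the order-0 modified
Bessel function. This is a sketch, not featran's implementation: VonMisesSketch, besselI0, pdf
and evaluate are hypothetical names, and the series cutoff is an arbitrary choice.

```scala
object VonMisesSketch {
  // Order-0 modified Bessel function via its power series:
  // sum over k >= 0 of ((x / 2)^2)^k / (k!)^2
  def besselI0(x: Double): Double = {
    var term = 1.0
    var sum = 1.0
    var k = 1
    while (k < 200 && term > 1e-16 * sum) {
      term *= (x / 2) * (x / 2) / (k.toDouble * k)
      sum += term
      k += 1
    }
    sum
  }

  def pdf(x: Double, mu: Double, kappa: Double, scale: Double): Double =
    math.exp(kappa * math.cos(scale * (x - mu))) / (2 * math.Pi * besselI0(kappa))

  // One output column per evaluation point; mu is the input feature value.
  def evaluate(mu: Double, kappa: Double, scale: Double, points: Array[Double]): Array[Double] =
    points.map(p => pdf(p, mu, kappa, scale))
}
```

With kappa = 0 the density is flat at 1 / (2 * pi); larger kappa concentrates it around mu.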

Weighted label. Can also be thought of as a weighted value in a named sparse vector.
