package com.spotify.featran.transformers
Type members
Classlikes
Transform numerical features to binary features.
Feature values greater than threshold are binarized to 1.0; values equal to or less than
threshold are binarized to 0.0.
Missing values are binarized to 0.0.
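The thresholding rule above can be sketched in plain Scala. This is a minimal illustration, not the featran API: `BinarizerSketch` and `binarize` are hypothetical names, and missing values are modelled as `None`.

```scala
object BinarizerSketch {
  // Hypothetical helper, not the featran API: values strictly greater than
  // the threshold map to 1.0; everything else, including missing, maps to 0.0.
  def binarize(value: Option[Double], threshold: Double): Double =
    value match {
      case Some(v) if v > threshold => 1.0
      case _                        => 0.0
    }

  def main(args: Array[String]): Unit = {
    val inputs = Seq(Some(-1.0), Some(0.0), Some(0.5), None)
    println(inputs.map(binarize(_, 0.0))) // prints List(0.0, 0.0, 1.0, 0.0)
  }
}
```

Note that a value exactly equal to the threshold falls on the 0.0 side.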
Transform a column of continuous features to n columns of feature buckets.
With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range
[x,y) except the last bucket, which also includes y. Splits should be strictly increasing.
Values at -inf, inf must be explicitly provided to cover all double values; otherwise, a
FeatureRejection.OutOfBound rejection will be reported for values outside the splits
specified. Two examples of splits are
Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note that if you have no idea of the upper and lower bounds of the targeted column, you should
add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to
prevent a potential FeatureRejection.OutOfBound rejection.
Note also that the splits that you provided have to be in strictly increasing order, i.e.
s0 < s1 < s2 < ... < sn.
Missing values are transformed to zero vectors.
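The bucket rule above (half-open ranges, with the last bucket closed on the right) can be sketched as follows. `BucketizerSketch`, `bucketIndex` and `encode` are hypothetical names, not the featran API. The sketch assumes that out-of-bound values produce a zero vector; the text above only states that a rejection is reported for them.

```scala
object BucketizerSketch {
  // Hypothetical helper, not the featran API. Returns the bucket index for x,
  // or None when x is NaN or lies outside the splits (the out-of-bound case).
  def bucketIndex(splits: Array[Double], x: Double): Option[Int] = {
    require(splits.length >= 2, "need at least two splits")
    require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")
    val n = splits.length - 1 // n + 1 splits define n buckets
    if (x.isNaN || x < splits.head || x > splits.last) None
    else if (x == splits.last) Some(n - 1) // the last bucket also includes its upper bound
    else Some(splits.indexWhere(_ > x) - 1) // the first split above x closes the bucket [lo, hi)
  }

  // One-hot row with n columns. Missing values give a zero vector; treating
  // out-of-bound values the same way is an assumption of this sketch.
  def encode(splits: Array[Double], value: Option[Double]): Array[Double] = {
    val row = Array.fill(splits.length - 1)(0.0)
    value.flatMap(bucketIndex(splits, _)).foreach(i => row(i) = 1.0)
    row
  }
}
```

With `Array(0.0, 1.0, 2.0)`, the value 1.0 lands in the second bucket and 2.0 also lands in the second (last) bucket, while 2.5 is out of bound.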
Transform a collection of categorical features to binary columns, with at most N one-values.
Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU
and memory overhead.
Missing values are transformed to zero vectors.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:
{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
}}}
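A minimal sketch of hashed n-hot encoding, using `scala.util.hashing.MurmurHash3` from the Scala standard library. `HashNHotSketch`, `bucket` and `encode` are hypothetical names. Whether the library uses this exact hash call and modulo scheme is an assumption.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotSketch {
  // Map a label to one of `buckets` columns. The second modulo keeps the
  // index non-negative, since the hash itself may be negative.
  def bucket(label: String, buckets: Int): Int =
    (MurmurHash3.stringHash(label) % buckets + buckets) % buckets

  // Hypothetical helper: n-hot row where each label sets its hashed column to 1.0.
  // Two labels that collide share a column, so the row may have fewer ones than labels.
  def encode(labels: Seq[String], buckets: Int): Array[Double] = {
    val row = Array.fill(buckets)(0.0)
    labels.foreach(l => row(bucket(l, buckets)) = 1.0)
    row
  }
}
```

No vocabulary has to be collected or stored, which is where the CPU and memory savings come from. The cost is collisions, which a larger bucket count reduces, as the table above shows.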
Transform a collection of weighted categorical features to columns of weight sums, with at
most N values. Similar to NHotWeightedEncoder but uses MurmurHash3 to hash features into
buckets to reduce CPU and memory overhead.
Weights of the same labels in a row are summed, instead of being encoded as 1.0 as is the case
with the normal NHotEncoder.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:
{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
}}}
Transform a collection of categorical features to binary columns, with at most a single
one-value. Similar to OneHotEncoder but uses MurmurHash3 to hash features into buckets to
reduce CPU and memory overhead.
Missing values are transformed to zero vectors.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544
English words:
{{{
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
}}}
Transform a collection of categorical features to 2 columns, one for rank and one for count.
Only the top heavyHittersCount items are tracked, with 1.0 being the most frequent rank, 2.0
the second most, etc. All other items are transformed to [0.0, 0.0].
Ranks and frequencies are estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, this estimate is within eps * N of the true frequency (i.e.,
true frequency <= estimate <= true frequency + eps * N), where N is the total size of the
input collection.
Missing values are transformed to [0.0, 0.0].
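The rank and count encoding can be illustrated with exact counting. This sketch deliberately swaps in a plain `groupBy` for the approximate SketchMap, so it shows the output shape only, not the estimation error. `HeavyHittersSketch`, `summarize` and `encode` are hypothetical names.

```scala
object HeavyHittersSketch {
  // Exact counting stands in for the approximate SketchMap named above.
  // Returns label -> (rank, count) for the top `k` labels; rank 1.0 is the most frequent.
  def summarize(labels: Seq[String], k: Int): Map[String, (Double, Double)] =
    labels
      .groupBy(identity)
      .map { case (label, xs) => (label, xs.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) } // ties broken by label for determinism
      .take(k)
      .zipWithIndex
      .map { case ((label, count), i) => (label, ((i + 1).toDouble, count.toDouble)) }
      .toMap

  // Untracked and missing labels both map to [0.0, 0.0].
  def encode(summary: Map[String, (Double, Double)], label: Option[String]): Seq[Double] =
    label.flatMap(summary.get) match {
      case Some((rank, count)) => Seq(rank, count)
      case None                => Seq(0.0, 0.0)
    }
}
```

For the input `a, b, a, c, a, b` with `k = 2`, label `a` gets rank 1.0 with count 3.0, `b` gets rank 2.0 with count 2.0, and `c` falls outside the tracked set.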
Reject values that fall more than factor * IQR below the first quartile or more than
factor * IQR above the third quartile. The IQR, or interquartile range, is the range between
the first and the third quartiles.
The bin ranges are chosen using the Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.
All values are transformed to zeros.
Values more than factor * IQR below the first quartile or more than factor * IQR above the
third quartile are rejected as FeatureRejection.Outlier.
When using aggregated feature summary from a previous session, values outside of previously
seen [min, max] will also report FeatureRejection.Outlier as rejection.
Transform an optional 1D feature to an indicator variable indicating presence.
Missing values are mapped to 0.0. Present values are mapped to 1.0.
Transform a column of continuous labelled features to n columns of binned categorical features.
The optimum number of bins is computed using Minimum Description Length (MDL), which is an
entropy measurement between the values and the targets.
The transformer expects an MDLRecord where the first field is a label and the second value
is the scalar that will be transformed into buckets.
MDL is an iterative algorithm so all of the data needed to compute the buckets will be pulled
into memory. If you run into memory issues the sampleRate parameter should be lowered.
References:
- Fayyad, U., & Irani, K. (1993). "Multi-interval discretization of continuous-valued attributes
for classification learning."
Transform features by rescaling each feature to range [-1, 1] by dividing through the maximum
absolute value in each feature.
Missing values are transformed to 0.0.
When using aggregated feature summary from a previous session, out of bound values are
truncated to -1.0 or 1.0 and FeatureRejection.OutOfBound rejections are reported.
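A minimal sketch of max-abs scaling with the truncation behaviour described above. `MaxAbsSketch` and `scale` are hypothetical names, and `maxAbs` stands for the maximum absolute value taken from the aggregated summary.

```scala
object MaxAbsSketch {
  // Hypothetical helper: `maxAbs` is the largest absolute value seen during aggregation.
  def scale(value: Option[Double], maxAbs: Double): Double = {
    require(maxAbs > 0.0, "maxAbs must be positive")
    value match {
      case None    => 0.0 // missing values become 0.0
      case Some(v) => math.max(-1.0, math.min(1.0, v / maxAbs)) // truncate out-of-bound values
    }
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(-4.0, 1.0, 2.0)
    val maxAbs = xs.foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))
    println(xs.map(x => scale(Some(x), maxAbs)))
  }
}
```

Truncation only matters when the summary comes from an earlier session, because values from the session that produced `maxAbs` cannot exceed it.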
Transform features by rescaling each feature to a specific range [min, max] (default [0, 1]).
Missing values are transformed to min.
When using aggregated feature summary from a previous session, out of bound values are
truncated to min or max and FeatureRejection.OutOfBound rejections are reported.
Transform a collection of sentences, where each row is a Seq[String] of the words / tokens,
into a collection containing all the n-grams that can be constructed from each row. The feature
representation is an n-hot encoding (see NHotEncoder) constructed from an expanded
vocabulary of all of the generated n-grams.
N-grams are generated based on a specified range of low to high (inclusive) and are joined
by the given sep (default is " "). For example, with low = 2, high = 3 and sep = "", row
["a", "b", "c", "d", "e"] would produce ["ab", "bc", "cd", "de", "abc", "bcd", "cde"].
As with NHotEncoder, missing values are transformed to [0.0, 0.0, ...].
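The n-gram expansion step can be sketched with `sliding` windows. `NGramsSketch` and `ngrams` are hypothetical names, not the featran API. The sketch covers only the generation of the expanded vocabulary, not the n-hot encoding that follows it.

```scala
object NGramsSketch {
  // Hypothetical helper: all n-grams for n in [low, high], shortest first, joined with `sep`.
  def ngrams(tokens: Seq[String], low: Int, high: Int, sep: String = " "): Seq[String] = {
    require(low >= 1 && low <= high, "need 1 <= low <= high")
    for {
      n <- low to high
      // sliding(n) yields one short window when tokens.size < n, so drop partial windows
      gram <- tokens.sliding(n).filter(_.size == n).toSeq
    } yield gram.mkString(sep)
  }

  def main(args: Array[String]): Unit =
    println(ngrams(Seq("a", "b", "c", "d", "e"), low = 2, high = 3, sep = ""))
}
```

Run on the row from the example above, this yields the same seven n-grams in the same order: ab, bc, cd, de, abc, bcd, cde.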
Transform a collection of categorical features to binary columns, with at most N one-values.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.
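A minimal sketch of n-hot encoding against a fixed vocabulary. `NHotSketch` and `encode` are hypothetical names, and the sketch covers only the zero-vector behaviour for unseen labels, not the `__unknown__` column.

```scala
object NHotSketch {
  // Hypothetical helper: `vocabulary` holds the labels seen during aggregation,
  // one output column per label, in order.
  def encode(vocabulary: Seq[String], labels: Seq[String]): Array[Double] = {
    val index = vocabulary.zipWithIndex.toMap
    val row = Array.fill(vocabulary.size)(0.0)
    // Unseen labels have no column and leave the row untouched.
    labels.foreach(l => index.get(l).foreach(i => row(i) = 1.0))
    row
  }
}
```

A row containing the same label twice still produces a single 1.0 in that column.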
Transform a collection of weighted categorical features to columns of weight sums, with at most
N values.
Weights of the same labels in a row are summed, instead of being encoded as 1.0 as is the case
with the normal NHotEncoder.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.
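The weight-summing rule can be sketched as follows. `NHotWeightedSketch` and `encode` are hypothetical names, and weighted labels are modelled as plain `(String, Double)` tuples for illustration rather than the library's own input type.

```scala
object NHotWeightedSketch {
  // Hypothetical helper: each column holds the sum of the weights of its label in the row.
  def encode(vocabulary: Seq[String], weighted: Seq[(String, Double)]): Array[Double] = {
    val index = vocabulary.zipWithIndex.toMap
    val row = Array.fill(vocabulary.size)(0.0)
    weighted.foreach { case (label, w) =>
      index.get(label).foreach(i => row(i) += w) // repeated labels accumulate
    }
    row
  }
}
```

A row with `("a", 0.5)` and `("a", 0.25)` therefore yields 0.75 in the column for `a`, where the unweighted encoder would yield 1.0.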
Transform vector features by normalizing each vector to have unit norm. Parameter p specifies
the p-norm used for normalization (default 2).
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
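A minimal sketch of p-norm normalization. `NormalizerSketch` and `normalize` are hypothetical names. Leaving an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```scala
object NormalizerSketch {
  // Hypothetical helper: divide each component by the vector's p-norm,
  // (sum of |x_i|^p)^(1/p).
  def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
    require(p >= 1.0, "p must be at least 1")
    val norm = math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
    if (norm == 0.0) v.clone() else v.map(_ / norm) // assumption: zero vectors stay as they are
  }
}
```

With the default `p = 2`, the vector (3, 4) has norm 5 and normalizes to approximately (0.6, 0.8). With `p = 1`, the absolute values of the result sum to 1.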
Transform a collection of categorical features to binary columns, with at most a single
one-value.
Missing values are either transformed to zero vectors or encoded as a missing value.
When using aggregated feature summary from a previous session, unseen labels are either
transformed to zero vectors or encoded as __unknown__ (if encodeMissingValue is true) and
FeatureRejection.Unseen rejections are reported.
Transform vector features by expanding them into a polynomial space, which is formulated by an
n-degree combination of original dimensions.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
Transform a collection of categorical features to a single value that is the position
of that feature within the complete set of categories.
Missing values are transformed to zeros, so they may collide with the first position.
Rejections can be used to remove this case.
When using aggregated feature summary from a previous session, unseen labels are ignored and
FeatureRejection.Unseen rejections are reported.
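The collision between missing values and the first position is easy to see in a small sketch. `PositionEncoderSketch` and `encode` are hypothetical names, and mapping unseen labels to 0.0 is an assumption about what "ignored" means in the output.

```scala
object PositionEncoderSketch {
  // Hypothetical helper: the position of `label` within `vocabulary`, as a Double.
  // Missing and (by assumption) unseen labels both fall back to 0.0.
  def encode(vocabulary: Seq[String], label: Option[String]): Double =
    label
      .flatMap { l =>
        vocabulary.indexOf(l) match {
          case -1 => None
          case i  => Some(i.toDouble)
        }
      }
      .getOrElse(0.0)
}
```

Both `Some("a")` and `None` encode to 0.0 when `a` is the first category, which is why rejections are needed to tell the two cases apart.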
Transform a column of continuous features to n columns of binned categorical features. The
number of bins is set by the numBuckets parameter.
The bin ranges are chosen using the Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, values outside of previously seen
[min, max] are binned into the first or last bucket and FeatureRejection.OutOfBound
rejections are reported.
Reject values in the first and/or last quantiles defined by the number of buckets in the
numBuckets parameter.
The bin ranges are chosen using the Algebird's QTree approximate data structure. The precision
of the approximation can be controlled with the k parameter.
All values are transformed to zeros.
Values in the first and/or last quantiles are rejected as FeatureRejection.Outlier.
When using aggregated feature summary from a previous session, values outside of previously
seen [min, max] will also report FeatureRejection.Outlier as rejection.
case class Settings(cls: String, name: String, params: Map[String, String], featureNames: Seq[String], aggregators: Option[String])
Transform features by normalizing each feature to have unit standard deviation and/or zero
mean. When withStd is true, it scales the data to unit standard deviation. When withMean
is true, it centers the data with mean before scaling.
Missing values are transformed to 0.0 if withMean is true or population mean otherwise.
Transform a collection of categorical features to binary columns, with at most a single
one-value. Only the top N items are tracked.
The list of top N is estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, this estimate is within eps * N of the true frequency (i.e.,
true frequency <= estimate <= true frequency + eps * N), where N is the total size of the
input collection.
Missing values are either transformed to zero vectors or encoded as __unknown__.
Base class for feature transformers.
Input values are converted into intermediate type B, aggregated, and converted to summary
type C. The summary type C is then used to transform input values into features.
Type Params
- A: input type
- B: aggregator intermediate type
- C: aggregator summary type
Value Params
- name: feature name
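The A, B, C flow described above can be sketched with a small trait. Everything here is hypothetical: `TransformerFlowSketch`, `Sketch` and its method names are illustrative only and are not the signatures of the base class. Max-abs scaling is used as the example instance.

```scala
object TransformerFlowSketch {
  // Hypothetical trait, not the library's base class: A is the input type,
  // B the intermediate aggregation type, C the summary type.
  trait Sketch[A, B, C] {
    def prepare(a: A): B               // input -> intermediate
    def combine(x: B, y: B): B         // aggregate intermediates
    def present(b: B): C               // intermediate -> summary
    def transform(a: A, c: C): Double  // use the summary to produce a feature

    // Requires a non-empty input, since reduce fails on an empty collection.
    def run(inputs: Seq[A]): Seq[Double] = {
      val summary = present(inputs.map(prepare).reduce(combine))
      inputs.map(transform(_, summary))
    }
  }

  // Max-abs scaling in that shape: B and C are both the running maximum absolute value.
  object MaxAbs extends Sketch[Double, Double, Double] {
    def prepare(a: Double): Double = math.abs(a)
    def combine(x: Double, y: Double): Double = math.max(x, y)
    def present(b: Double): Double = b
    def transform(a: Double, c: Double): Double = a / c
  }
}
```

Separating aggregation from transformation is what allows a summary computed in one session to be reused on new data in a later one.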
Takes fixed length vectors by passing them through.
Similar to Identity but for a sequence of doubles.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions
are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
Transform a column of continuous features that represent the mean of a von Mises distribution
to n columns of continuous features. The number n represents the number of points at which to
evaluate the von Mises distribution. The von Mises pdf is given by
f(x | mu, kappa, scale) = exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
where I0 is the modified Bessel function of the first kind of order zero. The pdf is only
valid for x, mu in the interval [0, 2*pi/scale].
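The density above can be evaluated directly in plain Scala. `VonMisesSketch`, `besselI0` and `pdf` are hypothetical names. The Bessel function is computed from its power series with a fixed number of terms, which is an implementation choice of this sketch and is adequate for moderate kappa only.

```scala
object VonMisesSketch {
  // Modified Bessel function of the first kind, order zero, via its power series:
  // I0(x) = sum over k >= 0 of ((x / 2)^k / k!)^2.
  def besselI0(x: Double, terms: Int = 50): Double = {
    var term = 1.0 // holds (x / 2)^k / k!, starting at k = 0
    var sum = 1.0
    for (k <- 1 to terms) {
      term *= (x / 2.0) / k
      sum += term * term
    }
    sum
  }

  // f(x | mu, kappa, scale) as written in the formula above.
  def pdf(x: Double, mu: Double, kappa: Double, scale: Double): Double =
    math.exp(kappa * math.cos(scale * (x - mu))) / (2.0 * math.Pi * besselI0(kappa))

  // Evaluate the density at each of the n given points for a single mean `mu`.
  def evaluate(mu: Double, kappa: Double, scale: Double, points: Seq[Double]): Seq[Double] =
    points.map(pdf(_, mu, kappa, scale))
}
```

With `kappa = 0` the density is flat at 1 / (2 * pi). As kappa grows, the mass concentrates around `mu`.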