Base class for feature transformers.
Weighted label. Can also be thought of as a weighted value in a named sparse vector.
Transform numerical features to binary features.
Feature values greater than threshold are binarized to 1.0; values equal to or less than threshold are binarized to 0.0.
Missing values are binarized to 0.0.
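The thresholding rule above can be sketched in a few lines of plain Scala. This is only an illustration of the rule as described, not the library's API: the object name BinarizerSketch, the binarize helper, and the use of Option to model a missing value are all assumptions.

```scala
object BinarizerSketch {
  // Hypothetical helper (not the library's API); None models a missing value.
  def binarize(value: Option[Double], threshold: Double): Double =
    value match {
      case Some(v) if v > threshold => 1.0 // strictly greater than threshold
      case _                        => 0.0 // equal, smaller, or missing
    }
}
```

For example, with a threshold of 0.5 the value 0.6 maps to 1.0, while 0.5 itself and a missing value both map to 0.0.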
Transform a column of continuous features to n columns of feature buckets.
With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all double values; otherwise, a FeatureRejection.OutOfBound rejection will be reported for values outside the splits specified. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note that if you have no idea of the upper and lower bounds of the targeted column, you should add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to prevent a potential FeatureRejection.OutOfBound rejection.
Note also that the splits that you provide have to be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn.
Missing values are transformed to zero vectors.
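The bucket boundaries described above can be illustrated with a small self-contained sketch. The names BucketizerSketch, bucketIndex, and encode are hypothetical, and returning a zero vector for an out-of-bound value is an assumption of this sketch; the text above only states that a rejection is reported in that case.

```scala
object BucketizerSketch {
  // Bucket index for v, or None when v lies outside [splits.head, splits.last].
  def bucketIndex(splits: Array[Double], v: Double): Option[Int] = {
    require(splits.length >= 2, "need at least two splits")
    require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")
    if (v.isNaN || v < splits.head || v > splits.last) None
    else if (v == splits.last) Some(splits.length - 2) // last bucket also includes its upper split
    else Some(splits.lastIndexWhere(_ <= v))           // [x, y) for every other bucket
  }

  // One column per bucket; None models a missing value and yields a zero vector.
  def encode(splits: Array[Double], value: Option[Double]): Array[Double] = {
    val out = Array.fill(splits.length - 1)(0.0)
    value.flatMap(bucketIndex(splits, _)).foreach { i => out(i) = 1.0 }
    out
  }
}
```

With Array(0.0, 1.0, 2.0) there are two buckets: 0.0 falls into the first, 1.0 and 2.0 into the second, and 2.5 is out of bounds.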
Transform a collection of categorical features to binary columns, with at most N one-values. Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.
Missing values are transformed to [0.0, 0.0, ...].
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:
sizeScalingFactor   % Collisions
-----------------   ------------
2                   17.9934%
4                   10.5686%
8                   5.7236%
16                  3.0019%
32                  1.5313%
64                  0.7864%
128                 0.3920%
256                 0.1998%
512                 0.0975%
1024                0.0478%
2048                0.0236%
4096                0.0071%
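As a rough illustration of hashing labels into a fixed number of buckets, the sketch below uses scala.util.hashing.MurmurHash3 from the Scala standard library. The object HashNHotSketch, its methods, the default seed, and the floor-modulo mapping to a bucket are assumptions made for this example; the library's exact hashing scheme is not specified in the text above.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotSketch {
  // Map a label to a bucket in [0, hashBucketSize); floorMod keeps negative hashes in range.
  def bucket(label: String, hashBucketSize: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(label), hashBucketSize)

  // Set a 1.0 in every bucket hit by at least one label; an empty input yields all zeros.
  def encode(labels: Seq[String], hashBucketSize: Int): Array[Double] = {
    val out = Array.fill(hashBucketSize)(0.0)
    labels.foreach { l => out(bucket(l, hashBucketSize)) = 1.0 }
    out
  }
}
```

Two distinct labels may land in the same bucket, which is the collision rate the table above measures; a larger hashBucketSize makes that less likely at the cost of wider output.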
Transform a collection of weighted categorical features to columns of weight sums, with at most N values. Similar to NHotWeightedEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.
Weights of the same label in a row are summed, instead of the label being encoded as 1.0 as is the case with the normal NHotEncoder.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:
sizeScalingFactor   % Collisions
-----------------   ------------
2                   17.9934%
4                   10.5686%
8                   5.7236%
16                  3.0019%
32                  1.5313%
64                  0.7864%
128                 0.3920%
256                 0.1998%
512                 0.0975%
1024                0.0478%
2048                0.0236%
4096                0.0071%
Transform a collection of categorical features to binary columns, with at most a single one-value. Similar to OneHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.
Missing values are transformed to [0.0, 0.0, ...].
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.
Rough table of relationship of scaling factor to % collisions, measured from a corpus of 466544 English words:
sizeScalingFactor   % Collisions
-----------------   ------------
2                   17.9934%
4                   10.5686%
8                   5.7236%
16                  3.0019%
32                  1.5313%
64                  0.7864%
128                 0.3920%
256                 0.1998%
512                 0.0975%
1024                0.0478%
2048                0.0236%
4096                0.0071%
Transform features by passing them through.
Missing values are transformed to 0.0.
Transform features by rescaling each feature to range [-1, 1] by dividing by the maximum absolute value in each feature.
Missing values are transformed to 0.0.
When using aggregated feature summary from a previous session, out of bound values are truncated to -1.0 or 1.0 and FeatureRejection.OutOfBound rejections are reported.
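A minimal sketch of this scaling rule, assuming the maximum absolute value has already been computed in a previous session. The object MaxAbsScalerSketch and the Boolean flag standing in for a FeatureRejection.OutOfBound report are illustrative only.

```scala
object MaxAbsScalerSketch {
  // Returns (scaled value, outOfBound flag); None models a missing value.
  def scale(value: Option[Double], maxAbs: Double): (Double, Boolean) = {
    require(maxAbs > 0.0, "maxAbs must be positive")
    value match {
      case None => (0.0, false)
      case Some(v) =>
        val s = v / maxAbs
        if (s > 1.0) (1.0, true)        // truncate values above the previously seen maximum
        else if (s < -1.0) (-1.0, true) // truncate values below the negated maximum
        else (s, false)
    }
  }
}
```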
Transform features by rescaling each feature to a specific range [min, max] (default [0, 1]).
Missing values are transformed to min.
When using aggregated feature summary from a previous session, out of bound values are truncated to min or max and FeatureRejection.OutOfBound rejections are reported.
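The rescaling can be sketched with the usual min-max formula. That the library uses exactly this formula is an assumption; dataMin and dataMax are hypothetical names for the bounds observed in a previous session, and the Boolean flag stands in for a FeatureRejection.OutOfBound report.

```scala
object MinMaxScalerSketch {
  // Returns (scaled value, outOfBound flag); None models a missing value.
  def scale(
    value: Option[Double],
    dataMin: Double,
    dataMax: Double,
    min: Double = 0.0,
    max: Double = 1.0
  ): (Double, Boolean) = {
    require(dataMax > dataMin, "dataMax must exceed dataMin")
    require(max > min, "max must exceed min")
    value match {
      case None => (min, false) // missing values map to min
      case Some(v) =>
        val s = (v - dataMin) / (dataMax - dataMin) * (max - min) + min
        if (s < min) (min, true) else if (s > max) (max, true) else (s, false)
    }
  }
}
```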
Transform a collection of categorical features to binary columns, with at most N one-values.
Missing values are transformed to [0.0, 0.0, ...].
When using aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
Transform a collection of weighted categorical features to columns of weight sums, with at most N values.
Weights of the same label in a row are summed, instead of the label being encoded as 1.0 as is the case with the normal NHotEncoder.
Missing values are transformed to [0.0, 0.0, ...].
When using aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
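To make the weight-summing behaviour concrete, here is a small sketch. The Weighted case class is a stand-in for a weighted label, the column order simply follows the order of the known labels, and unseen names are returned alongside the vector in place of FeatureRejection.Unseen reports; all three choices are assumptions of this example.

```scala
object NHotWeightedSketch {
  // Hypothetical stand-in for a weighted label: a name plus its weight.
  final case class Weighted(name: String, value: Double)

  // Returns the encoded row and the names that were not among the known labels.
  def encode(known: Seq[String], row: Seq[Weighted]): (Array[Double], Seq[String]) = {
    val index = known.zipWithIndex.toMap
    val out = Array.fill(known.length)(0.0)
    val unseen = Seq.newBuilder[String]
    row.foreach { w =>
      index.get(w.name) match {
        case Some(i) => out(i) += w.value // weights of the same label are summed
        case None    => unseen += w.name  // unseen labels are ignored in the output
      }
    }
    (out, unseen.result())
  }
}
```

Given known labels a, b, c and a row with a at 0.5, a at 0.25, and z at 1.0, the output is [0.75, 0.0, 0.0] and z is reported as unseen.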
Transform vector features by normalizing each vector to have unit norm. Parameter p specifies the p-norm used for normalization (default 2).
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
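A sketch of p-norm normalization under these rules follows. The object NormalizerSketch and the expected dimension parameter dim are hypothetical, and leaving an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```scala
object NormalizerSketch {
  // None models a missing value; vectors of the wrong dimension also yield a zero vector.
  def normalize(value: Option[Array[Double]], dim: Int, p: Double = 2.0): Array[Double] = {
    require(p >= 1.0, "p must be at least 1")
    value match {
      case Some(x) if x.length == dim =>
        // p-norm: (sum of |x_i|^p)^(1/p)
        val norm = math.pow(x.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
        if (norm == 0.0) x.clone() else x.map(_ / norm)
      case _ => Array.fill(dim)(0.0)
    }
  }
}
```

With the default p of 2, the vector (3.0, 4.0) has norm 5 and is mapped to approximately (0.6, 0.8).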
Transform a collection of categorical features to binary columns, with at most a single one-value.
Missing values are transformed to [0.0, 0.0, ...].
When using aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
Transform vector features by expanding them into a polynomial space, which is formulated by an n-degree combination of original dimensions.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
Transform a column of continuous features to n columns of binned categorical features. The number of bins is set by the numBuckets parameter.
The bin ranges are chosen using Algebird's QTree approximate data structure. The precision of the approximation can be controlled with the k parameter.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, values outside of previously seen [min, max] are binned into the first or last bucket and FeatureRejection.OutOfBound rejections are reported.
Transform features by normalizing each feature to have unit standard deviation and/or zero mean. When withStd is true, it scales the data to unit standard deviation. When withMean is true, it centers the data with the mean before scaling.
Missing values are transformed to 0.0 if withMean is true or population mean otherwise.
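The interaction of withStd and withMean can be sketched as below, assuming the mean and standard deviation are already known. StandardScalerSketch and its parameter names other than withStd and withMean are hypothetical, and the missing-value rule simply follows the sentence above.

```scala
object StandardScalerSketch {
  // None models a missing value.
  def scale(
    value: Option[Double],
    mean: Double,
    stddev: Double,
    withStd: Boolean,
    withMean: Boolean
  ): Double =
    value match {
      case None => if (withMean) 0.0 else mean
      case Some(v) =>
        val centered = if (withMean) v - mean else v // center first
        if (withStd) centered / stddev else centered // then scale
    }
}
```

For a mean of 10.0 and a standard deviation of 2.0, the value 12.0 becomes 1.0 with both flags set, 6.0 with only withStd, and 2.0 with only withMean.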
Takes fixed length vectors by passing them through.
Similar to Identity but for a sequence of doubles.
Missing values are transformed to zero vectors.
When using aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
Transform a column of continuous features that represent the mean of a von Mises distribution to n columns of continuous features. The number n represents the number of points at which to evaluate the von Mises distribution. The von Mises pdf is given by
f(x | mu, kappa, scale) = exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
where I0 is the modified Bessel function of the first kind of order 0. The pdf is only valid for x, mu in the interval [0, 2*pi/scale].
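The density can be evaluated with a short sketch. The modified Bessel function of order 0 is computed here from its power series, which is a choice made for this example and is adequate for moderate kappa; the evaluation points passed to encode are a hypothetical parameter, since the text above does not say how the n points are chosen.

```scala
object VonMisesSketch {
  // I0(x) = sum over k >= 0 of (x^2 / 4)^k / (k!)^2, summed until terms become negligible.
  def besselI0(x: Double): Double = {
    val q = x * x / 4.0
    var term = 1.0
    var sum = 1.0
    var k = 1
    while (term > 1e-16 * sum && k < 500) {
      term *= q / (k.toDouble * k)
      sum += term
      k += 1
    }
    sum
  }

  // exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
  def pdf(x: Double, mu: Double, kappa: Double, scale: Double): Double =
    math.exp(kappa * math.cos(scale * (x - mu))) / (2.0 * math.Pi * besselI0(kappa))

  // One output column per evaluation point.
  def encode(mu: Double, kappa: Double, scale: Double, points: Array[Double]): Array[Double] =
    points.map(pdf(_, mu, kappa, scale))
}
```

With kappa set to 0 the density is flat at 1 / (2 * pi); larger kappa concentrates it around mu.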
Base class for feature transformers.
Input values are converted into intermediate type B, aggregated, and converted to summary type C. The summary type C is then used to transform input values into features.
Type parameters, in order: input type; aggregator intermediate type (B); aggregator summary type (C).
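The prepare, aggregate, summarize, transform flow described above can be illustrated with a toy trait. Everything in this sketch is hypothetical, including the trait SimpleTransformer, its method names, and the type parameter name A for the input type; it mirrors the description rather than the library's actual class.

```scala
object TransformerSketch {
  // A: input type, B: aggregator intermediate type, C: aggregator summary type.
  trait SimpleTransformer[A, B, C] {
    def prepare(a: A): B                    // input value to intermediate type
    def combine(x: B, y: B): B              // aggregate intermediate values
    def present(b: B): C                    // aggregated value to summary type
    def transform(a: A, summary: C): Double // use the summary to produce a feature

    def fitTransform(input: Seq[A]): Seq[Double] = {
      require(input.nonEmpty, "input must not be empty")
      val summary = present(input.map(prepare).reduce(combine))
      input.map(transform(_, summary))
    }
  }

  // Toy instance: scale by the maximum absolute value seen in the input.
  object MaxAbs extends SimpleTransformer[Double, Double, Double] {
    def prepare(a: Double): Double = math.abs(a)
    def combine(x: Double, y: Double): Double = math.max(x, y)
    def present(b: Double): Double = b
    def transform(a: Double, summary: Double): Double = a / summary
  }
}
```

Here the summary is the largest absolute value, so the input -4.0, 2.0, 1.0 is transformed to -1.0, 0.5, 0.25.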