transformers

Type Members

case class Settings(cls: String, name: String, params: Map[String, String], featureNames: Seq[String], aggregators: Option[String]) extends Product with Serializable
abstract class Transformer[-A, B, C] extends Serializable

Base class for feature transformers.
Input values are converted into intermediate type B, aggregated, and converted to summary type C. The summary type C is then used to transform input values into features.
A: input type
B: aggregator intermediate type
C: aggregator summary type
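
The A -> B -> C flow described above can be sketched in plain Scala. This is a minimal illustration only: MiniTransformer, its method names, and the MeanCenter example are hypothetical and are not the library's actual API.

```scala
object TransformerSketch {
  // Hypothetical miniature of the flow: inputs (A) are converted to an
  // intermediate value (B), aggregated, and presented as a summary (C).
  trait MiniTransformer[A, B, C] {
    def prepare(a: A): B                    // A => B
    def combine(x: B, y: B): B              // aggregate intermediate values
    def present(b: B): C                    // B => C (summary)
    def transform(a: A, summary: C): Double // use the summary to emit a feature

    // Assumes a non-empty input, since reduce fails on empty collections.
    def run(inputs: Seq[A]): Seq[Double] = {
      val summary = present(inputs.map(prepare).reduce(combine))
      inputs.map(a => transform(a, summary))
    }
  }

  // Example instance: B is a (sum, count) pair and C is the mean.
  object MeanCenter extends MiniTransformer[Double, (Double, Long), Double] {
    def prepare(a: Double): (Double, Long) = (a, 1L)
    def combine(x: (Double, Long), y: (Double, Long)): (Double, Long) =
      (x._1 + y._1, x._2 + y._2)
    def present(b: (Double, Long)): Double = b._1 / b._2
    def transform(a: Double, summary: Double): Double = a - summary
  }
}
```

Keeping B separate from C lets the aggregation carry more state (here, a sum and a count) than the final summary needs.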
case class WeightedLabel(name: String, value: Double) extends Product with Serializable

Weighted label. Can also be thought of as a weighted value in a named sparse vector.

Value Members

object Binarizer extends Serializable

Transform numerical features to binary features.
Feature values greater than threshold are binarized to 1.0; values equal to or less than threshold are binarized to 0.0.
Missing values are binarized to 0.0.
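
The rule above can be sketched as a small stand-alone function. The name binarize and the default threshold of 0.0 are assumptions made for illustration; this is not the library's implementation.

```scala
object BinarizerSketch {
  // Values strictly greater than the threshold map to 1.0.
  // Values equal to or below the threshold, and missing values, map to 0.0.
  // The default threshold of 0.0 is an assumption for this sketch.
  def binarize(value: Option[Double], threshold: Double = 0.0): Double =
    value match {
      case Some(x) if x > threshold => 1.0
      case _                        => 0.0
    }
}
```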
object Bucketizer extends Serializable

Transform a column of continuous features to n columns of feature buckets.
With n+1 splits, there are n buckets. A bucket defined by splits x, y holds values in the range [x, y), except the last bucket, which also includes y. Splits must be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Double.NegativeInfinity and Double.PositiveInfinity must be provided explicitly as the outermost splits to cover all double values; otherwise a FeatureRejection.OutOfBound rejection is reported for values outside the specified splits. If the lower and upper bounds of the target column are unknown, add both infinities to prevent such rejections.
Missing values are transformed to zero vectors.
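
The bucket boundaries can be illustrated with a short sketch. The names bucketIndex and encode are hypothetical, and emitting a zero vector for out-of-bound values is an assumption of this sketch; the text above only states that a rejection is reported for them.

```scala
object BucketizerSketch {
  // Zero-based bucket index for x, or None when x lies outside the splits.
  // Buckets are [s(i), s(i+1)), except the last one, which also includes s(n).
  def bucketIndex(splits: Array[Double], x: Double): Option[Int] = {
    require(
      splits.length >= 2 && splits.sliding(2).forall(p => p(0) < p(1)),
      "splits must be strictly increasing"
    )
    if (x.isNaN || x < splits.head || x > splits.last) None
    else if (x == splits.last) Some(splits.length - 2)
    else Some(splits.lastIndexWhere(_ <= x))
  }

  // One column per bucket. Missing values give a zero vector; out-of-bound
  // values also give a zero vector here (an assumption of this sketch).
  def encode(splits: Array[Double], value: Option[Double]): Array[Double] = {
    val out = Array.fill(splits.length - 1)(0.0)
    value.flatMap(bucketIndex(splits, _)).foreach(i => out(i) = 1.0)
    out
  }
}
```

With Array(0.0, 1.0, 2.0), the value 2.0 falls in the last bucket while 2.5 is out of bounds; adding the two infinities as outer splits removes the out-of-bound case.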

object HashNHotEncoder extends Serializable

Transform a collection of categorical features to binary columns, with at most N one-values. Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.

Missing values are transformed to [0.0, 0.0, ...].

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.

Approximate relationship between scaling factor and collision rate, measured on a corpus of 466544 English words:

```
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
```
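
Hashing labels into a fixed number of buckets can be sketched with the MurmurHash3 implementation in the Scala standard library. The names bucket and encode are hypothetical, and the exact way the library maps a hash to a bucket is not stated above, so the floorMod step is an assumption.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotSketch {
  // Map a label to a bucket in [0, hashBucketSize).
  // floorMod keeps the index non-negative for negative hash values.
  def bucket(label: String, hashBucketSize: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(label), hashBucketSize)

  // One column per bucket; every present label sets its bucket to 1.0.
  // Missing values give [0.0, 0.0, ...]. Colliding labels share a column.
  def encode(labels: Option[Seq[String]], hashBucketSize: Int): Array[Double] = {
    val out = Array.fill(hashBucketSize)(0.0)
    labels.getOrElse(Nil).foreach(l => out(bucket(l, hashBucketSize)) = 1.0)
    out
  }
}
```

No dictionary of seen labels is kept, which is where the CPU and memory savings come from; the cost is that distinct labels may collide, and a larger bucket count makes that less likely.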

object HashNHotWeightedEncoder extends Serializable

Transform a collection of weighted categorical features to columns of weight sums, with at most N values. Similar to NHotWeightedEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.
Weights of the same label within a row are summed, whereas the plain NHotEncoder emits 1.0 for each label present.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.
Approximate relationship between scaling factor and collision rate, measured on a corpus of 466544 English words:
```
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
```
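
The weighted variant can be sketched in the same way. The local WeightedLabel mirrors the case class listed under Type Members; the encode function, the floorMod bucket mapping, and the zero vector for missing values are assumptions of this sketch.

```scala
import scala.util.hashing.MurmurHash3

object HashNHotWeightedSketch {
  // Local stand-in for the WeightedLabel case class of this package.
  final case class WeightedLabel(name: String, value: Double)

  // One column per bucket; weights of labels that land in the same bucket
  // (the same label repeated, or colliding labels) are summed.
  def encode(labels: Option[Seq[WeightedLabel]], hashBucketSize: Int): Array[Double] = {
    val out = Array.fill(hashBucketSize)(0.0)
    labels.getOrElse(Nil).foreach { wl =>
      val i = Math.floorMod(MurmurHash3.stringHash(wl.name), hashBucketSize)
      out(i) += wl.value
    }
    out
  }
}
```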

object HashOneHotEncoder extends Serializable

Transform a collection of categorical features to binary columns, with at most a single one-value. Similar to OneHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU and memory overhead.

Missing values are transformed to [0.0, 0.0, ...].

If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce the number of collisions.

Approximate relationship between scaling factor and collision rate, measured on a corpus of 466544 English words:

```
sizeScalingFactor     % Collisions
-----------------     ------------
                2     17.9934%
                4     10.5686%
                8      5.7236%
               16      3.0019%
               32      1.5313%
               64      0.7864%
              128      0.3920%
              256      0.1998%
              512      0.0975%
             1024      0.0478%
             2048      0.0236%
             4096      0.0071%
```

object Identity extends Serializable

Transform features by passing them through.
Missing values are transformed to 0.0.
object MaxAbsScaler extends Serializable

Transform features by rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.
Missing values are transformed to 0.0.
When using an aggregated feature summary from a previous session, out-of-bound values are truncated to -1.0 or 1.0 and FeatureRejection.OutOfBound rejections are reported.
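
A minimal sketch of the scaling and truncation rule, assuming the maximum absolute value is positive. The names maxAbs and scale are hypothetical; rejection reporting is omitted.

```scala
object MaxAbsScalerSketch {
  // Summary step: maximum absolute value over the observed data.
  def maxAbs(xs: Seq[Double]): Double = xs.map(x => math.abs(x)).max

  // Divide by the summary and truncate to [-1.0, 1.0], which only matters
  // when the summary comes from a previous session. Missing values map to 0.0.
  // Assumes maxAbsValue > 0.
  def scale(value: Option[Double], maxAbsValue: Double): Double =
    value match {
      case None    => 0.0
      case Some(x) => math.max(-1.0, math.min(1.0, x / maxAbsValue))
    }
}
```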
object MinMaxScaler extends Serializable

Transform features by rescaling each feature to a specific range [min, max] (default [0, 1]).
Missing values are transformed to min.
When using an aggregated feature summary from a previous session, out-of-bound values are truncated to min or max and FeatureRejection.OutOfBound rejections are reported.
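
The rescaling can be sketched as follows. The parameter names dataMin and dataMax (the observed bounds from the summary) are hypothetical, the sketch assumes dataMax > dataMin, and rejection reporting is omitted.

```scala
object MinMaxScalerSketch {
  // Linearly map [dataMin, dataMax] onto [min, max].
  // Out-of-bound inputs are clamped first, so they end up at min or max.
  // Missing values map to min. Assumes dataMax > dataMin.
  def scale(
      value: Option[Double],
      dataMin: Double,
      dataMax: Double,
      min: Double = 0.0,
      max: Double = 1.0
  ): Double =
    value match {
      case None => min
      case Some(x) =>
        val clamped = math.max(dataMin, math.min(dataMax, x))
        (clamped - dataMin) / (dataMax - dataMin) * (max - min) + min
    }
}
```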
object NHotEncoder extends Serializable

Transform a collection of categorical features to binary columns, with at most N one-values.
Missing values are transformed to [0.0, 0.0, ...].
When using an aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
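
The handling of seen and unseen labels can be illustrated with a dictionary-based sketch. The encode function is hypothetical, the sorted column order is an assumption, and the returned list of unseen labels merely stands in for the reported rejections.

```scala
object NHotSketch {
  // Returns the encoded row and the labels that are not in the dictionary
  // of previously seen labels. One column per distinct seen label,
  // in sorted order (an assumption of this sketch).
  def encode(labels: Option[Seq[String]], seen: Seq[String]): (Array[Double], Seq[String]) = {
    val index = seen.distinct.sorted.zipWithIndex.toMap
    val out = Array.fill(index.size)(0.0)
    val unseen = Seq.newBuilder[String]
    labels.getOrElse(Nil).foreach { l =>
      index.get(l) match {
        case Some(i) => out(i) = 1.0 // known label: set its column
        case None    => unseen += l  // unknown label: ignore, but record it
      }
    }
    (out, unseen.result())
  }
}
```

Restricting the input to at most one label per row gives the one-hot case.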
object NHotWeightedEncoder extends Serializable

Transform a collection of weighted categorical features to columns of weight sums, with at most N values.
Weights of the same label within a row are summed, whereas the plain NHotEncoder emits 1.0 for each label present.
Missing values are transformed to [0.0, 0.0, ...].
When using an aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
object Normalizer extends Serializable

Transform vector features by normalizing each vector to have unit norm. Parameter p specifies the p-norm used for normalization (default 2).
Missing values are transformed to zero vectors.
When using an aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
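
A minimal sketch of p-norm normalization. The normalize function and its dim parameter (the expected dimension) are hypothetical, and leaving an all-zero vector unchanged is an assumption of this sketch.

```scala
object NormalizerSketch {
  // Divide each component by the p-norm: (sum of |x_i|^p)^(1/p).
  // Missing values and vectors of the wrong dimension give a zero vector.
  // An all-zero input is returned unchanged to avoid dividing by zero.
  def normalize(value: Option[Array[Double]], dim: Int, p: Double = 2.0): Array[Double] =
    value match {
      case Some(v) if v.length == dim =>
        val norm = math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
        if (norm == 0.0) v.clone() else v.map(_ / norm)
      case _ => Array.fill(dim)(0.0)
    }
}
```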
object OneHotEncoder extends Serializable

Transform a collection of categorical features to binary columns, with at most a single one-value.
Missing values are transformed to [0.0, 0.0, ...].
When using an aggregated feature summary from a previous session, unseen labels are ignored and FeatureRejection.Unseen rejections are reported.
object PolynomialExpansion extends Serializable

Transform vector features by expanding them into a polynomial space, which is formulated by an n-degree combination of original dimensions.
Missing values are transformed to zero vectors.
When using an aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
object QuantileDiscretizer extends Serializable

Transform a column of continuous features to n columns of binned categorical features. The number of bins is set by the numBuckets parameter.
The bin ranges are chosen using Algebird's QTree approximate data structure. The precision of the approximation can be controlled with the k parameter.
Missing values are transformed to zero vectors.
When using an aggregated feature summary from a previous session, values outside the previously seen [min, max] range are binned into the first or last bucket and FeatureRejection.OutOfBound rejections are reported.
object StandardScaler extends Serializable

Transform features by normalizing each feature to have unit standard deviation and/or zero mean. When withStd is true, it scales the data to unit standard deviation. When withMean is true, it centers the data with mean before scaling.
Missing values are transformed to 0.0 if withMean is true, or to the population mean otherwise.
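
The interaction of withStd, withMean and missing values can be sketched as below. Several choices here are assumptions rather than documented behavior: the use of the population standard deviation, and scaling around the mean when withStd is true but withMean is false (chosen so that the mean stays fixed, in line with the missing-value rule above). The names Stats, fit and scale are hypothetical.

```scala
object StandardScalerSketch {
  final case class Stats(mean: Double, stddev: Double)

  // Summary step: mean and population standard deviation (an assumption).
  def fit(xs: Seq[Double]): Stats = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    Stats(mean, math.sqrt(variance))
  }

  // Missing values: 0.0 when centering, the mean otherwise.
  // With withStd only, values are scaled around the mean (an assumption).
  def scale(
      value: Option[Double],
      s: Stats,
      withStd: Boolean = true,
      withMean: Boolean = false
  ): Double =
    value match {
      case None => if (withMean) 0.0 else s.mean
      case Some(x) =>
        (withStd, withMean) match {
          case (true, true)   => (x - s.mean) / s.stddev
          case (true, false)  => (x - s.mean) / s.stddev + s.mean
          case (false, true)  => x - s.mean
          case (false, false) => x
        }
    }
}
```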
object VectorIdentity extends Serializable

Transform fixed-length vectors by passing them through.
Similar to Identity but for a sequence of doubles.
Missing values are transformed to zero vectors.
When using an aggregated feature summary from a previous session, vectors of different dimensions are transformed to zero vectors and FeatureRejection.WrongDimension rejections are reported.
object VonMisesEvaluator extends Serializable

Transform a column of continuous features that represent the mean of a von Mises distribution to n columns of continuous features, where n is the number of points at which the von Mises distribution is evaluated. The von Mises pdf is given by
f(x | mu, kappa, scale) = exp(kappa * cos(scale * (x - mu))) / (2 * pi * I0(kappa))
where I0 is the modified Bessel function of the first kind of order 0. The pdf is only valid for x, mu in the interval [0, 2*pi/scale].
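
The density above can be evaluated with a short sketch. The Scala standard library has no Bessel functions, so I0 is computed here from its power series, which is adequate for moderate kappa only. The names besselI0, pdf and evaluate are hypothetical and this is not the library's implementation.

```scala
object VonMisesSketch {
  // Modified Bessel function of the first kind, order 0, via the series
  // I0(k) = sum over m >= 0 of (k^2 / 4)^m / (m!)^2.
  // The iteration cap makes this suitable for moderate kappa only.
  def besselI0(kappa: Double): Double = {
    val q = kappa * kappa / 4.0
    var term = 1.0
    var sum = 1.0
    var m = 1
    while (term > 1e-17 * sum && m < 500) {
      term *= q / (m.toDouble * m)
      sum += term
      m += 1
    }
    sum
  }

  // f(x | mu, kappa, scale) as given in the text.
  def pdf(x: Double, mu: Double, kappa: Double, scale: Double): Double =
    math.exp(kappa * math.cos(scale * (x - mu))) / (2 * math.Pi * besselI0(kappa))

  // One output column per evaluation point.
  def evaluate(mu: Double, kappa: Double, scale: Double, points: Seq[Double]): Seq[Double] =
    points.map(x => pdf(x, mu, kappa, scale))
}
```

The density peaks at x = mu and becomes flatter as kappa approaches 0, where it reduces to the constant 1 / (2 * pi).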
