org.apache.spark.ml.automl.feature
Main fit method that will build a BinaryEncoder model from the dataset and the configured input and output columns specified in the setters. The primary principle at work here is dimensionality reduction for the encoding of extremely high-cardinality StringIndexed columns. OneHotEncoding works extremely well for this purpose, but its side effect of generating an extremely large number of columns leads to increased memory pressure. This package allows for a lossy reduction of this space by distilling the information into a binary string encoding whose length is determined by the binary representation of the maximum nominal value.
The dataset (or DataFrame) used in training the model
BinaryEncoderModel - a serializable artifact that has the output schema and encoding embedded within it.
e.g. if the cardinality of a nominal column is 113, the binary representation of that is 1110001. When using OHE, this would result in 113 (or 114 if allowing invalids) binary positions within a sparse vector, creating 113 or 114 columns in the dataset. However, using BinaryEncoder, we are left with 7 (or 8, if allowing invalids) dense vector positions to capture the same amount of information.
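The arithmetic in the example above can be sketched in a few lines of plain Scala. This is an illustration of how the encoding width could be derived, not the package's actual implementation; the object and method names are hypothetical.

```scala
// Hypothetical sketch: the number of dense-vector positions needed by the
// BinaryEncoder is the number of binary digits required to represent the
// maximum nominal index (plus one when invalids are allowed).
object EncodingWidthSketch {
  // e.g. 113 -> "1110001" -> 7 positions
  def width(cardinality: Int): Int =
    java.lang.Integer.toBinaryString(cardinality).length

  // With invalid handling enabled, one extra position is reserved.
  def widthWithInvalid(cardinality: Int): Int = width(cardinality) + 1
}
```

Compare this with OHE, where the same cardinality of 113 would produce 113 (or 114) sparse-vector positions.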
0.5.3
Due to the nature of this encoding and how the majority of models learn, this is an information-lossy encoding. However, considering that high-cardinality non-numeric nominal fields are frequently discarded due to the resulting explosion in dataset size, this provides the ability to utilize high-cardinality fields that otherwise could not be included.
Configuration of the Parameter for handling invalid entries in a previously modeled feature column.
Setter for supplying an optional 'keep' or 'error' (Default: 'error') for unseen values that arrive in a pre-trained model. With the 'keep' setting, an additional vector position is added to the output column to ensure that no collisions can exist with real data, and the values throughout each of the Array[Double] positions in the DenseVector output will all be set to '1'.
String: either 'keep' or 'error' (Default: 'error')
0.5.3
SparkException
if the configuration value supplied is not either 'keep' or 'error'
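The 'keep' behavior described above can be sketched in self-contained Scala: a seen index is written as its binary digits across the fixed vector width, while an unseen value maps to the reserved all-ones pattern. The names here are illustrative assumptions, not the package's actual API, and the real implementation operates on Spark DenseVector columns rather than bare arrays.

```scala
// Hypothetical sketch of the 'keep' handling for unseen values:
// a known index becomes its left-padded binary digits; an unseen
// value (None) becomes the all-ones pattern in the extra position space.
object KeepEncodingSketch {
  def encode(index: Option[Int], positions: Int): Array[Double] = index match {
    case Some(i) =>
      // Left-pad the binary string of the index to the fixed vector width.
      val bits = Integer.toBinaryString(i).reverse.padTo(positions, '0').reverse
      bits.map(c => (c - '0').toDouble).toArray
    case None =>
      // Unseen value under 'keep': every position set to 1.0.
      Array.fill(positions)(1.0)
  }
}
```

Under the 'error' setting, the `None` branch would instead throw a SparkException rather than emit the all-ones pattern.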
Setter for supplying the array of input columns to be encoded with the BinaryEncoder type.
Array of column names
0.5.3
Setter for supplying the array of output columns that are the result of running a .transform from a trained model on an appropriate dataset of compatible schema.
Array of column names that will be generated through a .transform
0.5.3
Method for validating the resultant schema from the application of building and transforming using this encoder package. The purpose of validation is to ensure that the supplied input columns are of the correct binary or nominal (ordinal numeric) type and that the output columns will contain the correct number of columns based on the configuration set.
The schema of the dataset supplied for training of the model or used in transforming using the model
Boolean flag for whether to allow an additional binary encoding value for any values that were unknown at the time of model training; such values will be converted to a 'max binary value' of the encoding length + 1, i.e. all n positions set to '1'.
StructType that represents the transformed schema with additional output columns appended to the dataset structure.
0.5.3
UnsupportedOperationException
if the configured input cols and output cols do not match one another in length.