Pearson Defaults
Scaler Defaults
Global Defaults
Provides a human-readable report, written to stdout and to the logs, that shows the configuration for a model run with the key -> value relationships rendered as JSON.
AnyRef -> a defined case class
String in the form of pretty print syntax
Helper method for generating the Inference Config object that records the data configuration steps needed to reproduce the modeling for subsequent inference runs.
The full main Config that is utilized for the execution of the run.
The fields that are returned from type casting and validation (may contain artificial suffixes for StringIndexer (_si) and OneHotEncoder (_oh)). These will be removed before recording.
An instance of InferenceDataConfig
0.4.0
Single-pass method for recording all switch settings to the InferenceConfig Object.
MainConfig used for starting the training AutoML run
Setter - for overriding the cardinality exception threshold. [WARNING] Increasing this value on a sufficiently large data set could incur excessive memory and CPU pressure on the cluster at runtime.
Int: the limit above which an exception will be thrown for a classification problem wherein the label distinct count is too large to successfully generate synthetic data.
0.5.1
Default: 20
Setter for providing a map of [Column Name -> String Fill Value] for manual by-column overrides. Any non-specified fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values in non-numeric columns.
Map[String, String]: Column Name as String -> Fill Value as String
0.5.2
If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown.
If naFillMode is set to one of the Map Fill modes, this setter or the numeric na fill map MUST be set.
Setter for providing a 'blanket override' value (fill all found categorical columns' missing values with this specified value).
String: A value to fill all categorical na values in the DataFrame with.
0.5.2
Setter for defining the secondary stopping criteria for continuous training mode (the number of consistently non-improving runs after which the learning algorithm terminates due to diminishing returns).
Negative Integer. An improvement over the prior best resets the counter, and each subsequent non-improvement decrements a mutable counter. If the counter hits the limit specified in value, the continuous mode algorithm will stop.
0.6.0
IllegalArgumentException
if the value is positive.
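The counter behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the toolkit's implementation: the object name, the method name, and the assumption that larger scores are better are all hypothetical.

```scala
object EarlyStoppingSketch {

  /** Returns how many runs are consumed before the stall limit is hit,
    * or scores.length if the limit is never reached.
    * Assumption of this sketch: a larger score is an improvement. */
  def runsUntilStop(scores: Seq[Double], stallLimit: Int): Int = {
    // The documented setter rejects positive values; this sketch also rejects zero.
    require(stallLimit < 0, s"stallLimit must be negative, got $stallLimit")
    var best = Double.NegativeInfinity
    var counter = 0
    var runs = 0
    val it = scores.iterator
    while (it.hasNext && counter > stallLimit) {
      val score = it.next()
      runs += 1
      if (score > best) {
        best = score
        counter = 0 // an improvement resets the counter
      } else {
        counter -= 1 // a non-improvement decrements it toward the limit
      }
    }
    runs
  }
}
```

With a limit of -3, three consecutive non-improving runs end the loop; any improvement in between starts the count over.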
Setter for defining the number of concurrent threads allocated to performing asynchronous data prep tasks within the feature engineering aspect of this application.
Int: A value that must be greater than zero.
0.6.0
IllegalArgumentException
if a value less than or equal to zero is supplied.
This value has an upper limit, depending on driver size, that will restrict the efficacy of the asynchronous tasks within the pool. Setting this too high may cause cluster instability.
Setter for providing a path to write the kfold train/test splits to as Delta data sets (useful for extremely large data sets or situations where using local disk storage might be prohibitively expensive).
String path to a dbfs location for creating the temporary (or persisted)
0.7.1
Setter for whether or not to delete the written train/test splits for the run in Delta. Defaults to true, which means that the job will delete the data on Object Store to clean up after itself once the run is completed, if the splitCachingStrategy is set to 'delta'.
Boolean - true => delete, false => leave on Object Store
0.7.1
Setter for determining the behavior of continuous feature columns. In order to calculate Entropy for a continuous variable, the distribution must be converted to nominal values for estimation of per-split information gain. This setting defines how many nominal categorical values to create out of a continuously distributed feature in order to calculate Entropy.
Int -> must be greater than 1
0.6.2
IllegalArgumentException
if the value specified is <= 1
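To illustrate why a continuous feature has to be converted to nominal values before Entropy can be calculated, here is a minimal sketch. It assumes equal-width bucketing, which may differ from the strategy the toolkit actually uses, and all names in it are hypothetical.

```scala
object ContinuousEntropySketch {

  /** Assigns each value to one of `bins` equal-width buckets (ids 0 until bins).
    * Equal-width bucketing is an assumption made for this sketch. */
  def discretize(values: Seq[Double], bins: Int): Seq[Int] = {
    require(bins > 1, s"bins must be greater than 1, got $bins")
    require(values.nonEmpty, "values must not be empty")
    val lo = values.reduce((a, b) => math.min(a, b))
    val hi = values.reduce((a, b) => math.max(a, b))
    val width = (hi - lo) / bins
    values.map { v =>
      if (width == 0.0) 0 // a constant column collapses into a single bucket
      else math.min(((v - lo) / width).toInt, bins - 1) // clamp the maximum into the last bucket
    }
  }

  /** Shannon entropy (base 2) of a sequence of nominal values. */
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * (math.log(p) / math.log(2))
    }.sum
  }
}
```

A larger bucket count gives a finer approximation of the distribution, at the cost of more distinct values to evaluate per split.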
Setter for configuring the concurrent count for scoring of feature interaction candidates. Due to the nature of these operations, the configuration here may need to be set differently to that of the modeling and general feature engineering phases of the toolkit. This is highly dependent on the row count of the data set being submitted.
Int -> must be greater than 0
0.6.2
IllegalArgumentException
if the value is < 1
Setter for determining the mode of operation for inclusion of interacted features. Modes are:
String -> one of: 'all', 'optimistic', or 'strict'
0.6.2
IllegalArgumentException
if the specified value submitted is not permitted
Setter for establishing the minimum acceptable InformationGain or Variance allowed for an interaction candidate based on comparison to the scores of its parents.
Double in range of -inf -> inf
0.6.2
Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through mutation for each generation other than the initial and post-modeling optimization phases. The larger this value (default: 10), the more potential space can be searched. There is not a large performance hit to this, and as such, values in excess of 100 are viable.
Int - a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential candidates.
0.6.0
IllegalArgumentException
if the value is not greater than zero.
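The relationship between the factor and the candidate count can be sketched as below. The pool-size arithmetic follows the description above; the `selectTop` step, which keeps the best-predicted candidates, is an assumption about how such a pool would typically be consumed, and every name here is hypothetical.

```scala
object CandidatePoolSketch {

  /** Number of potential candidates generated through mutation for one generation. */
  def poolSize(factor: Int, numberOfMutationsPerGeneration: Int): Int = {
    require(factor > 0, s"factor must be greater than zero, got $factor")
    factor * numberOfMutationsPerGeneration
  }

  /** Keeps the `keep` candidates with the highest predicted score.
    * `predict` stands in for whatever surrogate model ranks the candidates. */
  def selectTop[A](candidates: Seq[A], keep: Int)(predict: A => Double): Seq[A] =
    candidates
      .map(c => (c, predict(c)))
      .sortWith((a, b) => a._2 > b._2)
      .take(keep)
      .map(_._1)
}
```

Because only the scoring of the pool grows with the factor, and not the number of models trained, a larger factor widens the search at modest cost.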
Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates
String - one of "XGBoost", "LinearRegression" or "RandomForest"
0.6.0
IllegalArgumentException
if the value is not supported
Setter for specifying the number of K-Groups to generate in the KMeans model
Int: number of k groups to generate
this
Setter for which distance measurement to use to calculate the nearness of vectors to a centroid
String: Options -> "euclidean" or "cosine" Default: "euclidean"
this
IllegalArgumentException()
if an invalid value is entered
Setter for specifying the maximum number of iterations for the KMeans model to go through to converge
Int: Maximum limit on iterations
this
Setter for the internal KMeans column for cluster membership attribution
String: column name for internal algorithm column for group membership
this
Setter for a KMeans seed for the clustering algorithm
Long: Seed value
this
Setter for the tolerance for KMeans (must be >0)
The tolerance value setting for KMeans
this
IllegalArgumentException()
if a value less than 0 is entered
reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans for further details.
Setter for Configuring the number of Hash Tables to use for MinHashLSH
Int: Count of hash tables to use
this
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH for more information
Setter for the internal LSH output hash information column
String: column name for the internal MinHashLSH Model transformation value
this
Setter for Configuring the Seed value for the LSH MinHash model
Long: A Seed value
0.5.1
Setter - for determining the label balance approach mode.
String: one of: 'match', 'percentage' or 'target'
0.5.1
UnsupportedOperationException()
if the provided mode is not supported.
Default: "percentage"
Available modes:
'match': Will match all smaller class counts to largest class count. [WARNING] - May significantly increase memory pressure!
'percentage': Will adjust smaller classes to a percentage value of the largest class count.
'target': Will increase smaller class counts to a fixed numeric target of rows.
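The three modes above can be compared with a small sketch that computes the per-class row counts each mode would aim for. This is an interpretation of the descriptions, not the toolkit's code: the object, method, and parameter names are hypothetical, and rounding the percentage target to the nearest row is an assumption.

```scala
object LabelBalanceSketch {

  /** Returns the row count each class would be raised to under the given mode.
    * Classes already at or above the computed floor are left unchanged. */
  def targetCounts[K](
      counts: Map[K, Long],
      mode: String,
      percentage: Double = 0.2,
      targetRows: Long = 0L
  ): Map[K, Long] = {
    require(counts.nonEmpty, "counts must not be empty")
    val largest = counts.values.max
    val floor: Long = mode match {
      case "match"      => largest
      case "percentage" => math.round(largest * percentage)
      case "target"     => targetRows
      case other =>
        throw new UnsupportedOperationException(s"Label balance mode '$other' is not supported.")
    }
    counts.map { case (label, count) => label -> math.max(count, floor) }
  }
}
```

For class counts of 1000, 300, and 50, 'match' raises every class to 1000, 'percentage' at 0.2 raises only the smallest class to 200, and 'target' at 400 raises the two smaller classes to 400.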
Setter for minimum threshold for vector indexes to mutate within the feature vector.
The minimum (or fixed) number of indexes to mutate.
this
In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will be mutated'
Setter for the Mutation Mode of the feature vector individual values
String: the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
"weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors
"random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors
"ratio" - uses a ratio between the values of the centroid vector and the mutation vector
Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode
Double: value between 0 and 1 for mutation magnitude adjustment.
this
IllegalArgumentException()
if the value specified is outside of the range (0, 1)
The higher this value, the closer the synthetic row data will be to the centroid vector rather than to the candidate mutation vector.
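One way to picture the effect of the mutation magnitude is as a weighted average of the two vectors. The sketch below is an assumption about the general idea only: the exact formula used by the 'weighted' and 'ratio' modes is not given here, the inclusive bounds check is a choice made for this sketch, and all names are hypothetical.

```scala
object MutationMagnitudeSketch {

  /** Blends a centroid vector and a candidate vector element by element.
    * A magnitude near 1 keeps the result close to the centroid;
    * a magnitude near 0 keeps it close to the candidate. */
  def weightedMutation(
      centroid: Array[Double],
      candidate: Array[Double],
      magnitude: Double
  ): Array[Double] = {
    require(magnitude >= 0.0 && magnitude <= 1.0, s"magnitude must be between 0 and 1, got $magnitude")
    require(centroid.length == candidate.length, "vectors must have the same length")
    centroid.zip(candidate).map { case (c, m) => magnitude * c + (1.0 - magnitude) * m }
  }
}
```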
Setter for defining the precision for calculating the model type as per the label column
Double: Precision accuracy for approximate distinct calculation.
0.5.2
java.lang.AssertionError
If the value is outside of the allowable range of {0, 1}
Setting this value to zero (0) for a large regression problem will incur a long processing time and an expensive shuffle.
Mode for na fill
Available modes:
auto: Stats-based na fill for fields. Usage of .setNumericFillStat and .setCharacterFillStat will inform the type of statistics that will be used to fill.
mapFill: Custom by-column overrides to 'blanket fill' na values on a per-column basis. The categorical (string) fields are set via .setCategoricalNAFillMap while the numeric fields are set via .setNumericNAFillMap.
blanketFillAll: Fills all fields based on the values specified by .setCharacterNABlanketFillValue and .setNumericNABlanketFillValue. All NA's for the appropriate types will be filled in accordingly throughout all columns.
blanketFillCharOnly: Will use statistics to fill in numeric fields, but will replace all categorical character fields' na values with a blanket fill value.
blanketFillNumOnly: Will use statistics to fill in character fields, but will replace all numeric fields' na values with a blanket value.
String: Mode for NA Fill
0.5.2
IllegalArgumentException
if the mode specified is not supported.
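How the fill value for a single categorical column might be resolved under each mode can be sketched as follows. The mode names come from the list above; the object, the method, and its parameters are hypothetical, and `statFill` stands in for whatever statistics-based value the "auto" paradigm would compute.

```scala
object NaFillModeSketch {

  val allowedModes: Set[String] =
    Set("auto", "mapFill", "blanketFillAll", "blanketFillCharOnly", "blanketFillNumOnly")

  /** Picks the fill value for one categorical column.
    * Columns missing from the map fall back to the statistics-based fill. */
  def resolveCategoricalFill(
      mode: String,
      column: String,
      fillMap: Map[String, String],
      blanketValue: String,
      statFill: String
  ): String = {
    require(allowedModes.contains(mode), s"naFillMode '$mode' is not supported.")
    mode match {
      case "mapFill"                                => fillMap.getOrElse(column, statFill)
      case "blanketFillAll" | "blanketFillCharOnly" => blanketValue
      case _                                        => statFill // "auto" and "blanketFillNumOnly"
    }
  }
}
```

The numeric side would follow the same shape, with "blanketFillNumOnly" taking the blanket branch instead of "blanketFillCharOnly".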
Setter for providing a 'blanket override' value (fill all found numeric columns' missing values with this specified value)
Double: A value to fill all numeric na values in the DataFrame with.
0.5.2
Setter for providing a map of [Column Name -> AnyVal Fill Value] (must be numeric). Any non-specified fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values in numeric columns.
Map[String, AnyVal]: Column Name as String -> Fill Numeric Type Value
0.5.2
If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown.
If naFillMode is set to one of the Map Fill modes, this setter or the categorical na fill map MUST be set.
Setter - for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode()
Double: A fractional double in the range of 0.0 to 1.0.
0.5.1
UnsupportedOperationException()
if the provided value is outside of the range of 0.0 -> 1.0
Default: 0.2
Setting this value to 1.0 is equivalent to setting the label balance mode to 'match'
Setter - for specifying the target row count to generate for 'target' mode in setLabelBalanceMode()
Int: The desired final number of rows per minority class label
0.5.1
[WARNING] Setting this value to too high of a number will greatly increase runtime and memory pressure.
Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data
Int: Number of vectors to find nearest each centroid within the class
this
the higher the value set here, the higher the variance in synthetic data generation
Setter for determining the split caching strategy (either persist to disk for each kfold split or backing to Delta)
Configuration string either 'persist' or 'delta'
0.7.1
Setter - for setting the name of the Synthetic column
String: A column name that is uniquely not part of the main DataFrame
0.5.1
Setter for the Vector Mutation Method
String - the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
"fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes.
"random" - will use this number as a lower bound on a random selection of indexes between this and the vector length.
"all" - will mutate all of the vectors.
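The difference between the three methods comes down to how many vector positions are chosen for mutation. The sketch below is an interpretation of the option descriptions; the object and method names are hypothetical, and treating "all" as every index of the feature vector is an assumption.

```scala
import scala.util.Random

object VectorMutationMethodSketch {

  /** Returns the sorted vector positions that would be mutated. */
  def indexesToMutate(
      method: String,
      vectorLength: Int,
      minimumVectorCountToMutate: Int,
      rng: Random
  ): List[Int] = {
    require(
      minimumVectorCountToMutate >= 0 && minimumVectorCountToMutate <= vectorLength,
      "minimumVectorCountToMutate must be between 0 and the vector length"
    )
    val all = (0 until vectorLength).toList
    method match {
      case "fixed" =>
        // exactly the configured number of randomly chosen positions
        rng.shuffle(all).take(minimumVectorCountToMutate).sorted
      case "random" =>
        // a random count between the configured minimum and the vector length
        val extra = rng.nextInt(vectorLength - minimumVectorCountToMutate + 1)
        rng.shuffle(all).take(minimumVectorCountToMutate + extra).sorted
      case "all" => all
      case other =>
        throw new IllegalArgumentException(s"Vector mutation method '$other' is not supported.")
    }
  }
}
```

Passing a seeded `Random` makes the selection repeatable, which is convenient when comparing the modes side by side.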