com.databricks.labs.automl.executor.config
The model family to run (e.g. 'RandomForest'). Allowable options: "Trees", "GBT", "RandomForest", "LinearRegression", "LogisticRegression", "XGBoost", "MLPC", "SVM"
The modeling type to run (e.g. 'classifier'). Allowable options: "classifier" or "regressor"
Configuration object from GenericConfigGenerator
Static restrictions
Boolean switch for setting Auto Stopping Off
Default: Off
Boolean switch for setting Auto Stopping On
Early stopping will invalidate the progress measurement system (due to non-determinism). Early termination will not occur immediately: Futures already committed will continue to run, but no new actions will be enqueued once a stopping criterion is met.
Default: Off
Setter switch for turning the cardinality check off.
0.5.2
Default: true
Not recommended for exploratory data set features.
Setter switch for turning the cardinality check on. This switch sets whether a cardinality check is performed on StringIndexed columns.
0.5.2
Default: true
Boolean switch for turning Covariance filtering off
Default: Off
Boolean switch for turning Covariance filtering on
Default: Off
Boolean switch for setting Data Prep Caching Off
Depending on the size and partitioning of the data set, caching may or may not improve performance.
Default: On
Boolean switch for setting Data Prep Caching On
Depending on the size and partitioning of the data set, caching may or may not improve performance.
Default: On
Boolean switch for turning featureInteraction off
0.6.2
Boolean switch for setting featureInteraction on. This setting will, in conjunction with the featureInteraction elements in the config, perform pair-wise product interactions of all elements of the feature vector, retaining either all or some of those interactions for inclusion in the feature vector. For classification tasks, InformationGain is used as the inclusion metric (for modes other than 'all'); for regression tasks, Variance is used.
0.6.2
Configuration object from GenericConfigGenerator
Getters
Boolean switch for turning off naFill actions
HIGHLY RECOMMENDED TO NOT TURN OFF
Default: On
Boolean switch for turning on naFill actions
HIGHLY RECOMMENDED TO LEAVE ON.
Default: On
Boolean switch for turning off One Hot Encoding
Default: Off for Tree based algorithms, On for all others.
Boolean switch for turning One Hot Encoding of string and character features on
Turning One Hot Encoding on for a tree-based algorithm (XGBoost, RandomForest, Trees, GBT) is not recommended. Introducing synthetic dummy variables in a tree algorithm will force the creation of sparse tree splits.
Default: Off for Tree based algorithms, On for all others.
See https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769 for a full explanation.
Boolean switch for turning outlier filtering off
Default: Off
Boolean switch for turning outlier filtering on
Default: Off
Boolean switch for turning Pearson filtering off
Default: Off
Boolean switch for turning Pearson filtering on
Default: Off
Boolean switch for turning scaling Off
Default: Off for Tree based algorithms, On for all others.
Boolean switch for turning scaling On
For Tree based algorithms (RandomForest, XGBoost, GBT, Trees), scaling is not necessary and can adversely affect model performance.
Default: Off for Tree based algorithms, On for all others.
Boolean switch for setting the state of autoStoppingFlag
Boolean
Helper method for copying a pre-defined InstanceConfig to a new instance.
InstanceConfig object
Setter
Covariance Cutoff for specifying the feature-to-feature correlation statistic upper cutoff boundary
Double: Threshold Cutoff Value
For feature columns A, B, and C, if A<->B is 0.02, A<->C is 0.1, and B<->C is 0.85, with a cutoff value of 0.8, column C would be removed from the feature vector for having a high correlation statistic.
IllegalArgumentException
if the value is <= -1.0
WARNING: This setting is not recommended for production use cases and is only potentially useful for data exploration and experimentation.
Default: 0.99
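The A/B/C example above can be sketched in pure Scala. This is an illustrative sketch of the filtering idea, not the toolkit's internal implementation; the object and method names are made up for the example.

```scala
object CovarianceFilterSketch {

  // Pearson correlation coefficient between two equal-length numeric series
  def pearson(a: Seq[Double], b: Seq[Double]): Double = {
    val n = a.length
    val ma = a.sum / n
    val mb = b.sum / n
    val cov = a.zip(b).map { case (x, y) => (x - ma) * (y - mb) }.sum
    val sa = math.sqrt(a.map(x => (x - ma) * (x - ma)).sum)
    val sb = math.sqrt(b.map(y => (y - mb) * (y - mb)).sum)
    cov / (sa * sb)
  }

  // Walk feature pairs left-to-right; drop the right-hand column of any pair
  // whose correlation exceeds the high cutoff (mirrors the A/B/C example).
  def filterFeatures(features: Map[String, Seq[Double]],
                     cutoffHigh: Double): Seq[String] = {
    val names = features.keys.toSeq.sorted
    val dropped = scala.collection.mutable.Set.empty[String]
    for {
      i <- names.indices
      j <- (i + 1) until names.length
      if !dropped(names(i)) && !dropped(names(j))
    } if (pearson(features(names(i)), features(names(j))) > cutoffHigh)
      dropped += names(j)
    names.filterNot(dropped)
  }
}
```

With B and C perfectly correlated and a cutoff of 0.8, C is culled while A and B survive, matching the doc's example.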
Setter
Covariance Cutoff for specifying the feature-to-feature correlation statistic lower cutoff boundary
Double: Threshold Cutoff Value
For feature columns A, B, and C, if A<->B is 0.02, A<->C is 0.1, and B<->C is 0.85, with a cutoff value of 0.05, column A would be removed from the feature vector for having a low correlation statistic.
IllegalArgumentException
if the value is <= -1.0
WARNING: The lower threshold boundary for correlation is less frequently used. Filtering of auto-correlated features is done primarily through .setCovarianceCutoffHigh values lower than the default of 0.99.
WARNING: This setting is not recommended for production use cases and is only potentially useful for data exploration and experimentation.
Default: -0.99
Boolean switch for setting the state of covarianceFilterFlag
Boolean
Boolean switch for setting the state of DataPrepCachingFlag
Boolean
Setter for defining the number of concurrent threads allocated to performing asynchronous data prep tasks within the feature engineering aspect of this application.
Int: A value that must be greater than zero.
0.6.0
IllegalArgumentException
if a value less than or equal to zero is supplied.
This value has an upper limit, depending on driver size, that will restrict the efficacy of the asynchronous tasks within the pool. Setting this too high may cause cluster instability.
Setter for determining the behavior of continuous feature columns. In order to calculate Entropy for a continuous variable, the distribution must be converted to nominal values for estimation of per-split information gain. This setting defines how many nominal categorical values to create out of a continuously distributed feature in order to calculate Entropy.
Int -> must be greater than 1
0.6.2
IllegalArgumentException
if the value specified is <= 1
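The discretization described above can be sketched as follows: bucket the continuous values into a fixed number of equal-width bins, then compute Shannon entropy over the resulting nominal assignments. A minimal sketch with illustrative names, not the toolkit's own binning strategy:

```scala
object EntropyBinningSketch {

  // Discretize a continuous feature into `bins` equal-width buckets.
  // Assumes the values are not all identical (width > 0).
  def discretize(values: Seq[Double], bins: Int): Seq[Int] = {
    require(bins > 1, "bucket count must be > 1")
    val lo = values.min
    val width = (values.max - lo) / bins
    values.map(v => math.min(((v - lo) / width).toInt, bins - 1))
  }

  // Shannon entropy (base 2) over the nominal bucket assignments
  def entropy(buckets: Seq[Int]): Double = {
    val n = buckets.size.toDouble
    buckets.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p) / math.log(2)
    }.sum
  }
}
```

A uniform spread over four buckets yields the maximum entropy of 2 bits, while a skewed distribution yields less.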
Setter for defining the state of the featureInteractionFlag
Boolean on/off
0.6.2
Setter for configuring the concurrent count for scoring of feature interaction candidates. Due to the nature of these operations, the configuration here may need to be set differently from that of the modeling and general feature engineering phases of the toolkit. This is highly dependent on the row count of the data set being submitted.
Int -> must be greater than 0
0.6.2
IllegalArgumentException
if the value is < 1
Setter for determining the mode of operation for inclusion of interacted features.
String -> one of: 'all', 'optimistic', or 'strict'
0.6.2
IllegalArgumentException
if the value submitted is not permitted
Setter for establishing the minimum acceptable InformationGain or Variance allowed for an interaction candidate, based on comparison to the scores of its parents.
Double in range of -inf -> inf
0.6.2
Setter for the cardinality check mode to be used. Available modes are "warn" and "silent". In "warn" mode, an exception will be thrown if the cardinality for a categorical column is above the threshold. In "silent" mode, the field will be ignored from processing and will not be included in the feature vector.
String: either "warn" or "silent"
0.5.2
IllegalArgumentException
if the mode supplied is not either "warn" or "silent"
Default: "silent"
Setter for overriding the default cardinality limit when validating whether a field should be considered for OneHotEncoding or StringIndexing
Int: The value above which a field will be declared to be of too high a cardinality for StringIndexing or OneHotEncoding
0.5.2
java.lang.IllegalArgumentException
if the number is <= 0
Default: 200
Setter for defining the precision calculation when in "approx" mode for cardinalityType. Must be in range 0 -> 1.
Double: The precision for approximate distinct calculations for cardinality purposes
0.5.2
java.lang.IllegalArgumentException
if the Double supplied is outside of the range of 0 -> 1
Setter for direct override of the cardinality switch
0.5.2
Default: true
Setter for specifying the mode of cardinality checking (either "approx" for approximate distinct or "exact")
String: either "approx" or "exact"
0.5.2
IllegalArgumentException
if a mode other than "exact" or "approx" is specified.
Default: "exact"
Setter for providing a map of [Column Name -> String Fill Value] for manual by-column overrides. Any non-specified fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values in non-numeric columns.
Map[String, String]: Column Name as String -> Fill Value as String
0.5.2
If fields are specified here that are not part of the DataFrame's schema, an exception will be thrown.
If naFillMode is specified as using Map Fill modes, this setter or the numeric na fill map MUST be set.
Setter
Specifies the behavior of the naFill algorithm for character (String, Char, Boolean, Byte, etc.) fields.
Generated through a df.summary() method. Available options are: "min" (least frequently occurring value) or "max" (most frequently occurring value)
String: member of allowable list
IllegalArgumentException
if an invalid entry is made.
Default: "max"
Setter for providing a 'blanket override' value (fill all found categorical columns' missing values with this specified value).
String: A value to fill all categorical na values in the DataFrame with.
0.5.2
Setter for defining the precision for calculating the model type as per the label column
Double: Precision accuracy for approximate distinct calculation.
0.5.2
java.lang.AssertionError
if the value is outside of the allowable range of 0 -> 1
Setting this value to zero (0) for a large regression problem will incur a long processing time and an expensive shuffle.
Mode for na fill
Available modes:
auto: Stats-based na fill for fields. Usage of .setNumericFillStat and .setCharacterFillStat will inform the type of statistics that will be used to fill.
mapFill: Custom by-column overrides to 'blanket fill' na values on a per-column basis. The categorical (string) fields are set via .setCategoricalNAFillMap while the numeric fields are set via .setNumericNAFillMap.
blanketFillAll: Fills all fields based on the values specified by .setCharacterNABlanketFillValue and .setNumericNABlanketFillValue. All NA's for the appropriate types will be filled in accordingly throughout all columns.
blanketFillCharOnly: Will use statistics to fill in numeric fields, but will replace all categorical character fields' na values with a blanket fill value.
blanketFillNumOnly: Will use statistics to fill in character fields, but will replace all numeric fields' na values with a blanket value.
String: Mode for NA Fill
0.5.2
IllegalArgumentException
if the mode specified is not supported.
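The difference between the stats-based ("auto") and blanket-fill modes above can be sketched on a single numeric column. This is a minimal illustration with made-up names, not the toolkit's fill implementation:

```scala
object NaFillSketch {

  // "auto"-style fill: compute a statistic (the mean here, matching the
  // default numeric fill stat) from the present values and fill NAs with it
  def fillWithMean(col: Seq[Option[Double]]): Seq[Double] = {
    val present = col.flatten
    val mean = present.sum / present.size
    col.map(_.getOrElse(mean))
  }

  // "blanketFillAll"-style fill: every missing value receives the same
  // caller-supplied override value
  def blanketFill(col: Seq[Option[Double]], value: Double): Seq[Double] =
    col.map(_.getOrElse(value))
}
```

For the column [1.0, NA, 3.0], the stats-based fill produces 2.0 (the mean) while a blanket fill of -1.0 produces -1.0 in the same slot.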
Setter
Specifies the behavior of the naFill algorithm for numeric (continuous) fields. Values that are generated as potential fill candidates are set according to the available statistics that are calculated from a df.summary() method. Available options are: "min", "25p", "mean", "median", "75p", or "max"
String: member of allowable list.
IllegalArgumentException
if an invalid entry is made.
Default: "mean"
Setter for providing a 'blanket override' value (fill all found numeric columns' missing values with this specified value)
Double: A value to fill all numeric na values in the DataFrame with.
0.5.2
Setter for providing a map of [Column Name -> AnyVal Fill Value] (must be numeric). Any non-specified fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values in numeric columns.
Map[String, AnyVal]: Column Name as String -> Fill Numeric Type Value
0.5.2
If fields are specified here that are not part of the DataFrame's schema, an exception will be thrown.
If naFillMode is specified as using Map Fill modes, this setter or the categorical na fill map MUST be set.
Setter
Allows for setting a series of custom mlflow logging tags to an experiment run (universal across all iterations and models of the run) to be logged in mlflow as custom tag key-value pairs
Array of Map[String -> AnyVal]
The mapped values can be of types: Double, Float, Long, Int, Short, Byte, Boolean, or String
MLFlow Logging Config
Boolean switch for setting the state of naFillFlag
Boolean (whether to execute filling of na values on the DataFrame's non-ignored fields)
Boolean switch for setting the state of oneHotEncodeFlag
Boolean
Setter
Defines the determination of whether to classify a numeric field as ordinal (categorical) or continuous.
Int: Threshold for distinct counts within a numeric feature field.
Continuous data fields are eligible for outlier filtering. Categorical fields are not; fields whose cardinality falls below the threshold set by this value will be ignored by the filtering action.
Setter
Defines an Array of fields to be ignored from outlier filtering.
Array[String]: field names to be ignored from outlier filtering.
Setter
Configures the tails of a distribution to filter out, along with the ntile settings defined in .setOutlierLowerFilterNTile() and/or .setOutlierUpperFilterNTile()
Available Modes:
"lower" -> filters out rows from the data that are below the value set in .setOutlierLowerFilterNTile()
"upper" -> filters out rows from the data that are above the value set in .setOutlierUpperFilterNTile()
"both" -> two-tailed filter that combines both an "upper" and "lower" filter.
String: Tailed direction setting for outlier filtering.
This filter action is disabled by default. Before enabling, please ensure the fields to be filtered are adequately reflected in the .setOutlierFieldsToIgnore() inverse selection, and verify the general distribution of the fields that have outlier data in order to select an appropriate NTile value. This feature should only be used in rare instances, and the impacts that this filter may have should be fully understood before enabling it.
Default: "both"
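The two-tailed NTile filtering described above can be sketched in pure Scala: compute the lower and upper quantile boundaries of a column, then keep only rows inside them. A minimal sketch using a nearest-rank quantile; names and the quantile method are illustrative, not the toolkit's internals:

```scala
object OutlierFilterSketch {

  // Nearest-rank quantile of a numeric column
  def quantile(values: Seq[Double], q: Double): Double = {
    val sorted = values.sorted
    val idx = math.ceil(q * sorted.size).toInt - 1
    sorted(math.max(idx, 0))
  }

  // "both" mode: keep rows whose value lies within [lowerNTile, upperNTile]
  def filterBoth(values: Seq[Double], lower: Double, upper: Double): Seq[Double] = {
    val lo = quantile(values, lower)
    val hi = quantile(values, upper)
    values.filter(v => v >= lo && v <= hi)
  }
}
```

With values 1 through 10 plus an outlier of 1000, a 0.05/0.9 NTile pair keeps 1 through 10 and drops the outlier.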
Boolean switch for setting the state of outlierFilterFlag
Boolean
Setter
Defines the precision (RSD) with which each field's cardinality is calculated through the use of the SparkSQL function approx_count_distinct. Lower values specify higher accuracy, but consume more computational resources.
Double: In range of 0.0, 1.0
IllegalArgumentException
if the value supplied is outside of the Range(0.0, 1.0)
A value of 0.0 will be an exact computation of distinct values. Therefore, all data must be shuffled, which is an expensive task.
See https://en.wikipedia.org/wiki/Coefficient_of_variation for an explanation of RSD.
Setter
Defines the NTILE value of the distributions of feature fields below which rows will be filtered from the data.
Double: Lower Threshold boundary NTILE for Outlier Filtering
IllegalArgumentException
if the value supplied is outside of the Range(0.0, 1.0)
Only used if Outlier filtering is set to 'On' and Filter Direction is either 'both' or 'lower'
Setter
Defines the NTILE value of the distributions of feature fields above which rows will be filtered from the data.
Double: Upper Threshold boundary NTILE value for Outlier Filtering
IllegalArgumentException
if the value supplied is outside of the Range(0.0, 1.0)
Only used if Outlier filtering is set to 'On' and Filter Direction is either 'both' or 'upper'
Setter
Provides the ntile threshold above or below which (depending on the PearsonFilterDirection setting) fields will be removed, depending on the distribution of pearson statistics from all feature columns.
Double: In range of (0.0, 1.0)
IllegalArgumentException
if the value provided is outside of the range of (0.0, 1.0)
Default: 0.75 (Q3)
WARNING: This feature is ONLY recommended for exploratory development work.
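The "auto" ntile behavior described above can be illustrated in pure Scala: compute each feature's correlation statistic against the label, take the requested ntile of those statistics as a cutoff, and (for direction "greater") drop features above it. This is a simplified sketch with invented names, not the library's Pearson implementation:

```scala
object PearsonNTileSketch {

  // Pearson correlation coefficient between two equal-length numeric series
  def pearson(a: Seq[Double], b: Seq[Double]): Double = {
    val n = a.length
    val ma = a.sum / n
    val mb = b.sum / n
    val cov = a.zip(b).map { case (x, y) => (x - ma) * (y - mb) }.sum
    val sa = math.sqrt(a.map(x => (x - ma) * (x - ma)).sum)
    val sb = math.sqrt(b.map(y => (y - mb) * (y - mb)).sum)
    cov / (sa * sb)
  }

  // "auto" mode, direction "greater": drop features whose |corr to label|
  // lands above the given ntile of all features' correlation statistics
  def autoFilter(features: Map[String, Seq[Double]],
                 label: Seq[Double],
                 ntile: Double): Seq[String] = {
    val stats = features.map { case (k, v) => k -> math.abs(pearson(v, label)) }
    val sorted = stats.values.toSeq.sorted
    val cutoff = sorted(math.min((ntile * sorted.size).toInt, sorted.size - 1))
    stats.collect { case (k, s) if s <= cutoff => k }.toSeq.sorted
  }
}
```

A feature perfectly correlated with the label (a likely leak) lands above the median ntile and is culled, while weaker features survive.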
Setter
Controls which direction of correlation values to filter out. Allowable modes: "greater" or "lesser"
String: one of available modes
IllegalArgumentException
if the value provided is not in the available modes list.
Default: "greater"
Boolean switch for setting the state of pearsonFilterFlag
Boolean
Setter
Controls the Pearson manual filter value, if the PearsonFilterMode is set to "manual"
Double: A cut-off point; fields whose correlation statistic falls beyond this value will be culled from the feature vector.
With .setPearsonFilterMode("manual") and .setPearsonFilterDirection("greater"), fields that have a pearson correlation coefficient result above this value will be dropped from modeling runs.
Setter
Controls whether to use "auto" mode (using the PearsonAutoFilterNTile) or "manual" mode (using the PearsonFilterManualValue) to cull fields from the feature vector.
String: either "auto" or "manual"
IllegalArgumentException
if the value provided is not in the available modes list ("auto" and "manual")
Default: "auto"
Setter
Selection for the filter statistic to be used in Pearson Filtering. Available modes: "pvalue", "degreesFreedom", or "pearsonStat"
String: one of available modes.
IllegalArgumentException
if the value provided is not in the available modes list.
Default: "pearsonStat"
Boolean switch for setting the state of the scalingFlag
Boolean
Setter for determining the split caching strategy (either persist to disk for each kfold split or backing to Delta)
Configuration string: either 'persist' or 'delta'
0.7.1
Algorithm Config
Tuner Config
Setter for defining the secondary stopping criteria for continuous training mode (the number of consistently non-improving runs after which the learning algorithm terminates due to diminishing returns).
Negative Integer (an improvement over the prior best will reset the counter; each subsequent non-improvement will decrement a mutable counter. If the counter hits the limit specified in this value, the continuous mode algorithm will stop).
0.6.0
IllegalArgumentException
if the value is positive.
Setter for providing a path to write the kfold train/test splits as Delta data sets (useful for extremely large data sets, or situations where using local disk storage might be prohibitively expensive)
String path to a dbfs location for creating the temporary (or persisted) split data
0.7.1
Setter for whether or not to delete the written train/test splits for the run in Delta. Defaults to true, which means the job will delete the data on Object store to clean itself up after the run is completed if the splitCachingStrategy is set to 'delta'
Boolean: true => delete, false => leave on Object Store
0.7.1
Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through mutation for each generation other than the initial and post-modeling optimization phases. The larger this value (default: 10), the more potential space can be searched. There is not a large performance hit to this, and as such, values in excess of 100 are viable.
Int: a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential candidates.
0.6.0
IllegalArgumentException
if the value is not greater than zero.
Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates
String: one of "XGBoost", "LinearRegression", or "RandomForest"
0.6.0
IllegalArgumentException
if the value is not supported
Setter for overriding the cardinality threshold exception threshold. [WARNING] Increasing this value on a sufficiently large data set could incur, during runtime, excessive memory and cpu pressure on the cluster.
Int: the limit above which an exception will be thrown for a classification problem wherein the label distinct count is too large to successfully generate synthetic data.
0.5.1
Default: 20
Setter for specifying the number of K-Groups to generate in the KMeans model
Int: number of k groups to generate
this
Setter for which distance measurement to use to calculate the nearness of vectors to a centroid
String: Options -> "euclidean" or "cosine". Default: "euclidean"
this
IllegalArgumentException
if an invalid value is entered
Setter for specifying the maximum number of iterations for the KMeans model to go through to converge
Int: Maximum limit on iterations
this
Setter for the internal KMeans column for cluster membership attribution
String: column name for internal algorithm column for group membership
this
Setter for a KMeans seed for the clustering algorithm
Long: Seed value
this
Setter for setting the tolerance for KMeans (must be > 0)
The tolerance value setting for KMeans
this
IllegalArgumentException
if a value less than 0 is entered
Reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans for further details.
Setter for configuring the number of Hash Tables to use for MinHashLSH
Int: Count of hash tables to use
this
See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH for more information
Setter for the internal LSH output hash information column
String: column name for the internal MinHashLSH Model transformation value
this
Setter for determining the label balance approach mode.
String: one of 'match', 'percentage', or 'target'
0.5.1
IllegalArgumentException
if the provided mode is not supported.
Default: "percentage"
Available modes:
'match': Will match all smaller class counts to the largest class count. [WARNING] May significantly increase memory pressure!
'percentage': Will adjust smaller classes to a percentage value of the largest class count.
'target': Will increase smaller class counts to a fixed numeric target of rows.
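The three balance modes above can be expressed as a small per-class target calculation. This is an illustrative sketch of the mode semantics (invented names; the toolkit's synthetic-row generation itself is KMeans/MinHashLSH-based and far more involved):

```scala
object LabelBalanceSketch {

  // Given per-class row counts, compute the post-balancing target count per
  // class for each of the three modes; classes are never shrunk.
  def targetCounts(counts: Map[String, Long],
                   mode: String,
                   percentage: Double = 0.2,
                   target: Long = 0L): Map[String, Long] = {
    val largest = counts.values.max
    counts.map { case (label, n) =>
      val goal = mode match {
        case "match"      => largest                       // grow to largest class
        case "percentage" => (largest * percentage).toLong // fraction of largest
        case "target"     => target                        // fixed row target
        case other =>
          throw new IllegalArgumentException(s"unsupported mode: $other")
      }
      label -> math.max(n, goal)
    }
  }
}
```

With counts {a: 100, b: 10}, 'match' grows b to 100, 'percentage' at 0.5 grows b to 50, and 'target' at 30 grows b to 30, while a is untouched in all modes.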
Setter for the minimum threshold of vector indexes to mutate within the feature vector.
The minimum (or fixed) number of indexes to mutate.
this
In vectorMutationMethod "fixed", this sets the fixed count of how many vector positions to mutate. In vectorMutationMethod "random", this sets the lower threshold for 'at least this many indexes will be mutated'.
Setter for the Mutation Mode of the feature vector's individual values
String: the mode to use.
this
IllegalArgumentException
if the mode is not supported.
Options:
"weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors
"random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors
"ratio" - uses a ratio between the values of the centroid vector and the mutation vector
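The "weighted" mode above can be sketched as linear interpolation between a candidate vector and its centroid, with the magnitude controlling how far the synthetic row moves toward the centroid. A simplified illustration with invented names, not the library's mutation code:

```scala
object MutationSketch {

  // "weighted"-style mutation sketch: interpolate each component of the
  // candidate toward the centroid. The higher the magnitude, the closer the
  // synthetic row lies to the centroid (see setMutationMagnitude's note).
  def weightedMutate(centroid: Vector[Double],
                     candidate: Vector[Double],
                     magnitude: Double): Vector[Double] = {
    require(magnitude > 0 && magnitude < 1,
      "mutation magnitude must be inside (0, 1)")
    centroid.zip(candidate).map { case (c, m) => m + (c - m) * magnitude }
  }
}
```

At magnitude 0.5 the synthetic point is the midpoint; at 0.9 it sits 90% of the way toward the centroid.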
Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode
Double: value between 0 and 1 for mutation magnitude adjustment.
this
IllegalArgumentException
if the value specified is outside of the range (0, 1)
The higher this value, the closer the synthetic row data will be to the centroid vector vs. the candidate mutation vector.
Setter for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode()
Double: A fractional double in the range of 0.0 to 1.0.
0.5.1
UnsupportedOperationException
if the provided value is outside of the range of 0.0 -> 1.0
Default: 0.2
Setting this value to 1.0 is equivalent to setting the label balance mode to 'match'
Setter for specifying the target row count to generate for 'target' mode in setLabelBalanceMode()
Int: The desired final number of rows per minority class label
0.5.1
[WARNING] Setting this value too high will greatly increase runtime and memory pressure.
Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data
Int: Number of vectors to find nearest each centroid within the class
this
The higher the value set here, the higher the variance in synthetic data generation
Setter for setting the Synthetic column name
String: A column name that is uniquely not part of the main DataFrame
0.5.1
Setter for the Vector Mutation Method
String: the mode to use.
this
IllegalArgumentException
if the mode is not supported.
Options:
"fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes.
"random" - will use this number as a lower bound on a random selection of indexes between this and the vector length.
"all" - will mutate all of the vectors.
Boolean switch for setting the state of varianceFilterFlag
Boolean (whether or not to filter out fields from the feature vector that all have the same value)
Boolean switch for turning variance filtering off
Default: On
Boolean switch for turning variance filtering on
Default: On
Setter
The threshold value that is used to detect, based on the supplied labelCol, the cardinality of the label through a .distinct().count() being issued to the label column. Cardinality values above this setter's value will be considered a Regression Task; those below, a Classification Task.
Int: Threshold value for the labelCol cardinality check. Values above this setting will be determined to be a regression task; below, a classification task.
Default: 50
If an exception is thrown for incorrect type (a classifier was detected, but the intended usage is regression), lower this value. Conversely, if a classification problem has a number of classes above this setting's default threshold (50), increase this value.
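The detection rule above reduces to a distinct count compared against the threshold. A minimal sketch (illustrative names; the library performs this against the DataFrame's label column, optionally with approximate distinct counting):

```scala
object ModelTypeSketch {

  // Decide the modeling family from the label column's distinct count,
  // mirroring the cardinality threshold check (default 50)
  def detectModelType(labels: Seq[Double], cardinalitySwitch: Int = 50): String =
    if (labels.distinct.size > cardinalitySwitch) "regressor" else "classifier"
}
```

A binary 0/1 label resolves to a classification task, while a label with hundreds of distinct continuous values resolves to regression.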
Main Configuration Generator utility class, used for generating a modeling configuration to execute the autoML framework.
0.5