Method for estimating (roughly) the time remaining in the genetic algorithm training run
The current Generation that the model is running on
The index of the current model that is being run.
A Double representing the total completion percentage of the modeling portion of the run.
0.2.1
Due to the asynchronous nature of the algorithm, the times are not exact; they reflect the time elapsed since the Futures were created and first inserted into the thread pool.
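A minimal sketch of how such a rough estimate could be derived from the completion percentage. The object name, the function name, and the assumption that the percentage is expressed on a 0 to 100 scale are hypothetical and are not taken from the library; the linear extrapolation is only one possible approach.

```scala
object RemainingTimeSketch {
  // Hypothetical helper: linear extrapolation from elapsed time.
  // Assumes completionPercent is on a 0-100 scale (an assumption, not confirmed by the source).
  def estimateRemainingSeconds(elapsedSeconds: Double, completionPercent: Double): Option[Double] =
    if (completionPercent <= 0.0 || completionPercent > 100.0) None // no basis for an estimate
    else Some(elapsedSeconds * (100.0 - completionPercent) / completionPercent)
}
```

For example, 60 seconds elapsed at 25 percent completion extrapolates to 180 seconds remaining. Because the Futures run asynchronously, an estimate like this should be treated as approximate.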
Method for validating the distinct class count for a classification type model (for use in determining which evaluator to employ for scoring and optimization of each model)
source DataFrame (prior to splitting for train/test)
Boolean true for Binary Classification problem, false for multi-class problem
0.4.0
Method for restricting the metrics that are available for optimizing classification problems
boolean check from classificationAdjudicator() method
the hard-coded List[String] of allowable classification metrics from com.databricks.labs.automl.params.EvolutionDefaults
a copy of the allowable params list with the Binary metrics removed if this is a multiclass problem.
0.4.0
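The two steps above (deciding binary versus multi-class, then filtering the metric list) can be sketched with plain Scala collections standing in for the DataFrame. The object name, the function names, the rule that two or fewer distinct labels means binary, and the contents of binaryOnlyMetrics are assumptions for illustration; the real allowable list lives in EvolutionDefaults.

```scala
object ClassificationMetricSketch {
  // Assumed example of binary-only metric names; not copied from EvolutionDefaults.
  val binaryOnlyMetrics: Set[String] = Set("areaUnderROC", "areaUnderPR")

  // Hypothetical stand-in for the distinct class count check:
  // treats two or fewer distinct labels as a binary problem (an assumption).
  def isBinary(labels: Seq[Double]): Boolean = labels.distinct.size <= 2

  // Returns a copy of the allowable list, dropping binary-only metrics for multi-class problems.
  def allowableMetrics(isBinaryProblem: Boolean, allowable: List[String]): List[String] =
    if (isBinaryProblem) allowable else allowable.filterNot(binaryOnlyMetrics.contains)
}
```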
Method for scoring and evaluating classification models (supporting both multi-class and binary classification problems)
the metric to be tested against (both for binary and multi-class)
the column name in the data set that is the 'source of truth' to compare against
the DataFrame that has been transformed
the score, as a Double value.
0.4.0
Helper function for partially updating a numeric mapping
The default configuration Map for a numeric mapping for model hyperparameter search space
user-supplied updated map (doesn't have to have all elements in it)
The default map, updated with the user-supplied overrides
0.6.1
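A minimal sketch of a partial update, assuming the mapping is an immutable Map keyed by hyperparameter name. The object name, the function name, the (lower, upper) tuple value type, and the example keys are hypothetical.

```scala
object MapOverrideSketch {
  // Keys present in overrides replace the defaults; all other defaults are kept.
  def partialUpdate[V](defaults: Map[String, V], overrides: Map[String, V]): Map[String, V] =
    defaults ++ overrides

  // Hypothetical numeric search space: name -> (lower bound, upper bound).
  val defaults: Map[String, (Double, Double)] =
    Map("maxDepth" -> (2.0, 10.0), "stepSize" -> (0.01, 1.0))

  // The user supplies only the element to change.
  val merged: Map[String, (Double, Double)] =
    partialUpdate(defaults, Map("maxDepth" -> (3.0, 8.0)))
}
```

Note that `++` alone would also accept keys that do not exist in the defaults, which is why a separate key validation step is useful before merging.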
Helper function for partially updating a string mapping
The default configuration Map for a string mapping for model hyperparameter search space
user-supplied updated map (doesn't have to have all elements in it)
The default map, updated with the user-supplied overrides
0.6.1
Helper method for post-run model optimization based on a theoretical multidimensional hyperparameter grid search space. After a genetic tuning run is complete, this allows a model to be trained and run to predict a potential best-condition hyperparameter configuration.
Array of GBT Configuration (hyper parameter settings) from the post-run model inference
The results of the hyper parameter test, as well as the scored DataFrame report.
Method for scoring Regression models.
The metric desired to be tested
The name of the label column
the DataFrame that has been transformed by a model.
the score for the metric, as a Double value.
0.4.0
Setter for overriding the label cardinality threshold above which an exception is thrown. [WARNING] Increasing this value on a sufficiently large data set could incur excessive memory and CPU pressure on the cluster at runtime.
Int: the limit above which an exception will be thrown for a classification problem wherein the label distinct count is too large to successfully generate synthetic data.
0.5.1
Default: 20
Setter for defining the secondary stopping criteria for continuous training mode (the number of consecutive non-improving runs after which the learning algorithm terminates due to diminishing returns).
Negative Integer. An improvement over the prior best resets a mutable counter, and each subsequent non-improvement decrements it. If the counter reaches the limit specified in this value, the continuous mode algorithm will stop.
0.6.0
IllegalArgumentException
if the value is positive.
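The counter behaviour described above can be sketched as follows. The object name, the function name, and the assumption that a higher score is better are hypothetical; the library's own loop is not shown in this documentation.

```scala
object StagnationSketch {
  // Returns the index of the run at which training would stop, or None if the limit is never hit.
  // Assumes higher scores are better (an assumption for this sketch).
  def stopIndex(scores: Seq[Double], stagnationLimit: Int): Option[Int] = {
    // Mirrors the documented contract: a positive value is rejected.
    require(stagnationLimit <= 0, s"stagnationLimit must not be positive, got $stagnationLimit")
    var best = Double.NegativeInfinity
    var counter = 0
    scores.indices.find { i =>
      if (scores(i) > best) { best = scores(i); counter = 0 } // improvement resets the counter
      else counter -= 1                                       // non-improvement decrements it
      counter <= stagnationLimit
    }
  }
}
```

With a limit of -3, the score sequence 0.5, 0.6, 0.6, 0.55, 0.58 stops at the fifth run (index 4), since the last three runs fail to beat 0.6.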
Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through mutation for each generation other than the initial and post-modeling optimization phases. The larger this value (default: 10), the more potential space can be searched. There is not a large performance hit to this, and as such, values in excess of 100 are viable.
Int - a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential candidates.
0.6.0
IllegalArgumentException
if the value is not greater than zero.
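As a rough illustration of the arithmetic, the candidate count is the product of the two settings. The object and function names are hypothetical; only numberOfMutationsPerGeneration is a name from the documentation above.

```scala
object CandidateFactorSketch {
  // Hypothetical helper: number of candidates to generate for one generation.
  def candidateCount(numberOfMutationsPerGeneration: Int, factor: Int): Int = {
    // require throws IllegalArgumentException, matching the documented contract.
    require(factor > 0, s"factor must be greater than zero, got $factor")
    numberOfMutationsPerGeneration * factor
  }
}
```

With 10 mutations per generation and the default factor of 10, this yields 100 candidates.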
Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates
String - one of "XGBoost", "LinearRegression" or "RandomForest"
0.6.0
IllegalArgumentException
if the value is not supported
Setter for specifying the number of K-Groups to generate in the KMeans model
Int: number of k groups to generate
this
Setter for which distance measurement to use to calculate the nearness of vectors to a centroid
String: Options -> "euclidean" or "cosine". Default: "euclidean"
this
IllegalArgumentException()
if an invalid value is entered
Setter for specifying the maximum number of iterations for the KMeans model to go through to converge
Int: Maximum limit on iterations
this
Setter for the internal KMeans column for cluster membership attribution
String: column name for internal algorithm column for group membership
this
Setter for a KMeans seed for the clustering algorithm
Long: Seed value
this
Setter for the tolerance for KMeans (must be >0)
The tolerance value setting for KMeans
this
IllegalArgumentException()
if a value less than 0 is entered
reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans for further details.
Setter for configuring the number of hash tables to use for MinHashLSH
Int: Count of hash tables to use
this
reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH for more information
Setter for the internal LSH output hash information column
String: column name for the internal MinHashLSH Model transformation value
this
Setter for the LSH Seed for the model
Setter - for determining the label balance approach mode.
String: one of: 'match', 'percentage' or 'target'
0.5.1
UnsupportedOperationException()
if the provided mode is not supported.
Default: "percentage"
Available modes:
'match': Will match all smaller class counts to largest class count. [WARNING] - May significantly increase memory pressure!
'percentage': Will adjust smaller classes to a percentage value of the largest class count.
'target': Will increase smaller class counts to a fixed numeric target of rows.
Setter for minimum threshold for vector indexes to mutate within the feature vector.
The minimum (or fixed) number of indexes to mutate.
this
In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will be mutated'
Setter for the Mutation Mode of the feature vector individual values
String: the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
"weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors
"random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors
"ratio" - uses a ratio between the values of the centroid vector and the mutation vector
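One way to picture the "weighted" and "random" modes for a single vector position is shown below. The object name, the function name, and both formulas are assumptions made for illustration, since this documentation does not give the library's exact arithmetic; "ratio" is deliberately left out for the same reason.

```scala
object MutationModeSketch {
  import scala.util.Random

  // Hypothetical per-value mutation between a centroid value and a candidate value.
  def mutateValue(centroid: Double, candidate: Double, mode: String,
                  magnitude: Double, rng: Random): Double = mode match {
    case "weighted" =>
      require(magnitude > 0.0 && magnitude < 1.0, "magnitude must be in the range (0, 1)")
      // Assumed formula: a higher magnitude pulls the result toward the centroid.
      centroid * magnitude + candidate * (1.0 - magnitude)
    case "random" =>
      // Assumed formula: a uniformly random point on the segment between the two values.
      centroid + rng.nextDouble() * (candidate - centroid)
    case other =>
      throw new IllegalArgumentException(s"Mode '$other' is not covered by this sketch")
  }
}
```

Under these assumed formulas, a centroid value of 10.0, a candidate value of 0.0, and a magnitude of 0.75 give 7.5 in "weighted" mode.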
Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode
Double: value between 0 and 1 for mutation magnitude adjustment.
this
IllegalArgumentException()
if the value specified is outside of the range (0, 1)
the higher this value, the closer to the centroid vector vs. the candidate mutation vector the synthetic row data will be.
Setter - for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode()
Double: A fractional double in the range of 0.0 to 1.0.
0.5.1
UnsupportedOperationException()
if the provided value is outside of the range of 0.0 -> 1.0
Default: 0.2
Setting this value to 1.0 is equivalent to setting the label balance mode to 'match'
Setter - for specifying the target row count to generate for 'target' mode in setLabelBalanceMode()
Int: The desired final number of rows per minority class label
0.5.1
[WARNING] Setting this value too high will greatly increase runtime and memory pressure.
Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data
Int: Number of vectors to find nearest each centroid within the class
this
the higher the value set here, the higher the variance in synthetic data generation
Setter for the name of the synthetic column
String: A column name that does not already exist in the main DataFrame
0.5.1
Setter for the Vector Mutation Method
String - the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
"fixed" - will use the value of minimumVectorCountToMutate as the number of randomly selected indexes to mutate.
"random" - will use the value of minimumVectorCountToMutate as a lower bound on a random selection of indexes between that value and the vector length.
"all" - will mutate all of the vector's indexes.
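A sketch of how the three methods could choose which vector positions to mutate. The object name, the function signature, and the clamping of the minimum count to the vector length are assumptions; only minimumVectorCountToMutate and the method names come from the documentation above.

```scala
object VectorMutationIndexSketch {
  import scala.util.Random

  // Hypothetical helper: returns the sorted positions of the feature vector to mutate.
  def indexesToMutate(vectorLength: Int, method: String,
                      minimumVectorCountToMutate: Int, rng: Random): Seq[Int] = {
    val all = 0 until vectorLength
    // Clamp the configured minimum into [0, vectorLength] (an assumption for safety).
    val floor = math.min(math.max(minimumVectorCountToMutate, 0), vectorLength)
    method match {
      case "all"   => all
      case "fixed" => rng.shuffle(all.toList).take(floor).sorted
      case "random" =>
        // Pick a count between the minimum and the vector length, inclusive.
        val count = floor + rng.nextInt(vectorLength - floor + 1)
        rng.shuffle(all.toList).take(count).sorted
      case other =>
        throw new IllegalArgumentException(s"Unsupported vector mutation method: $other")
    }
  }
}
```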
Internal method for validating if a numeric mapping that is specified contains any invalid keys
The static defined numeric mapping for a model type
a user-specified mapping override
0.6.1
IllegalArgumentException
if the key is invalid for the model type specified.
Internal method for validating if a string mapping that is specified contains any invalid keys
The static defined string mapping for a model type
a user-specified mapping override
0.6.1
IllegalArgumentException
if the key is invalid for the model type specified.
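The key validation described for both the numeric and string mappings can be sketched generically. The object name, the function name, and the error message wording are hypothetical; the sketch only assumes that any override key absent from the static default mapping is invalid.

```scala
object MappingKeyValidationSketch {
  // Throws IllegalArgumentException (via require) if overrides contains a key
  // that is not defined in the static default mapping.
  def validateKeys[V](defaults: Map[String, V], overrides: Map[String, V]): Unit = {
    val invalid = overrides.keySet.diff(defaults.keySet)
    require(
      invalid.isEmpty,
      s"Invalid mapping keys: ${invalid.mkString(", ")}. Allowed keys: ${defaults.keys.mkString(", ")}"
    )
  }
}
```

Running a check like this before a partial update keeps a misspelled hyperparameter name from being silently added to the search space.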