Public method for returning the current state of the configuration as a new instance of the KSamplingConfiguration
The current configuration state, as a KSamplingConfiguration instance
Setter for the Feature Column name of the input DataFrame
String: name of the feature vector column
this
Setter to provide a listing of any fields that are intended to be ignored in the generated dataframe
Array[String]: field names to ignore in the data generation aspect
this
Setter for specifying the number of K-Groups to generate in the KMeans model
Int: number of k groups to generate
this
Setter for which distance measurement to use to calculate the nearness of vectors to a centroid
String: one of "euclidean" or "cosine". Default: "euclidean"
this
IllegalArgumentException()
if an invalid value is entered
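The setter contract described above (validate the input, throw IllegalArgumentException on a bad value, return this for chaining) can be sketched as follows. This is a minimal illustration, not the library's code: the class name SamplingConfigSketch and the method names are hypothetical stand-ins.

```scala
// Hypothetical stand-in for the configuration class; all names here are illustrative only.
class SamplingConfigSketch {
  private val allowedDistanceMeasures = Set("euclidean", "cosine")
  private var distanceMeasure: String = "euclidean" // the documented default

  // Returns this.type so that setter calls can be chained.
  def setDistanceMeasure(value: String): this.type = {
    // require throws IllegalArgumentException when the predicate is false
    require(
      allowedDistanceMeasures.contains(value),
      s"Distance measure '$value' is not supported. " +
        s"Must be one of: ${allowedDistanceMeasures.mkString(", ")}"
    )
    distanceMeasure = value
    this
  }

  def getDistanceMeasure: String = distanceMeasure
}
```

Because a failed require leaves the field untouched, an invalid call does not overwrite a previously accepted value.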
Setter for specifying the maximum number of iterations for the KMeans model to go through to converge
Int: Maximum limit on iterations
this
Setter for the internal KMeans column for cluster membership attribution
String: column name for internal algorithm column for group membership
this
Setter for a KMeans seed for the clustering algorithm
Long: Seed value
this
Setter for the KMeans tolerance (must be > 0)
The tolerance value for KMeans
this
IllegalArgumentException()
if the value entered is not greater than 0
See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans for further details.
Setter for the number of hash tables to use for MinHashLSH
Int: Count of hash tables to use
this
See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH for more information.
Setter for the internal LSH output hash information column
String: column name for the internal MinHashLSH Model transformation value
this
Setter for a MinHashLSH seed value for the model.
Long: a seed value
this
Setter for the Label Column name of the input DataFrame
String: name of the label column
this
Setter for minimum threshold for vector indexes to mutate within the feature vector.
The minimum (or fixed) number of indexes to mutate.
this
With vectorMutationMethod "fixed", this sets the exact number of vector positions to mutate. With vectorMutationMethod "random", this sets the lower bound: at least this many indexes will be mutated.
Setter for the Mutation Mode of the feature vector individual values
String: the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
  "weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors
  "random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors
  "ratio" - uses a ratio between the values of the centroid vector and the mutation vector
Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode
Double: value between 0 and 1 for mutation magnitude adjustment.
this
IllegalArgumentException()
if the value specified is outside of the range (0, 1)
The higher this value, the closer the synthetic row data will be to the centroid vector rather than to the candidate mutation vector.
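One plausible reading of the "weighted" and "random" modes is sketched below. It assumes "weighted" is a linear blend in which the mutation value is the weight on the centroid, which matches the note that higher values move the result toward the centroid. The object MutationSketch and its method names are hypothetical, and the formulas are illustrative, not taken from the library. The "ratio" mode is left out because the text does not pin down its formula.

```scala
import scala.util.Random

// Hypothetical helper; the formulas are assumptions made for illustration.
object MutationSketch {
  // "weighted": mutationValue is the weight on the centroid value,
  // so a higher mutationValue pulls the result toward the centroid.
  def weighted(centroid: Double, candidate: Double, mutationValue: Double): Double = {
    require(mutationValue > 0.0 && mutationValue < 1.0, "mutationValue must be in (0, 1)")
    centroid * mutationValue + candidate * (1.0 - mutationValue)
  }

  // "random": pick a uniformly random point on the segment between the two values.
  def random(centroid: Double, candidate: Double, rng: Random): Double = {
    val t = rng.nextDouble() // uniform in [0, 1)
    centroid + t * (candidate - centroid)
  }

  // Apply an element-wise mutation function to two vectors of equal length.
  def mutate(centroid: Array[Double], candidate: Array[Double])(
      f: (Double, Double) => Double): Array[Double] = {
    require(centroid.length == candidate.length, "vectors must have the same length")
    centroid.zip(candidate).map { case (c, m) => f(c, m) }
  }
}
```

Under this reading, a centroid value of 10.0 and a candidate value of 20.0 give 12.5 at a mutation value of 0.75 and 17.5 at 0.25.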
Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data
Int: Number of vectors to find nearest each centroid within the class
this
The higher this value, the higher the variance in the generated synthetic data.
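To make "vectors nearest each centroid" concrete, here is a brute-force sketch that returns the k vectors closest to a centroid by euclidean distance. It is only a stand-in for illustration: the settings documented here configure MinHashLSH, so the real lookup may well be approximate rather than exhaustive. The object NearestSketch and its methods are hypothetical.

```scala
// Hypothetical helper; an exhaustive search used purely for illustration.
object NearestSketch {
  // Euclidean distance between two vectors of equal length.
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  // Return the k vectors closest to the centroid, nearest first.
  def nearest(centroid: Array[Double], vectors: Seq[Array[Double]], k: Int): Seq[Array[Double]] = {
    require(k > 0, "k must be positive")
    vectors.sortBy(v => euclidean(centroid, v)).take(k)
  }
}
```

A larger k admits candidates that lie farther from the centroid, which is one way to see why the generated data would vary more.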
Setter for the name to be used for the synthetic column flag that is attached to the output dataframe as an indication that the data present is generated and not original.
String: name to be used throughout the job to delineate the fact that the data in the row is generated.
this
Setter for the Vector Mutation Method
String: the mode to use.
this
IllegalArgumentException()
if the mode is not supported.
Options:
  "fixed" - mutates exactly minimumVectorCountToMutate randomly selected indexes
  "random" - uses minimumVectorCountToMutate as a lower bound on a randomly chosen number of indexes, up to the vector length
  "all" - mutates every index of the vector
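The three methods can be sketched as an index-selection function. This is an assumption-laden illustration rather than the library's implementation: the object IndexSelectionSketch, the function selectIndexes, and its parameters are hypothetical, with minimumCount standing in for minimumVectorCountToMutate.

```scala
import scala.util.Random

// Hypothetical helper showing one way to choose which vector indexes to mutate.
object IndexSelectionSketch {
  def selectIndexes(method: String, vectorLength: Int, minimumCount: Int, rng: Random): Seq[Int] = {
    require(vectorLength > 0, "vectorLength must be positive")
    val all = (0 until vectorLength).toList

    // Pick `count` distinct indexes at random, returned in ascending order.
    def pick(count: Int): Seq[Int] = {
      require(count >= 1 && count <= vectorLength, "count must be between 1 and the vector length")
      rng.shuffle(all).take(count).sorted
    }

    method match {
      case "fixed" => pick(minimumCount) // exactly minimumCount indexes
      case "random" =>
        require(minimumCount >= 1 && minimumCount <= vectorLength,
          "minimumCount must be between 1 and the vector length")
        // a random count between minimumCount and vectorLength, inclusive
        pick(minimumCount + rng.nextInt(vectorLength - minimumCount + 1))
      case "all" => all // every index
      case other =>
        throw new IllegalArgumentException(s"Vector mutation method '$other' is not supported.")
    }
  }
}
```

Passing a seeded Random keeps the selection reproducible across runs.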