Split methodology for producing the train and test sets from KSample up-sampled data.
The real and the synthetic data sets are each split into train and test.
The returned train set is the union of the real train and synthetic train data; the returned test set contains only the real test data.
DataFrame: The full data set (containing a synthetic column that indicates whether each row is real or synthetic)
Long: A seed value that is consistent across both data sets
Array[Row]: The unique entries of the label values
Array[DataFrame] of Array(trainData, testData)
0.5.1
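The split described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `splitUpSampled`, the `trainPortion` parameter, and the boolean column name `synthetic` are hypothetical, and the per-label handling implied by the unique label values argument is left out for brevity.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper for illustration only.
// Assumes `syntheticCol` is a boolean column that is true for generated rows.
def splitUpSampled(
  data: DataFrame,
  seed: Long,
  trainPortion: Double = 0.8,
  syntheticCol: String = "synthetic"
): Array[DataFrame] = {
  val weights = Array(trainPortion, 1.0 - trainPortion)

  // Separate the real rows from the synthetic (up-sampled) rows.
  val real = data.filter(!col(syntheticCol))
  val synthetic = data.filter(col(syntheticCol))

  // Split both data sets with the same seed.
  val Array(realTrain, realTest) = real.randomSplit(weights, seed)
  val Array(syntheticTrain, _) = synthetic.randomSplit(weights, seed)

  // Train = real train + synthetic train; test = real test only.
  Array(realTrain.union(syntheticTrain), realTest)
}
```

Dropping the synthetic test portion keeps generated rows out of evaluation, so test metrics reflect only observed data.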
Method for stratifying the train/test split around the unique values of the label column. This mode is recommended for classification when the label classes are relatively balanced and uniformly distributed. If there is significant skew, under-sampling or over-sampling is highly recommended instead.
DataFrame that is the input to the train/test split
Random seed for splitting the data into train/test
An Array of DataFrames: Array[<trainData>, <testData>]
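One way to picture stratification around the label values is to split each label's rows separately and then recombine the pieces. The sketch below assumes a non-empty DataFrame with non-null labels; the helper name `stratifiedSplit` and the `labelCol` and `trainPortion` parameters are hypothetical and are not taken from the documented method.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper for illustration only.
// Assumes `data` is non-empty and `labelCol` contains no nulls.
def stratifiedSplit(
  data: DataFrame,
  seed: Long,
  labelCol: String = "label",
  trainPortion: Double = 0.8
): Array[DataFrame] = {
  val weights = Array(trainPortion, 1.0 - trainPortion)

  // Collect the unique label values to the driver.
  val labels = data.select(labelCol).distinct().collect().map(_.get(0))

  // Split the rows of each label independently with the same seed.
  val perLabel = labels.map { value =>
    val Array(train, test) =
      data.filter(col(labelCol) === value).randomSplit(weights, seed)
    (train, test)
  }

  // Recombine the per-label pieces into one train set and one test set.
  val trainData = perLabel.map(_._1).reduce(_ union _)
  val testData = perLabel.map(_._2).reduce(_ union _)
  Array(trainData, testData)
}
```

Because every label is split with the same proportions, train and test both receive roughly `trainPortion` and `1 - trainPortion` of each class. A heavily skewed label can still leave a rare class with very few test rows.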