Helper method for calculating the Information Gain of a feature field
DataFrame that contains at least the fieldToTest and the Label Column
The field to calculate Information Gain for
Total number of records in the data set
The Information Gain of the field
0.7.0
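As an illustration of the quantity this helper returns, the following is a minimal sketch in plain Scala collections rather than a DataFrame. It assumes the standard definition of Information Gain: the entropy of the label minus the record-weighted entropy of the label within each feature value. The names `InfoGainSketch`, `entropy`, and `informationGain` are hypothetical and not part of the documented API, and the total record count is derived from the input here instead of being passed in.

```scala
object InfoGainSketch {
  // Shannon entropy (base 2) of a sequence of class labels.
  def entropy[A](labels: Seq[A]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  // Information Gain of splitting (featureValue, label) rows on the feature value.
  // Assumption: gain = parent entropy - weighted sum of per-group entropies.
  def informationGain[F, L](rows: Seq[(F, L)]): Double = {
    val total = rows.size.toDouble
    val parentEntropy = entropy(rows.map(_._2))
    val weightedChildEntropy = rows.groupBy(_._1).values.map { group =>
      (group.size / total) * entropy(group.map(_._2))
    }.sum
    parentEntropy - weightedChildEntropy
  }
}
```

A feature that separates the labels perfectly yields a gain equal to the label entropy, while a feature that is independent of the label yields a gain of zero.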
Method for calculating the variance of a categorical (nominal) field. The data is split on each nominal value of the field (a first-layer split), and the variance of the label column within each resulting group is used to determine the minimum variance achievable in the label column.
DataFrame that contains the label column and the field under test for minimum by-group variance
The label column of the data set
The feature column to test for variance reduction
The minimum split variance of the aggregated label data by nominal group of the fieldToTest
0.7.0
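The by-group calculation can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the library's implementation: it uses population variance (the documented method may use sample variance), operates on in-memory pairs instead of a DataFrame, and the names `NominalVarianceSketch`, `variance`, and `minGroupVariance` are hypothetical.

```scala
object NominalVarianceSketch {
  // Population variance of a sequence; 0.0 for an empty sequence.
  def variance(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0
    else {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

  // Groups (nominalValue, label) rows by the nominal value and returns the
  // smallest label variance found in any group. Requires at least one row.
  def minGroupVariance[F](rows: Seq[(F, Double)]): Double = {
    require(rows.nonEmpty, "rows must not be empty")
    rows.groupBy(_._1).values.map(group => variance(group.map(_._2))).min
  }
}
```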
Helper method for handling the Information Gain calculation for a classification data set when dealing with continuous (numeric) feature elements. The continuous feature is split into the number of buckets configured in _continuousDiscretizerBucketCount, which is set by overriding .setContinuousDiscretizerBucketCount(<Int>)
DataFrame that contains the feature to test and the label column
The feature field that is under test for entropy evaluation
Total number of elements in the data set.
Information Gain associated with the feature field based on splits that could occur.
0.7.0
Method for calculating the variance of a continuous field for variance reduction in the label column based on bucketized grouping of the field under test.
DataFrame that contains the label column and the field under test of continuous numeric type
The label column of the data set
The field to test (continuous numeric) that needs to be evaluated
The number of quantized buckets into which the field under test is grouped in order to simulate where a decision split would occur.
The minimum split variance of each of the buckets that have been created
0.7.0
Method for calculating the percentage change in the score metric between a parent feature and an interaction feature, so that scores can be compared on a normalized basis.
Score of a parent feature
Score of an interaction feature
The percentage change
0.6.2
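A percentage change between two scores can be sketched as below. The sign convention (positive when the interaction score exceeds the parent score), the use of the parent's absolute value as the denominator, and the `None` result for a zero parent score are all assumptions made for illustration; the source does not state how the documented method handles these cases. `PercentChangeSketch` and `percentageChange` are hypothetical names.

```scala
object PercentChangeSketch {
  // Percentage change from the parent score to the interaction score.
  // Returns None when the parent score is zero, because the ratio is undefined.
  def percentageChange(parentScore: Double, interactionScore: Double): Option[Double] =
    if (parentScore == 0.0) None
    else Some((interactionScore - parentScore) / math.abs(parentScore) * 100.0)
}

// PercentChangeSketch.percentageChange(0.5, 0.75)  // Some(50.0)
```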
Helper method for converting a continuous feature to a discrete bucketed value so that entropy can be calculated effectively for the feature.
DataFrame containing at least the field to test in continuous numeric format
The name of the field under conversion
A DataFrame with the continuous value converted to a quantized bucket membership value.
0.7.0
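To show what converting a continuous value to a bucket membership value means, here is a minimal equal-frequency bucketing sketch in plain Scala. It is not the documented implementation, which works on a DataFrame; the split-point rule used here (taking the sorted value at each i/bucketCount position) is one simple choice among several, and `BucketSketch`, `splitPoints`, and `bucketOf` are hypothetical names.

```scala
object BucketSketch {
  // Approximate quantile split points for the requested number of buckets.
  // Duplicate split points are collapsed, so heavily tied data may produce
  // fewer buckets than requested.
  def splitPoints(values: Seq[Double], bucketCount: Int): Seq[Double] = {
    require(bucketCount > 0, "bucketCount must be positive")
    require(values.nonEmpty, "values must not be empty")
    val sorted = values.sorted
    (1 until bucketCount)
      .map(i => sorted((i.toLong * sorted.size / bucketCount).toInt))
      .distinct
  }

  // Bucket index of a value: the number of split points at or below it.
  def bucketOf(value: Double, splits: Seq[Double]): Int =
    splits.count(value >= _)
}

// val splits = BucketSketch.splitPoints((1 to 8).map(_.toDouble), 4)
// splits                              // Vector(3.0, 5.0, 7.0)
// BucketSketch.bucketOf(4.0, splits)  // 1
```

Once each record carries a bucket index, the continuous field can be treated like a nominal field for entropy or by-group variance calculations.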
Helper method for extracting field names and ensuring that the feature vector is present
Schema of the DataFrame undergoing feature interaction
The name of the features column
Array of column names of the DataFrame
0.6.2
Method for generating a collection of Interaction Candidates to be tested and applied to the feature set if the tests for inclusion pass.
List of the columns that make up the feature vector
Array of InteractionPayload values.
0.6.2
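One straightforward way to enumerate interaction candidates is to take every unordered pair of feature columns, as sketched below. The `CandidatePair` case class and the `i_<left>_<right>` output naming are hypothetical stand-ins; the actual fields of InteractionPayload are not described in this section, and the documented method may select pairs differently.

```scala
object CandidateSketch {
  // Hypothetical stand-in for the payload: two parent columns and an output name.
  final case class CandidatePair(left: String, right: String, outputName: String)

  // Every unordered pair of distinct feature columns becomes one candidate.
  def candidates(featureColumns: Seq[String]): Seq[CandidatePair] =
    featureColumns.distinct.combinations(2).map { pair =>
      val left = pair(0)
      val right = pair(1)
      CandidatePair(left, right, s"i_${left}_${right}")
    }.toSeq
}

// CandidateSketch.candidates(Seq("a", "b", "c")).map(_.outputName)
// yields i_a_b, i_a_c, i_b_c
```

For n feature columns this produces n(n-1)/2 candidates, which is why each candidate is then tested before being added to the feature set.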
Method for converting nominal interaction fields to a new StringIndexed value to preserve information type and eliminate the possibility of data distribution skew
FeatureInteractionCollection of the source parents and their interacted children fields
NominalDataCollection payload containing a DataFrame that has new StringIndexed fields for nominal interactions, along with the fields that must be included in the final feature vector
0.6.2
Method for generating a product interaction between feature columns
A DataFrame to add a field for an interaction between two columns
InteractionPayload information about the two parent columns and the name of the new interaction column to be created.
A modified DataFrame with the new column.
0.6.2
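A product interaction is the row-wise product of the two parent columns, stored under the new column name. The sketch below models each row as a `Map[String, Double]` instead of a DataFrame row, purely for illustration; `ProductSketch` and `withProduct` are hypothetical names, and the column names are passed directly rather than through an InteractionPayload.

```scala
object ProductSketch {
  // Adds outputName = left * right to every row.
  // Throws NoSuchElementException if a parent column is missing from a row.
  def withProduct(
      rows: Seq[Map[String, Double]],
      left: String,
      right: String,
      outputName: String
  ): Seq[Map[String, Double]] =
    rows.map(row => row + (outputName -> row(left) * row(right)))
}

// ProductSketch.withProduct(Seq(Map("a" -> 2.0, "b" -> 3.0)), "a", "b", "i_a_b")
// yields one row in which i_a_b is 6.0
```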
Helper method for recreating the feature vector after interactions have been completed on individual columns
DataFrame containing the interacted fields with the original feature vector dropped
Fields making up the original vector before interaction
Interaction candidate fields that have been selected to be included in the final feature vector
Name of the feature vector field
DataFrame with a new feature vector.
0.6.2