handleInvalid: Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). In the multiple-columns case, invalid handling is applied to all columns: 'error' throws an error if invalid values are found in any column, 'skip' drops rows with invalid values in any column, and so on. Default: "error"
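As a rough illustration of setting this option (a sketch only; the column names below are assumptions, not part of the API):

    import org.apache.spark.ml.feature.QuantileDiscretizer

    // Column names "hour" and "binned" are placeholders for this sketch.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("binned")
      .setNumBuckets(4)
      .setHandleInvalid("keep")  // or "error" (default) / "skip"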
numBuckets: Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2. See also handleInvalid, which can optionally create an additional bucket for NaN values. Default: 2
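A minimal single-column sketch, assuming a DataFrame df with a numeric column "hour" (both names are illustrative):

    import org.apache.spark.ml.feature.QuantileDiscretizer

    // Group the continuous "hour" values into 3 quantile-based buckets.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("hourBucket")
      .setNumBuckets(3)

    // fit() estimates the split points; transform() assigns bucket indices.
    val bucketed = discretizer.fit(df).transform(df)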
numBucketsArray: Array of the number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2. See also handleInvalid, which can optionally create an additional bucket for NaN values.
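A multi-column sketch under the same assumptions (the columns "height" and "weight" are hypothetical), giving each input column its own bucket count:

    import org.apache.spark.ml.feature.QuantileDiscretizer

    val discretizer = new QuantileDiscretizer()
      .setInputCols(Array("height", "weight"))
      .setOutputCols(Array("heightBucket", "weightBucket"))
      .setNumBucketsArray(Array(5, 10))  // 5 buckets for height, 10 for weight

    val bucketed = discretizer.fit(df).transform(df)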
relativeError: Relative error (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a description). Must be in the range [0, 1]. In the multiple-columns case, the relative error is applied to all columns. Default: 0.001
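The same precision/cost trade-off can be observed by calling approxQuantile directly; the column name and probabilities below are illustrative:

    // Approximate quartiles of a hypothetical "price" column.
    // A smaller relativeError is more precise but more expensive; 0.0 is exact.
    val quartiles: Array[Double] =
      df.stat.approxQuantile("price", Array(0.25, 0.5, 0.75), 0.001)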
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or, if the number of buckets should be the same across all columns, numBuckets can be set as a convenience.

NaN handling: null and NaN values are ignored in the column during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets [0-3], but NaNs will be counted in a special bucket [4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
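Tying these pieces together, a hedged end-to-end sketch (the data and column names are assumptions) showing the extra NaN bucket produced by handleInvalid = "keep":

    import org.apache.spark.ml.feature.{Bucketizer, QuantileDiscretizer}

    // Assumes a SparkSession in scope as `spark`.
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0, Double.NaN).toDF("feature")

    val discretizer = new QuantileDiscretizer()
      .setInputCol("feature")
      .setOutputCol("bucket")
      .setNumBuckets(4)
      .setHandleInvalid("keep")  // NaN rows go to the extra bucket, index 4

    // Fitting yields a Bucketizer holding the learned split points.
    val model: Bucketizer = discretizer.fit(df)
    model.transform(df).show()
    // Non-NaN rows receive bucket indices 0-3; the NaN row receives 4.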