Utility for automatically assembling columns into a vector of features.
Params for automatic feature-vector assembler.
Simple evaluator based on the mllib.BinaryClassificationMetrics.
Estimator used to select the item sample rate needed to achieve the desired size of the resulting sample. Takes into account the source dataset size and the number of lists valid for ranking (lists with samples of different rank).
Model applied as a transformer, but the resulting data set is not deterministic (each pass produces different results). Results must not be cached.
Follows ideas from the Combined Regression and Ranking paper (http://www.decom.ufop.br/menotti/rp122/sem/sem2-alex-art.pdf)
Can model a pair-wise ranking task (sample pairs, subtract features and label 1/0), can model point-wise regression, or can combine both by choosing whether to sample a single item or a pair.
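The sampling choice described above can be sketched in plain Python. This is only an illustration under assumptions: the function name `sample_crr`, the `alpha` parameter (probability of drawing a single item) and the rejection loop are hypothetical and are not taken from the library.

```python
import random

def sample_crr(items, alpha, rng):
    """Draw one training example from a list of (features, label) items.

    With probability alpha a single item is returned unchanged (point-wise
    regression); otherwise a pair with different labels is drawn, the
    features are subtracted and the label becomes 1/0 (pair-wise ranking).
    """
    if rng.random() < alpha:
        features, label = rng.choice(items)
        return list(features), float(label)
    if len({label for _, label in items}) < 2:
        raise ValueError("pair sampling needs at least two different labels")
    # Rejection-sample until the two items have different labels.
    while True:
        (xa, ya), (xb, yb) = rng.sample(items, 2)
        if ya != yb:
            break
    diff = [a - b for a, b in zip(xa, xb)]
    return diff, 1.0 if ya > yb else 0.0
```

Setting `alpha` to 1.0 gives pure regression and 0.0 gives pure ranking; values in between mix the two objectives.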
Used to extract a set of columns from the underlying data frame based on names and/or SQL expressions.
Base class for combined model holding a named map of nested models.
Used to train and evaluate model in folds.
Created by dmitriybugaichenko on 10.11.16.
Implementation of a distributed version of Stochastic Variance Reduced Gradient Descent. The idea is taken from https://arxiv.org/abs/1512.01708: the input dataset is partitioned and workers perform descent simultaneously, each updating its own copy of the weights at each random point (following the SGD scheme). At the end of an epoch, data from all workers is collected and aggregated. Variance reduction is achieved by keeping the average gradient from previous iterations and evaluating the gradient at one extra point (the average of all weights seen during the previous epoch). The update rule is:
w_new = w_old − η (∇f_i(w_old) − ∇f_i(w_avg) + g)
TODO: Other variance reduction and step size tuning techniques might be applied.
Requires AttributeGroup metadata for both labels and features, supports elastic net regularization and multiple parallel labels training (similar to MatrixLBFGS).
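A single-process sketch of the update rule w_new = w_old − η (∇f_i(w_old) − ∇f_i(w_avg) + g) may help to read it. Several things here are assumptions rather than the library's behaviour: the squared loss, the function names `grad_i` and `svrg`, taking g as the full-data average gradient at w_avg, and the absence of partitioning, aggregation across workers and regularization.

```python
import random

def grad_i(w, x, y):
    """Gradient of the squared loss 0.5 * (w.x - y)^2 for one sample."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]

def svrg(data, w0, eta=0.05, epochs=10, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    w_avg = list(w0)
    for _ in range(epochs):
        # g: average gradient over all samples, evaluated at w_avg.
        g = [0.0] * len(w)
        for x, y in data:
            for j, gj in enumerate(grad_i(w_avg, x, y)):
                g[j] += gj / len(data)
        w_sum = [0.0] * len(w)
        for _ in range(len(data)):
            x, y = rng.choice(data)
            g_old = grad_i(w, x, y)      # gradient at the current weights
            g_ref = grad_i(w_avg, x, y)  # gradient at the extra point
            w = [wj - eta * (a - b + c)
                 for wj, a, b, c in zip(w, g_old, g_ref, g)]
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
        # The next extra point is the average of the weights seen in this epoch.
        w_avg = [s / len(data) for s in w_sum]
    return w
```

Note that the correction term ∇f_i(w_avg) − g has zero mean over the samples, so the update stays an unbiased gradient step while its variance shrinks as w approaches w_avg.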
Helper class for training single-label models.
Base class for evaluators. It is expected that evaluators group data into some groups and then evaluate metrics for each of the groups.
Created by dmitriybugaichenko on 30.12.15.
Utility used for estimating extended statistics for a set of vectors. In addition to mean, deviation and count, estimates percentiles.
Created by dmitriybugaichenko on 29.11.16.
This utility is used to perform external feature selection based on multi-fold evaluation and computing weights confidence intervals based on the weights from each fold.
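One way to picture the selection step is the sketch below. It is an assumption-heavy illustration, not the library's method: the function name `select_features`, the normal-approximation interval (mean ± z · std / √n) and the rule "keep a feature when its interval excludes zero" are all hypothetical.

```python
import statistics

def select_features(fold_weights, z=1.96):
    """fold_weights holds one weight vector per fold (at least two folds).

    Returns the indices of features whose confidence interval, estimated
    from the per-fold weights, does not contain zero.
    """
    n = len(fold_weights)
    selected = []
    for j in range(len(fold_weights[0])):
        column = [weights[j] for weights in fold_weights]
        mean = statistics.mean(column)
        half_width = z * statistics.stdev(column) / n ** 0.5
        if mean - half_width > 0 or mean + half_width < 0:
            selected.append(j)
    return selected
```

With three folds and weights such as 1.0, 1.2, 0.8 a feature would be kept, while weights scattered around zero such as 0.1, -0.1, 0.0 would be dropped.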
Utility used to split training into forks (per type, per class, per fold).
Type of model produced by the nested estimator.
Type of the resulting model. Does not have to be the same as ModelIn.
Specific case of forked estimator which does not change the type of the underlying model.
Used for evaluators with batch support
For estimators capable of caching training data.
Adds parameter with column for instance classes.
Adds parameter with class weights (defaults to 1.0)
For vector assemblers used to provide better naming for metadata attributes.
Parameters for specifying which columns to include or exclude.
Created by dmitriybugaichenko on 30.11.16.
Supplementary trait used for optimization (moving transformation out of the execution plan into a UDF)
Block with information regarding features significance stat, produced during the features selection stage.
Adds parameters for folding - number of folds and name of column with fold number.
For transformers performing grouping by certain columns.
Adds parameter with the name of test/train split column
Metrics block is added by the evaluators.
Created by dmitriybugaichenko on 19.11.16.
Utility for simplifying BLAS access.
Used to indicate that the last weight should not be considered part of the regularization (typically because it is the intercept)
For transformers performing sorting by certain columns.
Adds parameter with column for instance type.
Block produced by models with a concept of feature weights (e.g. linear models).
Adds extra column to features vector with a fixed value of 1. Can be used with any model.
:: Experimental :: Isotonic regression.
Currently implemented using parallelized pool adjacent violators algorithm. Only univariate (single feature) algorithm supported.
Uses org.apache.spark.mllib.regression.IsotonicRegression.
ODKL Patch: Used to inject our patched mllib implementation.
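For readers unfamiliar with pool adjacent violators, a sequential sketch follows. It is not the parallelized Spark implementation; the function name `pava` is hypothetical, and the input is assumed to be already sorted by the single feature.

```python
def pava(y, weights=None):
    """Weighted least-squares non-decreasing fit via pool adjacent violators."""
    if weights is None:
        weights = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for value, weight in zip(y, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

Each violating pair is replaced by its weighted mean, so the labels 1, 3, 2, 4 become 1, 2.5, 2.5, 4.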
ml.odkl is an extension to the Spark ML package with the intention to:
1. Provide a modular structure with shared and tested common code.
2. Add the ability to create train-only transformations (for better prediction performance).
3. Unify extra information generation by the model fitters.
4. Support combined models with an option for parallel training.
This particular file contains a utility for serializing complex parameters using jackson (automatically handles a few types which cannot be handled by json4s)
Combination model which evaluates ALL nested models and combines results based on linear weights.
Single-label linear regression with DSVRGD
Multi-label linear regression with DSVRGD
Multi-label logistic regression with DSVRGD
Utility used to bridge default spark ML models into our advanced pipelines. TODO: Provide summary extractors
Created by dmitriybugaichenko on 24.03.16.
Implementation of multi-class logistic regression training. In contrast to the traditional notion of multi-class logistic regression, this trainer produces one regression per class. Internally treats all classes simultaneously using matrix-matrix multiplication. Allows for L1-regularization (switches LBFGS to OWL-QN for that). Regularization strength is defined as a fraction of the maximal feasible regularization (deduced using http://jmlr.org/papers/volume8/koh07a/koh07a.pdf).
Created by alexander.lutsenko on 20.09.16.
One of the main extensions to the base concept of a model: each model might return a summary represented by a named collection of dataframes.
If certain stages used during training can be avoided while predicting, some changes need to be propagated to the model (e.g. unscale weights or remove intercept). Also useful for extending summary blocks (e.g. during evaluation/cross-validation).
This interface defines the logic of model transformation.
Model which has a summary. Includes support for reading and writing summary blocks.
Combination model which evaluates ALL nested models and returns a vector.
Base class for models, evaluated per each class.
Utility for converting columns with a string or a set of strings into a vector of 0/1 with the cardinality equal to the number of unique string values used.
Model produced by the multinominal extractor. Knows the predefined set of values and maps strings/sets of strings to vectors of 0/1 with cardinality equal to the number of known values.
Parameters for multinominal feature extractor.
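The mapping from strings to 0/1 vectors can be sketched as below. The function names `fit_vocabulary` and `to_indicator`, the sorted ordering of values and the choice to ignore unknown values are assumptions made for illustration only.

```python
def fit_vocabulary(rows):
    """Collect the unique strings; each row is a string or a collection of strings."""
    values = set()
    for row in rows:
        values.update([row] if isinstance(row, str) else row)
    return sorted(values)

def to_indicator(row, vocabulary):
    """Map a string or a collection of strings to a 0/1 vector over the vocabulary."""
    present = {row} if isinstance(row, str) else set(row)
    # Values outside the vocabulary are silently ignored in this sketch.
    return [1.0 if value in present else 0.0 for value in vocabulary]
```

The fitted vocabulary plays the role of the predefined set of values held by the model, and the vector length equals the vocabulary size.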
Estimates mean values ignoring NaN's
Model used to replace values with pre-computed defaults before training/predicting.
Set of parameters for the replacer
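A minimal sketch of the two steps, estimating a mean that skips NaNs and then substituting a pre-computed default, is given here. The function names are hypothetical, and using the NaN-ignoring mean as the default is an assumption about how the pieces fit together.

```python
import math

def nan_mean(values):
    """Mean of the values that are not NaN; NaN if nothing is left."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")

def replace_nan(values, default):
    """Replace every NaN with the pre-computed default."""
    return [default if math.isnan(v) else v for v in values]
```

The default is computed once on the training data and then reused unchanged at prediction time.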
Assuming there is metadata attached to an integer field, can be used to replace ints with the corresponding attribute names. Used, for example, in the validation pipeline to avoid attaching large strings to the validation results (e.g. score/label descriptions) before the very end.
Utility used to replace null values with defaults (zero or false).
:: Experimental :: A feature transformer that merges multiple columns into a vector column.
This class is a copy of VectorAssembler with two enhancements: support for nulls (replaced with NaNs) and pattern matching extracted from the inner loop.
Evaluator used to compute metrics for predictions grouped by a certain criterion (typically by a user id). Materializes all the predictions for a group in memory and calculates multiple metrics. Can be used only for fine-grained grouping criteria. Supports multi-label and multi-score cross evaluation (computes metrics for each label-score combination if provided with vectors instead of scalars).
Settings for partitioning, except the number of partitions. Is extended by static and dynamic partitioners.
This is a specific implementation of the scaler for linear models. Uses the ability to propagate scaling to the weights to avoid overhead when predicting.
Scaler parameters.
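Propagating scaling to the weights can be illustrated with a little arithmetic. The sketch assumes standardization of the form (x − mean) / std and a linear model with an intercept; the function names `unscale` and `predict` are hypothetical.

```python
def unscale(weights, intercept, means, stds):
    """Map a linear model fitted on (x - mean) / std back to raw features."""
    raw_weights = [w / s for w, s in zip(weights, stds)]
    raw_intercept = intercept - sum(w * m for w, m in zip(raw_weights, means))
    return raw_weights, raw_intercept

def predict(weights, intercept, x):
    """Linear prediction: intercept plus the dot product of weights and features."""
    return intercept + sum(w * v for w, v in zip(weights, x))
```

After this transformation the model can be applied directly to unscaled features, so no scaling stage is needed when predicting.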
Selecting model applies exactly one model based on instance type and returns its result.
Serializable wrapper over the TDigest
Simple utility used to apply SQL WHERE filter
Estimator which produces a model with a summary. Used to simplify chaining.
Created by eugeny.malyutin on 20.07.17.
Performs TopK-UDAF logic without annoying schema pack-unpack
- raw type (Long for LongTyped-columns) for columnToOrderBy. An Ordering for this type should be defined.
Created by eugeny.malyutin on 24.06.16.
UDAF designed to extract top-numRows rows by columnValue. Used to replace Hive window functions, which are too slow when the whole data frame falls into one aggregation cell. The result of aggFun is packed in a column "arrData" and needs to be expanded with org.apache.spark.sql.functions.explode.
- type of columnToSortBy with implicit ordering for type B
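The core of a top-K aggregation can be sketched with a bounded heap per group. This is plain Python rather than a Spark UDAF, and the function name `top_k_per_group`, the dict-based rows and the choice of keeping the largest values are assumptions for illustration.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, key, order_by, k):
    """Keep the k rows with the largest order_by value for every key."""
    heaps = defaultdict(list)
    for index, row in enumerate(rows):
        # The index breaks ties, so the row dicts themselves are never compared.
        entry = (row[order_by], index, row)
        heap = heaps[row[key]]
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The smallest retained value sits at the heap root; swap it out.
            heapq.heapreplace(heap, entry)
    return {group: [entry[2] for entry in sorted(heap, reverse=True)]
            for group, heap in heaps.items()}
```

Only k rows per group are held at any time, which avoids sorting the whole group the way a window function would.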
If certain stages used during training can be avoided while predicting, some changes need to be propagated to the model (e.g. unscale weights or remove intercept). Also useful for extending summary blocks (e.g. during evaluation/cross-validation).
This class is used as a typical pipeline stage while training (fits and applies transformer, then calls the nested estimator), but it automatically eliminates itself from the resulting model by applying model transformer.
Utility used to extract nested values from vectors into dedicated columns. Requires vector metadata and extracts names from there. Typically used as a final stage before results visualization.
Utility used to collect detailed statistics for vectors grouped by certain keys. In addition to the common statistics (mean, variance, min/max, norms), calculates percentiles as configured. The resulting dataframe contains only the key columns and the stat columns.
Utility used for reporting single indexed feature weight.
Adds read logic
Adds read ability.
Helper used to inject common task support with thread count limit into all forked estimators.
Created by dmitriybugaichenko on 19.11.16.
Utility allowing access to certain hidden methods of Spark's mllib linalg
Helper for reading and writing models in a typed way.
Adds read logic
Adds read ability
Adds support for reading.
Adds read ability.
Utility for automatically assembling columns into a vector of features. Takes either all the columns, or a subset of them. For boolean, numeric and vector columns uses default vectorising logic, for string and collection columns applies nominalizers.