odkl

Type Members

class AutoAssembler extends Estimator[PipelineModel] with AutoAssemblerParams with DefaultParamsWritable

Utility for automatically assembling columns into a vector of features.
Utility for automatically assembling columns into a vector of features. Takes either all the columns, or a subset of them. For boolean, numeric and vector columns uses default vectorising logic, for string and collection columns applies nominalizers.
trait AutoAssemblerParams extends HasColumnsSets with HasOutputCol with HasColumnAttributeMap

Params for automatic feature-vector assembler.
class BinaryClassificationEvaluator extends Evaluator[BinaryClassificationEvaluator]

Simple evaluator based on the mllib.BinaryClassificationMetrics.
class CRRSamplerEstimator extends Estimator[CRRSamplerModel] with DefaultParamsWritable with CRRSamplerParams

Estimator is used to select the proper item sample rate to achive desired size of the resulting sample.
Estimator is used to select the proper item sample rate to achive desired size of the resulting sample. Takes into consideration the source dataset size and the amount of valid for ranking lists (list with samples of different rank).
class CRRSamplerModel extends Model[CRRSamplerModel] with DefaultParamsWritable with CRRSamplerParams with HasNetlibBlas

Model applied as a transformer, but the resulting data set is not determenistic (each pass produces different results).
Model applied as a transformer, but the resulting data set is not determenistic (each pass produces different results). Results must not be cached.
trait CRRSamplerParams extends HasInputCol with HasGroupByColumns with HasLabelCol

Follows ideas from Combined Regression and Ranking paper (http://www.decom.ufop.br/menotti/rp122/sem/sem2-alex-art.pdf)
Follows ideas from Combined Regression and Ranking paper (http://www.decom.ufop.br/menotti/rp122/sem/sem2-alex-art.pdf)
Can model pair-wise ranking task (sample pairs, substract features and label 1/0), can model point-wise regression, or can combine both by choosing whenever to sample single item or a pair.
case class ClassPathExpression(filter: String, replacements: Array[(String, String)]) extends Product with Serializable
class ColumnsExtractor extends Transformer with DefaultParamsWritable

Used to extract a set of columns from the underlying data frame based on names and/or SQL expresions.
class CombinedLinearModelUnwrappedFeatureSelector[M <: LinearModel[M], C <: CombinedModel[M, C]] extends GenericFeatureSelector[CombinedLinearModelUnwrappedFeatureSelector[M, C]] with ModelTransformer[C, CombinedLinearModelUnwrappedFeatureSelector[M, C]]
abstract class CombinedModel[M <: ModelWithSummary[M], C <: CombinedModel[M, C]] extends Model[C] with ModelWithSummary[C] with HasDescriminantColumn with HasDirectTransformOption with HasPredictionCol with ForkedModelParams

Base class for combined model holding a named map of nested models.
class CompositScaleEstimator[M <: LinearModel[M], C <: CombinedModel[M, C]] extends ScalerEstimator[C]
class CrossValidator[M <: ModelWithSummary[M]] extends ForkedEstimatorSameType[M, Int] with HasIsTestCol with HasFolds with MetricsExtractor with HasMetricsBlock

Used to train and evaluate model in folds.
abstract class DSVRGD[M <: ModelWithSummary[M]] extends Estimator[M] with SummarizableEstimator[M] with HasPredictionCol with HasFeaturesCol with HasLabelCol with HasRegParam with HasElasticNetParam with HasNetlibBlas with HasMaxIter with HasTol with HasCacheTrainData

Created by dmitriybugaichenko on 10.11.16.
Created by dmitriybugaichenko on 10.11.16.
Implementation of a distributed version of Stochastic Variance Reduced Gradient Descent. The idea is taken from https://arxiv.org/abs/1512.01708 - input dataset is partitioned and workers performs descent simultaneously updating own copy of the weights at each random point (following SGD schema). At the end of epoche data from all workers are collected and aggregated. Variance reduction is achieved by keeping average gradient from previous iterations and evaluating gradient at one extra point (average of all weights seen during previous epoche). The update rule is:
w_new = w_old − η (∇f_i(w_old) − ∇f_i(w_avg) + g)
TODO: Other variance reduction and step size tuning techniques might be applied.
Requires AttributeGroup metadata for both labels and features, supports elastic net regularization and multiple parallel labels training (similar to MatrixLBFGS).
abstract class DeVectorizedDSVRGD[M <: ModelWithSummary[M]] extends DSVRGD[M]

Helper class for training single-label models.
abstract class Evaluator[S <: Evaluator[S]] extends Transformer with HasLabelCol with HasPredictionCol

Base class for evaluators.
Base class for evaluators. It is expected that evaluators group data into some groups and then evaluate metrics for each of the groups.
class ExponentialVectorDiscountTransformer extends Transformer with DefaultParamsWritable with HasGroupByColumns

Created by eugeny.malyutin on 18.02.18.
Created by eugeny.malyutin on 18.02.18.
Transformer to implement exponential weighted discounting for vectors; Expects dataFrame with structure ( $"groupByColumns", $"timestamp", $"vector")
Return dataFrame ( $"groupByColumns", $"timestamp", $"vector)
$"timestamp" - last seen action timestamp for this $"identificator" $"vector" - summed actions. vector(0) is reserved for "aggregation" timestamp
class ExtendedMultivariateOnlineSummarizer extends MultivariateOnlineSummarizer with Serializable with Logging

Created by dmitriybugaichenko on 30.12.15.
Created by dmitriybugaichenko on 30.12.15.
Utility used for estimating extended stat for the set of vectors. In addition to mean, deviation and count estimates percentiles
class FoldedFeaturesStatsAggregator[SelectingModel <: ModelWithSummary[SelectingModel] with HasWeights] extends Transformer with HasFeaturesSignificance with HasWeights with HasFeaturesCol

Created by dmitriybugaichenko on 29.11.16.
Created by dmitriybugaichenko on 29.11.16.
This utility is used to perform external feature selection based on multi-fold evaluation and computing weights confidence intervals based on the weights from each fold.
trait ForkSource[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType, ModelOut <: ModelWithSummary[ModelOut]] extends AnyRef
abstract class ForkedEstimator[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType, ModelOut <: ModelWithSummary[ModelOut]] extends Estimator[ModelOut] with SummarizableEstimator[ModelOut] with ForkedModelParams with HasNumThreads

Utility used to split training into forks (per type, per class, per fold).
Utility used to split training into forks (per type, per class, per fold).
ModelIn
Type of model produced by the nested estimator.
ModelOut
Type of the resulting model. Does not have to be the same as ModelIn.
abstract class ForkedEstimatorSameType[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType] extends ForkedEstimator[ModelIn, ForeKeyType, ModelIn]

Specific case of forked estimator which does not change the type of the underlying model.
trait ForkedModelParams extends AnyRef

class ForkedSparkEstimator[M <: ModelWithSummary[M] with MLWritable, E <: SummarizableEstimator[M] with MLWritable] extends Estimator[M] with SummarizableEstimator[M] with MLWritable

This utility is used to support evaluation of the part of pipeline in a separate Spark app.

This utility is used to support evaluation of the part of pipeline in a separate Spark app. There are at least three identified use cases: 1. Spark App with different settings for ETL and ML 2. Support for larger fork factor in segmented hyperopt (scale driver if it became a bootleneck) 3. Support for parallel XGBoost training (resolves internal conflict on the Rabbit part)

Simple example with linear SGD and Zeppelin in yarn-client mode:

// This estimator will start new Spark app from an app running in yarn-cluster mode
val secondLevel = new ForkedSparkEstimator[LinearRegressionModel, LinearRegressionSGD](new LinearRegressionSGD().setCacheTrainData(true))
            .setTempPath("tmp/forkedModels")
            // Match only files transfered with the app, re-point to the hdfs for faster start
            .withClassPathPropagation(".*__spark_libs__.*", ".+/" -> "hdfs://my-hadoop-nn/spark/lib/")
            // These files are localy available on all nodes
            .withClassPathPropagation("/opt/.*", "^/" -> "local://")
            // For convinience propagate configuration when working in non-interactive mode
            .setPropagateConfig(true)
            .setConfOverrides(
                // Enable log aggregation and disable dynamic allocation
                "spark.hadoop.yarn.log-aggregation-enable" -> "true",
                "spark.dynamicAllocation.enabled" -> "false",
                // These files might sneeak in when submited from Zeppelin, suppress them
                "spark.yarn.dist.jars" -> "",
                "spark.yarn.dist.files" -> "",
                "spark.yarn.dist.archives" -> ""
                )
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setSubmitArgs(
                "--num-executors", "1")
            .setName("secondLevel")

// This estimator is will start neq Spark app from an interactive Zeppelin session
val firstLevel = new ForkedSparkEstimator[LinearRegressionModel, ForkedSparkEstimator[LinearRegressionModel,LinearRegressionSGD]](secondLevel)
        .setTempPath("tmp/forkedModels")
        // Propagate only odkl-analysiss jars, repoint to HDFS for faster start
        .withClassPathPropagation("/home/.*", ".+/" -> "hdfs://my-hadoop-nn/user/myuser/spark/lib/")
        // Do not propagate hell a lot of Zeppelin configs, rely on spark-defaults
        .setPropagateConfig(false)
        .setConfOverrides(
            // Enable log aggregation and disable dynamic execution
            "spark.hadoop.yarn.log-aggregation-enable" -> "true",
            "spark.dynamicAllocation.enabled" -> "false",
            // This is required to be able to start new spark apps from our app
            "spark.yarn.appMasterEnv.HADOOP_CONF_DIR" -> "/opt/hadoop/etc/hadoop/",
            // This is required to make sure Zeppelin does not full us the we are a Python app
            "spark.yarn.isPython" -> "false"
             )
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setSubmitArgs(
            "--num-executors", "1")
        .setName("firstLevel")


val doubleForkedPipeiline = new Pipeline().setStages(Array(
    new VectorAssembler()
        .setInputCols(Array("first", "second"))
        .setOutputCol("features"),
    firstLevel
    ))

abstract class GenericFeatureSelector[M <: ModelWithSummary[M]] extends Model[M] with ModelWithSummary[M] with HasFeaturesCol
trait HasBatchSize extends Params

Used for evaluators with batch support
trait HasCacheTrainData extends AnyRef

For estimators capable of caching training data.
trait HasClassesCol extends Params

Adds parameter with column for instance classes.
trait HasClassesWeights extends Params

Adds parameter wot classes weights (defaults to 1.0)
trait HasColumnAttributeMap extends AnyRef

For vector assemblers used to provide better naming for metadata attrbiutes.
trait HasColumnsSets extends Params

Parameters for specifying which columns to include or exclude.
trait HasDescriminantColumn extends Params

Created by dmitriybugaichenko on 30.11.16.
trait HasDirectTransformOption extends Transformer

Supplementary train used for optimization (moving transformation out of the execution plan into UDF)
trait HasFeaturesSignificance extends AnyRef

Block with information regarding features significance stat, produced during the features selection stage.
trait HasFolds extends Params

Adds parameters for folding - number of folds and name of column with fold number.
trait HasGroupByColumns extends AnyRef

For transformers performing grouping by a certain columns.
trait HasIsTestCol extends Params

Adds parameter with the name of test/train split column
trait HasLossHistory extends AnyRef
trait HasMetricsBlock extends AnyRef

Metrics block is added by the evaluators.
trait HasNetlibBlas extends AnyRef

Created by dmitriybugaichenko on 19.11.16.
Created by dmitriybugaichenko on 19.11.16.
Utility for simplifying BLAS access.
trait HasNumThreads extends AnyRef
trait HasRegularizeLast extends Params

Used to indicate that last weight should not be considered as a part of regularization (typically if it is the intercept)
trait HasSortByColumns extends AnyRef

For transformers performing sorting by a certain columns.
trait HasTypeCol extends Params

Adds parameter with column for instance type.
trait HasWeights extends AnyRef

Block produced by a models with concept of feature weights (eg.
Block produced by a models with concept of feature weights (eg. linear models).
class Interceptor extends Transformer with HasFeaturesCol with DefaultParamsWritable

Adds extra column to features vector with a fixed value of 1.
Adds extra column to features vector with a fixed value of 1. Can be used with any model.
class IsotonicRegression extends regression.IsotonicRegression

:: Experimental :: Isotonic regression.
:: Experimental :: Isotonic regression.
Currently implemented using parallelized pool adjacent violators algorithm. Only univariate (single feature) algorithm supported.
Uses org.apache.spark.mllib.regression.IsotonicRegression.
ODKL Patch: Used to inject our patched mllib implementation.

Annotations
@Since( "1.5.0" ) @Experimental()
class JacksonParam[T] extends Param[T] with Logging

ml.odkl is an extension to Spark ML package with intention to 1.
ml.odkl is an extension to Spark ML package with intention to 1. Provide a modular structure with shared and tested common code 2. Add ability to create train-only transformation (for better prediction performance) 3. Unify extra information generation by the model fitters 4. Support combined models with option for parallel training.
This particular file contains utility for serializing complex parameters using jackson (handles few types automatically which can not be handled by json4s)
class LinearCombinationModel[N <: ModelWithSummary[N]] extends MultiClassCombinationModelBase[N, LinearCombinationModel[N]]

Combination model which evaluates ALL nested model and combines results based on linear weights.
class LinearDSVRGD extends DeVectorizedDSVRGD[LinearRegressionModel]

Single-label linear regresion with DSVRGD
abstract class LinearEstimator[M <: LinearModel[M], T <: LinearEstimator[M, T]] extends Predictor[Vector, T, M] with SummarizableEstimator[M] with LinearModelParams with HasWeights
class LinearMatrixDSVRGD extends DSVRGD[LinearCombinationModel[LinearRegressionModel]]

Multi-label linear regresion with DSVRGD
abstract class LinearModel[M <: LinearModel[M]] extends PredictionModel[Vector, M] with DirectPredictionModel[Vector, M] with ModelWithSummary[M] with LinearModelParams with HasWeights
trait LinearModelParams extends PredictorParams
class LinearModelSignificantFeatureSelector[ResultModel <: LinearModel[ResultModel]] extends SignificantFeatureSelector[LinearModelUnwrappedFeatureSelector[ResultModel]]
class LinearModelUnwrappedFeatureSelector[M <: LinearModel[M]] extends GenericFeatureSelector[LinearModelUnwrappedFeatureSelector[M]] with ModelTransformer[M, LinearModelUnwrappedFeatureSelector[M]]
class LinearRegressionModel extends LinearModel[LinearRegressionModel]
class LinearRegressionSGD extends LinearRegressor[LinearRegressionModel, GradientDescent, LinearRegressionSGD] with HasRegParam with HasTol with HasMaxIter with HasStepSize
abstract class LinearRegressor[M <: LinearModel[M], O <: Optimizer, T <: LinearRegressor[M, O, T]] extends LinearEstimator[M, T] with DefaultParamsWritable with HasCacheTrainData
class LinearScaleEstimator[M <: LinearModel[M]] extends ScalerEstimator[M]
class LogisticDSVRGD extends DeVectorizedDSVRGD[LogisticRegressionModel]

Multi-label logistic regresion with DSVRGD
class LogisticMatrixDSVRGD extends DSVRGD[LinearCombinationModel[LogisticRegressionModel]]

Multi-label logistic regresion with DSVRGD
class LogisticRegressionLBFSG extends LinearRegressor[LogisticRegressionModel, LogisticRegressionLBFSG, LogisticRegressionLBFSG] with HasRegParam with HasTol with HasMaxIter with Optimizer with HasElasticNetParam with HasRegularizeLast with HasBatchSize
class LogisticRegressionModel extends LinearModel[LogisticRegressionModel]
class MLWrapper[M <: Model[M]] extends Estimator[MLWrapperModel[M]] with SummarizableEstimator[MLWrapperModel[M]]

Utility used to bridge default spark ML models into our advanced pipelines.
Utility used to bridge default spark ML models into our advanced pipelines. TODO: Provide summary extractors
class MLWrapperModel[M <: Model[M]] extends Model[MLWrapperModel[M]] with ModelWithSummary[MLWrapperModel[M]]
class MatrixLBFGS extends Estimator[LinearCombinationModel[LogisticRegressionModel]] with SummarizableEstimator[LinearCombinationModel[LogisticRegressionModel]] with PredictorParams with HasTol with HasMaxIter with HasRegParam with HasRegularizeLast with HasBatchSize with HasNetlibBlas

Created by dmitriybugaichenko on 24.03.16.
Created by dmitriybugaichenko on 24.03.16.
Implementation for multi-class logistic regression training. In contrast to traditional notion of multi-class logistic regression this trainer produces one regression per each class. Internally treats all classes simultaneously using matrix-matrix multplication. Allows for L1-regularization (switches LBFGS to OWL-QN for that). Regularization strength is defined in terms of fraction of maximal feasible regularization (deduced using http://jmlr.org/papers/volume8/koh07a/koh07a.pdf).
class MetadataParam extends Param[Metadata]

Created by alexander.lutsenko on 20.09.16.
class ModelSummary extends Serializable

One of main extensions to the base concept of model - each model might return a summary represented by a named collection of dataframes.
trait ModelTransformer[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Model[T]

In case if we can avoid certain stages used during training while predicting we need to propagate some changes to the model (eg.
In case if we can avoid certain stages used during training while predicting we need to propagate some changes to the model (eg. unscale weights or remove intercept). Also useful for extending summary blocks (eg. during evaluation/cross-validation).
This interface defines the logic of model transformation.
trait ModelWithSummary[M <: ModelWithSummary[M]] extends Model[M] with MLWritable

Model which has a summary.
Model which has a summary. Includes support for reading and wirting summary blocks.
class MultiClassCombinationModel[N <: ModelWithSummary[N]] extends MultiClassCombinationModelBase[N, MultiClassCombinationModel[N]]

Combination model which evaluates ALL nested model and returns vector.
abstract class MultiClassCombinationModelBase[N <: ModelWithSummary[N], M <: MultiClassCombinationModelBase[N, M]] extends CombinedModel[N, M] with HasClassesCol with HasPredictionCol with HasFeaturesCol

Base class for models, evaluated per each class.
class MultinominalExtractor extends Estimator[MultinominalExtractorModel] with MultinominalExtractorParams with DefaultParamsWritable

Utility for converting columns with string or a set of stings into a vector of 0/1 with the cardinality equal to the number of unique string values used.
class MultinominalExtractorModel extends Model[MultinominalExtractorModel] with MultinominalExtractorParams with DefaultParamsWritable

Model produced by the multinominal extractor.
Model produced by the multinominal extractor. Knows the predefined set of values and maps strings/set of strings to vectors of 0/1 with cardinality equal to amount of known values.
trait MultinominalExtractorParams extends HasInputCol with HasOutputCol

Parameters for multinominal feature extractor.
class NaNToMeanReplacerEstimator extends Estimator[NaNToMeanReplacerModel] with NaNToMeanReplacerParams

Estimates mean values ignoring NaN's
class NaNToMeanReplacerModel extends Model[NaNToMeanReplacerModel] with NaNToMeanReplacerParams with DefaultParamsWritable

Model used to replace values with pre-computed defaults before training/predicting.
trait NaNToMeanReplacerParams extends HasInputCol with HasOutputCol

Set of parameters for the replacer
class NameAssigner extends Transformer with HasInputCols

Assuming there is a metadata attached to a integer field can be used to replace ints with corresponding attribute names.
Assuming there is a metadata attached to a integer field can be used to replace ints with corresponding attribute names. Used, for example in the validation pipeline to avoid attaching large strings to the validation results (eg. score/label descriptions) before the very end.
class NullToDefaultReplacer extends Transformer with HasColumnsSets with DefaultParamsWritable

Utility used to replace null values with defaults (zero or false).
class NullToNaNVectorAssembler extends Transformer with HasInputCols with HasOutputCol with HasColumnAttributeMap with DefaultParamsWritable

:: Experimental :: A feature transformer that merges multiple columns into a vector column.
:: Experimental :: A feature transformer that merges multiple columns into a vector column.
This class is a copy of VectorAssembler with two enhancements: support for nulls (replaced to NaNs) and pattern matching extracted from the inner loop.

Annotations
@Experimental()
class PartitionedRankingEvaluator extends Evaluator[PartitionedRankingEvaluator] with HasOutputCol with HasGroupByColumns

Evaluator used to compute metrics for predictions grouped by a certain criteria (typically by a user id).
Evaluator used to compute metrics for predictions grouped by a certain criteria (typically by a user id). Materializes all the predictions for a criteria in memory and calculates multiple metrics. Can be used only for fine-grained grouping criteria. Supports mutli-label and multi-score cross evaluation (computes metrics for each label-score combinations if provided with vectors instead of scalars).
trait PartitioningParams extends Params with HasSortByColumns

Settings for partitioning, except the number of partitions.
Settings for partitioning, except the number of partitions. Is extended by static and dynamic partitioners.
class PipelinedFeatureSelector extends GenericFeatureSelector[PipelinedFeatureSelector]
class PipelinedSignificantFeatureSelector extends SignificantFeatureSelector[PipelinedFeatureSelector]
final class QuantileDiscretizer extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable with HasInputCols with HasOutputCols

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Annotations
@Since( "1.6.0" )
trait QuantileDiscretizerBase extends Params with HasHandleInvalid with HasInputCol with HasOutputCol

Params for QuantileDiscretizer.
class RegressionEvaluator extends Evaluator[RegressionEvaluator]

Simple evaluator based on the mllib.RegressionMetrics.
Simple evaluator based on the mllib.RegressionMetrics.
TODO: Add unit tests
abstract class ScalerEstimator[M <: ModelWithSummary[M]] extends Estimator[Unscaler[M]] with ScalerParams with DefaultParamsWritable

This is a specific implementation of the scaler for linear models.
This is a specific implementation of the scaler for linear models. Uses the ability to propagate scaling to the weights to avoid overhead when predicting.
trait ScalerParams extends Params with HasFeaturesCol

Scaler parameters.
class SelectingModel[N <: ModelWithSummary[N]] extends CombinedModel[N, SelectingModel[N]] with HasTypeCol with HasFeaturesCol

Selecting model applies exactly one model based on instance type and return its result.
class SeriallizableAvlTreeDigest extends Serializer[SeriallizableAvlTreeDigest] with Serializable

Serializable wrapper over the TDigest
Serializable wrapper over the TDigest

Annotations
@DefaultSerializer()
abstract class SignificantFeatureSelector[Filter <: GenericFeatureSelector[Filter]] extends Estimator[Filter] with SummarizableEstimator[Filter] with HasWeights with HasFeaturesCol with HasFeaturesSignificance
class SqlFilter extends Transformer with DefaultParamsWritable

Simple utility used to apply SQL WHERE filter
trait SummarizableEstimator[M <: ModelWithSummary[M]] extends Estimator[M]

Estimator with produces model with summary.
Estimator with produces model with summary. Used to simplify chaining.
class TopKTransformer[B] extends Transformer with DefaultParamsWritable with HasGroupByColumns

Created by eugeny.malyutin on 20.07.17.
Created by eugeny.malyutin on 20.07.17.
Performs TopK-UDAF logic without annoying schema pack-unpack
B
- raw type (Long for LongTyped-columns) for columnToOrderBy Ordering for this type should be defined
class TopKUDAF[B] extends UserDefinedAggregateFunction with Logging

Created by eugeny.malyutin on 24.06.16.
Created by eugeny.malyutin on 24.06.16.
UDAF designed to extract top-numRows rows by columnValue Used to replace Hive Window-functions which are to slow in case of all-df in one aggregation cell Result of aggFun is packed in a column "arrData" and need to be org.apache.spark.sql.functions.explode-d
B
- type of columnToSortBy with implicit ordering for type B
class UnwrappedStage[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Estimator[M] with SummarizableEstimator[M]

In case if we can avoid certain stages used during training while predicting we need to propagate some changes to the model (eg.
In case if we can avoid certain stages used during training while predicting we need to propagate some changes to the model (eg. unscale weights or remove intercept). Also useful for extending summary blocks (eg. during evaluation/cross-validation).
This class is used as a typical pipeline stage while training (fits and applies transformer, then calls the nested estimator), but it automatically eliminates itself from the resulting model by applying model transformer.
class VectorExplode extends Transformer with DefaultParamsWritable

Utility used to extract nested values from vectors into dedicated columns.
Utility used to extract nested values from vectors into dedicated columns. Requires vector metadata and extracts names from where. Typically used as a final stage before results visualization.
class VectorStatCollector extends Transformer with HasInputCol with HasGroupByColumns with DefaultParamsWritable

Utility used to collect detailed stat for vectors grouped by a certain keys.
Utility used to collect detailed stat for vectors grouped by a certain keys. In addition to common stuff (mean, variance, min/max, norms) calculates percentiles as configured. Resulting dataframe contains only columns from the key and stat columns.
case class WeightedFeature(index: Int, name: String, weight: Double) extends Product with Serializable

Utility used for reporting single indexed feature weight.
case class WeightsStat(stats: Array[WeightsStatRecord]) extends Product with Serializable
case class WeightsStatRecord(index: Int, name: String, descriminant: String, average: Double, stdDev: Double, count: Long, significance: Double, isRelevant: Boolean) extends Product with Serializable
class XGBoostRegressor extends Estimator[XGRegressionModelWrapper] with SummarizableEstimator[XGRegressionModelWrapper] with OkXGBoostRegressorParams with HasLossHistory with HasFeaturesSignificance with DefaultParamsWritable

Light weight wrapper for DMLC xgboost4j-spark.
Light weight wrapper for DMLC xgboost4j-spark. Optimizes defaults and provides rich summary extraction.
class XGRegressionModelWrapper extends Model[XGRegressionModelWrapper] with ModelWithSummary[XGRegressionModelWrapper] with PredictorParams with OkXGBoostRegressorParams

Value Members

object AutoAssembler extends DefaultParamsReadable[AutoAssembler] with Serializable

Adds read logic
object ColumnsExtractor extends DefaultParamsReadable[ColumnsExtractor] with Serializable

Adds read ability.
object CombinedModel extends MLReadable[PipelineStage] with Serializable
object CompositScaleEstimator extends DefaultParamsReadable[CompositScaleEstimator[_, _]] with Serializable
object CrossValidator extends DefaultParamsReadable[CrossValidator[_]] with Serializable
object DSVRGD extends Serializable with HasNetlibBlas with HasLossHistory
object Evaluator extends Serializable
object ForkedEstimator extends Serializable

Helper used to inject common task support with thread count limit into all forked estimators.
object ForkedSparkEstimator extends MLReadable[ForkedSparkEstimator[_, _]] with Serializable
object ForkedSparkEstimatorApp extends App with Logging
object HasNetlibBlas extends Serializable
object Interceptor extends DefaultParamsReadable[Interceptor] with Serializable
object IsotonicRegression extends DefaultParamsReadable[IsotonicRegression] with Serializable

Annotations
@Since( "1.6.0" )
object JacksonParam extends Serializable
object LinearDSVRGD extends DefaultParamsReadable[LinearDSVRGD] with Serializable
object LinearMatrixDSVRGD extends DefaultParamsReadable[LinearMatrixDSVRGD] with Serializable
object LinearRegressionModel extends MLReadable[LinearRegressionModel] with Serializable
object LinearRegressionSGD extends DefaultParamsReadable[LinearRegressionSGD] with Serializable
object LinearScaleEstimator extends DefaultParamsReadable[LinearScaleEstimator[_]] with Serializable
object LogisticDSVRGD extends DefaultParamsReadable[LogisticDSVRGD] with Serializable
object LogisticMatrixDSVRGD extends DefaultParamsReadable[LogisticMatrixDSVRGD] with Serializable
object LogisticRegressionLBFSG extends DefaultParamsReadable[LogisticRegressionLBFSG] with Serializable
object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] with Serializable
object MLWrapperModel extends MLReadable[PipelineStage] with Serializable
object MatrixLBFGS extends Logging with HasNetlibBlas with Serializable
object MatrixUtils

Created by dmitriybugaichenko on 19.11.16.
Created by dmitriybugaichenko on 19.11.16.
Utility alowing access of certain hidden methods of Spark's mllib linalg
object ModelWithSummary extends MLReadable[PipelineStage] with Serializable

Helper for reading and writing models in a typed way.
object MultinominalExtractor extends DefaultParamsReadable[MultinominalExtractor] with Serializable

Adds read logic
object MultinominalExtractorModel extends DefaultParamsReadable[MultinominalExtractorModel] with Serializable

Adds read ability
object NaNToMeanReplacerModel extends DefaultParamsReadable[NaNToMeanReplacerModel] with Serializable

Adds support for reading.
object NullToDefaultReplacer extends DefaultParamsReadable[NullToDefaultReplacer] with Serializable

Adds read ability.
object NullToNaNVectorAssembler extends DefaultParamsReadable[NullToNaNVectorAssembler] with Serializable

Annotations
@Since( "1.6.0" )
object PartitionedRankingEvaluator extends Serializable
object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging with Serializable

Annotations
@Since( "1.6.0" )
object Scaler extends Serializable
object SignificantFeatureSelector extends Serializable with Logging
object SqlFilter extends DefaultParamsReadable[SqlFilter] with Serializable
object UnwrappedStage extends Serializable
object XGBoostRegressor extends DefaultParamsReadable[XGBoostRegressor] with Serializable
object XGRegressionModelWrapper extends MLReadable[XGRegressionModelWrapper] with Serializable
package hyperopt
package texts

package odkl

Type Members

class AutoAssembler extends Estimator[PipelineModel] with AutoAssemblerParams with DefaultParamsWritable

trait AutoAssemblerParams extends HasColumnsSets with HasOutputCol with HasColumnAttributeMap

class BinaryClassificationEvaluator extends Evaluator[BinaryClassificationEvaluator]

class CRRSamplerEstimator extends Estimator[CRRSamplerModel] with DefaultParamsWritable with CRRSamplerParams

class CRRSamplerModel extends Model[CRRSamplerModel] with DefaultParamsWritable with CRRSamplerParams with HasNetlibBlas

trait CRRSamplerParams extends HasInputCol with HasGroupByColumns with HasLabelCol

case class ClassPathExpression(filter: String, replacements: Array[(String, String)]) extends Product with Serializable

class ColumnsExtractor extends Transformer with DefaultParamsWritable

class CombinedLinearModelUnwrappedFeatureSelector[M <: LinearModel[M], C <: CombinedModel[M, C]] extends GenericFeatureSelector[CombinedLinearModelUnwrappedFeatureSelector[M, C]] with ModelTransformer[C, CombinedLinearModelUnwrappedFeatureSelector[M, C]]

abstract class CombinedModel[M <: ModelWithSummary[M], C <: CombinedModel[M, C]] extends Model[C] with ModelWithSummary[C] with HasDescriminantColumn with HasDirectTransformOption with HasPredictionCol with ForkedModelParams

class CompositScaleEstimator[M <: LinearModel[M], C <: CombinedModel[M, C]] extends ScalerEstimator[C]

class CrossValidator[M <: ModelWithSummary[M]] extends ForkedEstimatorSameType[M, Int] with HasIsTestCol with HasFolds with MetricsExtractor with HasMetricsBlock

abstract class DSVRGD[M <: ModelWithSummary[M]] extends Estimator[M] with SummarizableEstimator[M] with HasPredictionCol with HasFeaturesCol with HasLabelCol with HasRegParam with HasElasticNetParam with HasNetlibBlas with HasMaxIter with HasTol with HasCacheTrainData

abstract class DeVectorizedDSVRGD[M <: ModelWithSummary[M]] extends DSVRGD[M]

abstract class Evaluator[S <: Evaluator[S]] extends Transformer with HasLabelCol with HasPredictionCol

class ExponentialVectorDiscountTransformer extends Transformer with DefaultParamsWritable with HasGroupByColumns

class ExtendedMultivariateOnlineSummarizer extends MultivariateOnlineSummarizer with Serializable with Logging

class FoldedFeaturesStatsAggregator[SelectingModel <: ModelWithSummary[SelectingModel] with HasWeights] extends Transformer with HasFeaturesSignificance with HasWeights with HasFeaturesCol

trait ForkSource[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType, ModelOut <: ModelWithSummary[ModelOut]] extends AnyRef

abstract class ForkedEstimator[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType, ModelOut <: ModelWithSummary[ModelOut]] extends Estimator[ModelOut] with SummarizableEstimator[ModelOut] with ForkedModelParams with HasNumThreads

abstract class ForkedEstimatorSameType[ModelIn <: ModelWithSummary[ModelIn], ForeKeyType] extends ForkedEstimator[ModelIn, ForeKeyType, ModelIn]

trait ForkedModelParams extends AnyRef

class ForkedSparkEstimator[M <: ModelWithSummary[M] with MLWritable, E <: SummarizableEstimator[M] with MLWritable] extends Estimator[M] with SummarizableEstimator[M] with MLWritable

abstract class GenericFeatureSelector[M <: ModelWithSummary[M]] extends Model[M] with ModelWithSummary[M] with HasFeaturesCol

trait HasBatchSize extends Params

trait HasCacheTrainData extends AnyRef

trait HasClassesCol extends Params

trait HasClassesWeights extends Params

trait HasColumnAttributeMap extends AnyRef

trait HasColumnsSets extends Params

trait HasDescriminantColumn extends Params

trait HasDirectTransformOption extends Transformer

trait HasFeaturesSignificance extends AnyRef

trait HasFolds extends Params

trait HasGroupByColumns extends AnyRef

trait HasIsTestCol extends Params

trait HasLossHistory extends AnyRef

trait HasMetricsBlock extends AnyRef

trait HasNetlibBlas extends AnyRef

trait HasNumThreads extends AnyRef

trait HasRegularizeLast extends Params

trait HasSortByColumns extends AnyRef

trait HasTypeCol extends Params

trait HasWeights extends AnyRef

class Interceptor extends Transformer with HasFeaturesCol with DefaultParamsWritable

class IsotonicRegression extends regression.IsotonicRegression

class JacksonParam[T] extends Param[T] with Logging

class LinearCombinationModel[N <: ModelWithSummary[N]] extends MultiClassCombinationModelBase[N, LinearCombinationModel[N]]

class LinearDSVRGD extends DeVectorizedDSVRGD[LinearRegressionModel]

abstract class LinearEstimator[M <: LinearModel[M], T <: LinearEstimator[M, T]] extends Predictor[Vector, T, M] with SummarizableEstimator[M] with LinearModelParams with HasWeights

class LinearMatrixDSVRGD extends DSVRGD[LinearCombinationModel[LinearRegressionModel]]

abstract class LinearModel[M <: LinearModel[M]] extends PredictionModel[Vector, M] with DirectPredictionModel[Vector, M] with ModelWithSummary[M] with LinearModelParams with HasWeights

trait LinearModelParams extends PredictorParams

class LinearModelSignificantFeatureSelector[ResultModel <: LinearModel[ResultModel]] extends SignificantFeatureSelector[LinearModelUnwrappedFeatureSelector[ResultModel]]

class LinearModelUnwrappedFeatureSelector[M <: LinearModel[M]] extends GenericFeatureSelector[LinearModelUnwrappedFeatureSelector[M]] with ModelTransformer[M, LinearModelUnwrappedFeatureSelector[M]]

class LinearRegressionModel extends LinearModel[LinearRegressionModel]

class LinearRegressionSGD extends LinearRegressor[LinearRegressionModel, GradientDescent, LinearRegressionSGD] with HasRegParam with HasTol with HasMaxIter with HasStepSize

abstract class LinearRegressor[M <: LinearModel[M], O <: Optimizer, T <: LinearRegressor[M, O, T]] extends LinearEstimator[M, T] with DefaultParamsWritable with HasCacheTrainData

class LinearScaleEstimator[M <: LinearModel[M]] extends ScalerEstimator[M]

class LogisticDSVRGD extends DeVectorizedDSVRGD[LogisticRegressionModel]

class LogisticMatrixDSVRGD extends DSVRGD[LinearCombinationModel[LogisticRegressionModel]]

class LogisticRegressionLBFSG extends LinearRegressor[LogisticRegressionModel, LogisticRegressionLBFSG, LogisticRegressionLBFSG] with HasRegParam with HasTol with HasMaxIter with Optimizer with HasElasticNetParam with HasRegularizeLast with HasBatchSize

class LogisticRegressionModel extends LinearModel[LogisticRegressionModel]

class MLWrapper[M <: Model[M]] extends Estimator[MLWrapperModel[M]] with SummarizableEstimator[MLWrapperModel[M]]

class MLWrapperModel[M <: Model[M]] extends Model[MLWrapperModel[M]] with ModelWithSummary[MLWrapperModel[M]]

class MatrixLBFGS extends Estimator[LinearCombinationModel[LogisticRegressionModel]] with SummarizableEstimator[LinearCombinationModel[LogisticRegressionModel]] with PredictorParams with HasTol with HasMaxIter with HasRegParam with HasRegularizeLast with HasBatchSize with HasNetlibBlas

class MetadataParam extends Param[Metadata]

class ModelSummary extends Serializable

trait ModelTransformer[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Model[T]

trait ModelWithSummary[M <: ModelWithSummary[M]] extends Model[M] with MLWritable

class MultiClassCombinationModel[N <: ModelWithSummary[N]] extends MultiClassCombinationModelBase[N, MultiClassCombinationModel[N]]

abstract class MultiClassCombinationModelBase[N <: ModelWithSummary[N], M <: MultiClassCombinationModelBase[N, M]] extends CombinedModel[N, M] with HasClassesCol with HasPredictionCol with HasFeaturesCol

class MultinominalExtractor extends Estimator[MultinominalExtractorModel] with MultinominalExtractorParams with DefaultParamsWritable

class MultinominalExtractorModel extends Model[MultinominalExtractorModel] with MultinominalExtractorParams with DefaultParamsWritable

trait MultinominalExtractorParams extends HasInputCol with HasOutputCol

class NaNToMeanReplacerEstimator extends Estimator[NaNToMeanReplacerModel] with NaNToMeanReplacerParams

class NaNToMeanReplacerModel extends Model[NaNToMeanReplacerModel] with NaNToMeanReplacerParams with DefaultParamsWritable