Utility used to inject caching.
Collects all summary blocks and materializes them into a single partition. Then saves it to parquet in order not to waste memory.
Collects all summary blocks and materializes them into a single partition.
When training a model on a dataset of uncertain size, adds the ability to downsample it to an (approximately) predefined size.
If the number of partitions is not known upfront, you can use the dynamic partitioner to split the data into partitions of an (approximately) predefined size.
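The arithmetic behind both ideas can be illustrated with a minimal sketch. It is not the library's implementation: the function names `downsample_fraction` and `dynamic_partition_count` are hypothetical, and it assumes an estimate of the row count is already available.

```python
import math


def downsample_fraction(estimated_rows: int, target_rows: int) -> float:
    """Fraction of rows to keep so that roughly target_rows remain (hypothetical helper)."""
    if estimated_rows <= 0:
        return 1.0
    # Never ask for more than the whole dataset.
    return min(1.0, target_rows / estimated_rows)


def dynamic_partition_count(estimated_rows: int, rows_per_partition: int) -> int:
    """Number of partitions needed so each holds about rows_per_partition rows (hypothetical helper)."""
    return max(1, math.ceil(estimated_rows / rows_per_partition))


if __name__ == "__main__":
    print(downsample_fraction(1_000_000, 100_000))
    print(dynamic_partition_count(1_050, 100))
```

Because the row count is only an estimate, the resulting dataset or partition size is approximate, which matches the "approximately" qualifier in the descriptions above.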
Data transformer which does nothing :)
Model transformer applying transformation only to data, keeping the model unchanged.
Utility simplifying transformations when only model transformation is required.
Utility simplifying creation of a predefined model transformer (when no fitting is required).
Keeps data based on some ordered constraint.
When training a model on a dataset of uncertain size, adds the ability to take only the "most recent" records. Estimates the size of the dataset and calculates approximate bounds for filtering.
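One way such approximate bounds could be derived is sketched below. This is an assumption for illustration only, not the library's algorithm: the helper `recent_threshold` is hypothetical, and it assumes a small sample of the ordering column (for example timestamps) plus an estimated total row count are available.

```python
def recent_threshold(sampled_keys, estimated_total: int, target_rows: int):
    """Return a cut-off value; rows with key >= cut-off are the 'most recent' ones.

    sampled_keys    -- a sample of the ordering column values (assumed representative)
    estimated_total -- estimated number of rows in the full dataset
    target_rows     -- roughly how many of the most recent rows to keep
    """
    ordered = sorted(sampled_keys)
    if not ordered:
        raise ValueError("sample must not be empty")
    # Number of rows we would like to drop from the full dataset.
    drop = max(0, estimated_total - target_rows)
    # Map that share onto the sample using integer arithmetic.
    cut = min(len(ordered) - 1, len(ordered) * drop // max(1, estimated_total))
    return ordered[cut]


if __name__ == "__main__":
    threshold = recent_threshold(range(100), estimated_total=1000, target_rows=100)
    kept = [k for k in range(100) if k >= threshold]
    print(threshold, len(kept))
```

The bound is only as good as the sample and the size estimate, so the number of records that pass the filter is approximate.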
Data transformer which adds partitioning.
Utility used to persist a portion of data into temporary storage. Useful for grounding execution plans and avoiding massive "skips". Unlike checkpointing, it is more explicit and controllable.
Utility simplifying transformations when data transformation is provided externally.
Data transformer for projecting.
Parameters for sampling
Data transformer which takes a sample of the data. The resulting dataframe is constructed so that results are non-deterministic and might vary from run to run, unless a seed is specified or sampling with replacement is enabled; in those cases it falls back to the default dataset sampling, which is deterministic.
Cache data before passing to estimator (won't be cached in resulting prediction model).
Cache data before passing to estimator (won't be cached in resulting prediction model).
Cache data before passing to estimator (won't be cached in resulting prediction model). Forces cache materialization by calling count.
Collect all summary blocks to the driver and re-create the dataframe with a single block. Useful to reduce the number of partitions and tasks for the final persist.
Estimator to wrap summary blocks for.
Final model is the same, but summary blocks are collected and re-created.
Saves summary blocks to parquet files and re-creates the dataframe. Useful to reduce the memory footprint for tasks with a large summary (e.g. cross-validation output).
Estimator to wrap summary blocks for.
Where to save parquet files
Final model is the same, but summary blocks are written as single-partition parquet files and re-created.
Adds a stage with data-only transformation (e.g. assigning folds).
Adds a stage with data-only transformation (e.g. assigning folds).
Adds a stage with model-only transformation (e.g. evaluation).
Stores data into a temporary path. Useful for "grounding" data and avoiding large execution plans.
Keeps only a predefined set of columns in the dataset before passing it to the estimator. Useful in combination with caching to reduce the memory footprint. The projection will not appear in the resulting prediction model.
Estimator to call after projecting.
Columns to keep.
Exactly the same model as produced by the estimator.
Removes a predefined set of columns from the dataset before passing it to the estimator. Useful in combination with caching to reduce the memory footprint. The projection will not appear in the resulting prediction model.
Estimator to call after projecting.
Columns to remove.
Exactly the same model as produced by the estimator.
Repartition the data before passing to estimator. Repartitioning will not appear in the resulting prediction model.
Estimator to add partitioning to.
Number of partitions.
Columns to partition by.
Exactly the same model as produced by the estimator.
Repartition the data before passing to estimator. Repartitioning will not appear in the resulting prediction model.
Estimator to add partitioning to.
Number of partitions.
Exactly the same model as produced by the estimator.
Repartition the data before passing to estimator. Repartitioning will not appear in the resulting prediction model.
Defines the logic of partitioning.
Repartition the data before passing to estimator. Repartitioning will not appear in the resulting prediction model.
Estimator to add partitioning to.
Number of partitions.
Columns to partition by.
Columns to sort data by within partitions. Note that the partitionBy columns are not added to this set by default.
Exactly the same model as produced by the estimator.
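The combination of partitioning columns and sort columns can be pictured with a small in-memory sketch. It is a simplified model under stated assumptions, not the library's or Spark's code: the function `repartition_and_sort` is hypothetical, rows are plain dictionaries, and a CRC32 hash stands in for the real partitioner.

```python
import zlib


def repartition_and_sort(rows, num_partitions, partition_by, sort_by):
    """Hash rows into partitions by partition_by, then sort each partition by sort_by only."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Build a stable key from the partitioning columns.
        key = "|".join(str(row[c]) for c in partition_by)
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    for part in partitions:
        # Only the sort columns are used here; the partitioning columns
        # are not added to the sort key implicitly.
        part.sort(key=lambda r: tuple(r[c] for c in sort_by))
    return partitions


if __name__ == "__main__":
    data = [
        {"user": "a", "ts": 3},
        {"user": "b", "ts": 1},
        {"user": "a", "ts": 1},
        {"user": "b", "ts": 2},
    ]
    for part in repartition_and_sort(data, 2, ["user"], ["ts"]):
        print(part)
```

If rows should be grouped by the partitioning key inside each partition, the caller would have to list those columns among the sort columns explicitly, which is what the note on the sort parameter points out.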
Adds a stage for sampling data from the dataset. Behavior is deterministic (each iteration produces the same result) if withReplacement or seed is specified; otherwise the behavior is non-deterministic and subsequent iterations might see different samples.
Estimator to sample data for.
Expected number of records to sample
Whether to sample with replacement (a single item might be selected multiple times).
Seed for the random number generation.
Estimator which samples data before passing it to the nested estimator.
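The effect of the seed on reproducibility can be illustrated with a minimal sketch using Python's standard library. It only models the seeded versus unseeded case (sampling with replacement is not covered), and the helper `sample_rows` is hypothetical rather than part of the documented API.

```python
import random


def sample_rows(rows, expected_records, seed=None):
    """Keep each row independently so that about expected_records rows remain."""
    rows = list(rows)
    fraction = min(1.0, expected_records / max(1, len(rows)))
    # With a fixed seed the generator replays the same sequence on every call;
    # with seed=None it is seeded from system entropy and results may differ.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    first = sample_rows(range(1000), expected_records=100, seed=42)
    second = sample_rows(range(1000), expected_records=100, seed=42)
    print(first == second)           # same seed, same sample
    unseeded = sample_rows(range(1000), expected_records=100)
    print(len(unseeded))             # close to 100, varies between runs
```

An estimator that iterates over its input several times sees the same sample on every pass only in the seeded case, which is the distinction the description of this stage draws.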
Adds a stage with data downstream transformation and model upstream transformation.