UnwrappedStage

Type Members

class CachingTransformer[M <: ModelWithSummary[M]] extends Model[CachingTransformer[M]] with ModelTransformer[M, CachingTransformer[M]]

Utility used to inject caching.
class CollectSummaryToParquetTransformer[M <: ModelWithSummary[M]] extends ModelOnlyTransformer[M, CollectSummaryToParquetTransformer[M]]

Collects all summary blocks and materializes them into a single partition, then saves the result to Parquet so as not to waste memory.
class CollectSummaryTransformer[M <: ModelWithSummary[M]] extends ModelOnlyTransformer[M, CollectSummaryTransformer[M]]

Collects all summary blocks and materializes them into a single partition.
class DynamicDataTransformerTrainer[M <: ModelWithSummary[M]] extends Estimator[IdentityModelTransformer[M]] with DefaultParamsWritable with PartitioningParams
class DynamicDownsamplerTrainer extends Estimator[SamplingTransformer] with SamplerParams

Adds the ability to downsample a dataset of uncertain size to (approximately) a predefined size before training a model.
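As a rough illustration of what downsampling to an approximate target size involves, the sketch below derives a sampling fraction from a record count. The object name, method name, and the formula itself are assumptions made for illustration; the library's actual estimation logic is not described here and may differ.

```scala
// Hypothetical helper, not part of the library: shows one plausible way to
// turn "keep about N records" into a sampling fraction.
object DownsampleSketch {
  /** Fraction of rows to keep so that roughly `targetRecords` remain. */
  def samplingFraction(totalRecords: Long, targetRecords: Long): Double = {
    require(targetRecords > 0, "targetRecords must be positive")
    // Datasets already at or below the target are kept whole.
    if (totalRecords <= targetRecords) 1.0
    else targetRecords.toDouble / totalRecords.toDouble
  }
}
```

Because the fraction is applied per record, the resulting size is only approximately equal to the target, which matches the "(approximately)" caveat above.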
class DynamicPartitionerTrainer[M <: ModelWithSummary[M]] extends Estimator[IdentityModelTransformer[M]] with DefaultParamsWritable with PartitioningParams

If the number of partitions is not known up front, the dynamic partitioner can split the data into partitions of (approximately) a predefined size.
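The idea of choosing a partition count from a target partition size can be sketched with simple arithmetic. The names and the rounding rule below are assumptions for illustration only, not the library's implementation.

```scala
// Hypothetical helper, not part of the library: derives a partition count
// from a record count and a desired number of records per partition.
object PartitionSketch {
  def partitionsFor(totalRecords: Long, recordsPerPartition: Long): Int = {
    require(recordsPerPartition > 0, "recordsPerPartition must be positive")
    // Ceiling division, so a partial last partition still gets its own slot.
    val needed = (totalRecords + recordsPerPartition - 1) / recordsPerPartition
    // Always use at least one partition and stay within Int range.
    math.max(1L, math.min(needed, Int.MaxValue.toLong)).toInt
  }
}
```

For example, under these assumptions one million records at 250,000 records per partition would yield four partitions.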
class IdentityDataTransformer extends Transformer

Data transformer which does nothing :)
class IdentityModelTransformer[M <: ModelWithSummary[M]] extends PredefinedDataTransformer[M, IdentityModelTransformer[M]]

Model transformer applying transformation only to data, keeping the model unchanged.
abstract class ModelOnlyTransformer[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Model[T] with ModelTransformer[M, T] with DefaultParamsWritable

Utility simplifying transformations when only model transformation is required.
class NoTrainEstimator[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Estimator[T] with DefaultParamsWritable

Utility simplifying creation of predefined model transformer (when no fitting required).
class OrderedCut extends Model[OrderedCut] with HasGroupByColumns

Keeps data based on an ordered constraint.
class OrderedCutEstimator extends Estimator[OrderedCut] with HasGroupByColumns

Adds the ability to take only the "most recent" records when training a model on a dataset of uncertain size. Estimates the size of the dataset and calculates approximate bounds for filtering.
class PartitioningTransformer extends Transformer with PartitioningParams

Data transformer which adds partitioning.
class PersistingTransformer[M <: ModelWithSummary[M]] extends Model[PersistingTransformer[M]] with ModelTransformer[M, PersistingTransformer[M]]

Utility used to persist a portion of the data into temporary storage. Useful for grounding execution plans and avoiding massive "skips". More explicit and controllable than checkpointing.
abstract class PredefinedDataTransformer[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]] extends Model[T] with ModelTransformer[M, T] with DefaultParamsWritable

Utility simplifying transformations when data transformation is provided externally.
class ProjectingTransformer extends Transformer

Data transformer for projecting.
trait SamplerParams extends HasSeed with DefaultParamsWritable

Parameters for sampling
class SamplingTransformer extends Model[SamplingTransformer] with SamplerParams

Data transformer which takes a sample of the data. The resulting dataframe is constructed so that results are non-deterministic and may vary from run to run, unless a seed is specified or sampling with replacement is enabled; in those cases it falls back to the default dataset sampling, which is deterministic.

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def cache[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], cacher: CachingTransformer[M]): UnwrappedStage[M, CachingTransformer[M]]

Cache data before passing to estimator (won't be cached in resulting prediction model).
def cache[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): UnwrappedStage[M, CachingTransformer[M]]

Cache data before passing to estimator (won't be cached in resulting prediction model).
def cacheAndMaterialize[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): UnwrappedStage[M, CachingTransformer[M]]

Cache data before passing to estimator (won't be cached in resulting prediction model). Forces cache materialization by calling count.
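The difference between the two caching entry points can be sketched as follows. This is a minimal usage sketch based only on the signatures listed above; the import path `org.apache.spark.ml.odkl` is an assumption about where the package lives, and `CachingSketch` with its method names is hypothetical.

```scala
// Assumed package location; adjust if the odkl package lives elsewhere.
import org.apache.spark.ml.odkl.{ModelWithSummary, SummarizableEstimator, UnwrappedStage}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Marks the training data as cached; Spark fills the cache lazily,
  // when the data is first evaluated.
  def cached[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M])
      : UnwrappedStage[M, UnwrappedStage.CachingTransformer[M]] =
    UnwrappedStage.cache(estimator, StorageLevel.MEMORY_AND_DISK)

  // Same wrapping, but the cache is materialized eagerly via count.
  def cachedEagerly[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M])
      : UnwrappedStage[M, UnwrappedStage.CachingTransformer[M]] =
    UnwrappedStage.cacheAndMaterialize(estimator, StorageLevel.MEMORY_AND_DISK)
}
```

Both helpers are generic in `M`, so they can wrap any estimator that fits the documented signature without naming a concrete model class.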
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def collectSummary[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M]): UnwrappedStage[M, CollectSummaryTransformer[M]]

Collects all summary blocks to the driver and re-creates the dataframe with a single block. Useful for reducing the number of partitions and tasks for the final persist.
estimator
Estimator to wrap summary blocks for.
returns
Final model is the same, but summary blocks are collected and re-created.
def collectSummaryToParquet[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], path: String): UnwrappedStage[M, CollectSummaryToParquetTransformer[M]]

Saves summary blocks to Parquet files and re-creates the dataframe. Useful for reducing the memory footprint of tasks with a large summary (e.g. cross-validation output).
estimator
Estimator to wrap summary blocks for.
path
Where to save parquet files
returns
Final model is the same, but summary blocks are written as single-partition Parquet files and re-created.
def dataOnly[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], dataTransformer: Transformer): UnwrappedStage[M, IdentityModelTransformer[M]]

Adds a stage with data-only transformation (e.g. assigning folds).
def dataOnlyWithTraining[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], dataTransformerFitter: Estimator[_]): UnwrappedStage[M, IdentityModelTransformer[M]]

Adds a stage with data-only transformation (e.g. assigning folds).
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def modelOnly[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]](estimator: SummarizableEstimator[M], modelTransformer: T): UnwrappedStage[M, T]

Adds a stage with model-only transformation (e.g. evaluation).
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def persistToTemp[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], tempPath: String, uncacheInput: Boolean = false, partitionBy: Array[String] = Array()): UnwrappedStage[M, PersistingTransformer[M]]

Stores data into a temporary path. Useful for "grounding" data and avoiding large execution plans.
def project[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], columns: Seq[String]): UnwrappedStage[M, IdentityModelTransformer[M]]

Keeps only a predefined set of columns in the dataset before passing it to the estimator. Useful in combination with caching to reduce memory footprint. The projection will not appear in the resulting prediction model.
estimator
Estimator to call after projecting.
columns
Columns to keep.
returns
Exactly the same model as produced by the estimator.
def projectInverse[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], columns: Seq[String]): UnwrappedStage[M, IdentityModelTransformer[M]]

Removes a predefined set of columns from the dataset before passing it to the estimator. Useful in combination with caching to reduce memory footprint. The projection will not appear in the resulting prediction model.
estimator
Estimator to call after projecting.
columns
Columns to remove.
returns
Exactly the same model as produced by the estimator.
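Combining projection with caching, as suggested above, might look like the sketch below. It relies on two assumptions that are not confirmed by this page: that the package is importable as `org.apache.spark.ml.odkl`, and that an `UnwrappedStage[M, _]` is itself a `SummarizableEstimator[M]` and can therefore be nested. `ProjectThenCacheSketch` and `slimAndCache` are hypothetical names.

```scala
// Assumed package location; adjust if the odkl package lives elsewhere.
import org.apache.spark.ml.odkl.{ModelWithSummary, SummarizableEstimator, UnwrappedStage}
import org.apache.spark.storage.StorageLevel

object ProjectThenCacheSketch {
  def slimAndCache[M <: ModelWithSummary[M]](
      estimator: SummarizableEstimator[M],
      columns: Seq[String]): SummarizableEstimator[M] = {
    // Inner stage: cache whatever data reaches it, then train the estimator.
    // ASSUMPTION: UnwrappedStage extends SummarizableEstimator[M].
    val cached: SummarizableEstimator[M] =
      UnwrappedStage.cache[M](estimator, StorageLevel.MEMORY_ONLY)
    // Outer stage: drop unneeded columns first, so only the kept columns
    // are handed to the caching stage.
    UnwrappedStage.project[M](cached, columns)
  }
}
```

The ordering matters under these assumptions: the outer stage transforms the data before the nested one sees it, so wrapping `project` around `cache` is what keeps the dropped columns out of the cache.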
def repartition[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], numPartitions: Int, partitionBy: Seq[String]): UnwrappedStage[M, IdentityModelTransformer[M]]

Repartitions the data before passing it to the estimator. Repartitioning will not appear in the resulting prediction model.
estimator
Estimator to add partitioning to.
numPartitions
Number of partitions.
partitionBy
Columns to partition by.
returns
Exactly the same model as produced by the estimator.
def repartition[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], numPartitions: Int): UnwrappedStage[M, IdentityModelTransformer[M]]

Repartitions the data before passing it to the estimator. Repartitioning will not appear in the resulting prediction model.
estimator
Estimator to add partitioning to.
numPartitions
Number of partitions.
returns
Exactly the same model as produced by the estimator.
def repartition[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], partitioner: PartitioningTransformer): UnwrappedStage[M, IdentityModelTransformer[M]]

Repartitions the data before passing it to the estimator. Repartitioning will not appear in the resulting prediction model.
partitioner
Defines the logic of partitioning.
def repartition[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], numPartitions: Int, partitionBy: Seq[String], sortBy: Seq[String]): UnwrappedStage[M, IdentityModelTransformer[M]]

Repartitions the data before passing it to the estimator. Repartitioning will not appear in the resulting prediction model.
estimator
Estimator to add partitioning to.
numPartitions
Number of partitions.
partitionBy
Columns to partition by.
sortBy
Columns to sort data in partitions. Note that partitionBy are not added to this set by default.
returns
Exactly the same model as produced by the estimator.
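Since the partitioning columns are not added to the sort columns automatically, a caller who wants rows grouped by key within each partition has to list the key in both places. A minimal usage sketch, assuming the package is importable as `org.apache.spark.ml.odkl`; the column names `userId` and `timestamp`, the partition count, and the helper names are hypothetical.

```scala
// Assumed package location; adjust if the odkl package lives elsewhere.
import org.apache.spark.ml.odkl.{ModelWithSummary, SummarizableEstimator, UnwrappedStage}

object RepartitionSketch {
  def byUserSortedByTime[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M])
      : UnwrappedStage[M, UnwrappedStage.IdentityModelTransformer[M]] =
    UnwrappedStage.repartition(
      estimator,
      16,                         // numPartitions (illustrative value)
      Seq("userId"),              // partitionBy (hypothetical column)
      Seq("userId", "timestamp")) // sortBy: the key is repeated explicitly
}
```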
def sample[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M], numRecords: Int, withReplacement: Boolean = false, seed: Option[Long] = None): UnwrappedStage[M, IdentityModelTransformer[M]]

Adds a stage for sampling data from the dataset. Behavior is deterministic (each iteration produces the same result) if withReplacement or seed is specified; otherwise it is non-deterministic and subsequent iterations might see different samples.
estimator
Estimator to sample data for.
numRecords
Expected number of records to sample
withReplacement
Whether to simulate replacement (a single item might be selected multiple times).
seed
Seed for the random number generation.
returns
Estimator that samples the data before passing it to the nested estimator.
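To get reproducible samples, the seed can be fixed when the stage is created. The sketch below uses only the `sample` signature listed above; the import path `org.apache.spark.ml.odkl` is an assumption, and the helper names, record count, and seed value are hypothetical.

```scala
// Assumed package location; adjust if the odkl package lives elsewhere.
import org.apache.spark.ml.odkl.{ModelWithSummary, SummarizableEstimator, UnwrappedStage}

object SamplingSketch {
  // Fixed seed: per the description above, repeated runs see the same sample.
  def reproducible[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M])
      : UnwrappedStage[M, UnwrappedStage.IdentityModelTransformer[M]] =
    UnwrappedStage.sample(estimator, 100000, withReplacement = false, seed = Some(42L))

  // No seed and no replacement: samples may differ between runs.
  def varying[M <: ModelWithSummary[M]](estimator: SummarizableEstimator[M])
      : UnwrappedStage[M, UnwrappedStage.IdentityModelTransformer[M]] =
    UnwrappedStage.sample(estimator, 100000)
}
```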
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def wrap[M <: ModelWithSummary[M], T <: ModelTransformer[M, T]](estimator: SummarizableEstimator[M], unwrapableEstimator: Estimator[T]): UnwrappedStage[M, T]

Adds a stage with data downstream transformation and model upstream transformation.

object UnwrappedStage extends Serializable
