StatsGenerator

object StatsGenerator

Module that manages the FeatureStats schema, the aggregations to use for each data type, and aggregator construction.

Stats aggregation has an offline (batch) component and an online component. The metrics defined for stats depend on the schema of the join, that is, on its data types and column names. For the online side, we obtain this information from the JoinCodec/valueSchema. For the offline side, we obtain it directly from the outputTable. To keep the schemas consistent, we sort the metrics in the schema by name (one column can have multiple metrics).
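
The effect of sorting metrics by name can be illustrated with a minimal sketch. The `Metric` case class and the "nulls" and "total" suffixes below are hypothetical stand-ins, not the actual MetricTransform definition or the real suffix values; the only point is that two sides reading the same columns in a different order end up with the same metric order.

```scala
object MetricOrderingExample {
  // Hypothetical, simplified stand-in for a metric: one column plus a suffix.
  final case class Metric(column: String, suffix: String) {
    def name: String = s"${column}_$suffix"
  }

  // Each column yields several metrics; sorting by name makes the result
  // independent of the order in which the columns were discovered.
  def metricsFor(columns: Seq[String]): Seq[Metric] =
    columns
      .flatMap(c => Seq(Metric(c, "total"), Metric(c, "nulls")))
      .sortBy(_.name)
}
```

Calling `MetricOrderingExample.metricsFor(Seq("price", "age"))` and `MetricOrderingExample.metricsFor(Seq("age", "price"))` gives the same sequence of metric names.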

Linear Supertypes

AnyRef, Any

Type Members

case class MetricTransform(name: String, expression: InputTransform, operation: Operation, suffix: String = "", argMap: Map[String, String] = null) extends Product with Serializable
MetricTransform represents a single statistic built on top of an input column.

Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
def PSIKllSketch(reference: AnyRef, comparison: AnyRef, bins: Int = 128, eps: Double = 0.000001): AnyRef
PSI (Population Stability Index) is a measure of the difference between two probability distributions. However, it is undefined when a bin has zero elements in either distribution, since it is meant for continuous measures. To support PSI for discrete measures, a small eps value is added to perturb the distribution in the bins.
Existing rules of thumb: PSI < 0.10 means "little shift", 0.10 < PSI < 0.25 means "moderate shift", and PSI > 0.25 means "significant shift, action required". See https://scholarworks.wmich.edu/dissertations/3208
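
As an illustration of the quantity described here, the following is a minimal sketch of PSI over two already-binned probability mass functions. It is not the PSIKllSketch implementation: it works on plain arrays instead of KLL sketches, and the `smooth` helper (add eps to every bin, then renormalize) is an assumed, simplified way of avoiding zero bins. The names `PsiExample`, `smooth`, and `psi` are hypothetical.

```scala
object PsiExample {
  // Assumed smoothing: add eps to every bin, then renormalize so the bins sum to 1.
  private def smooth(pmf: Array[Double], eps: Double): Array[Double] = {
    val shifted = pmf.map(_ + eps)
    val total = shifted.sum
    shifted.map(_ / total)
  }

  // PSI = sum over bins of (p_i - q_i) * ln(p_i / q_i).
  def psi(reference: Array[Double], comparison: Array[Double], eps: Double = 1e-6): Double = {
    require(reference.length == comparison.length, "both PMFs must use the same bins")
    val p = smooth(reference, eps)
    val q = smooth(comparison, eps)
    p.zip(q).map { case (pi, qi) => (pi - qi) * math.log(pi / qi) }.sum
  }
}
```

Identical distributions give a PSI of zero, and a distribution with an empty bin still produces a finite value because of the smoothing step.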
def SeriesFinalizer(key: String, value: AnyRef): AnyRef
Post-processing for finalized values or IRs when generating a time series of stats. For percentiles, for example, the values are reduced to 5 in order to generate candlesticks.
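
The reduction to five values can be sketched as follows. Which percentiles are kept, and how the input array is laid out, is not stated here, so both are assumptions made for illustration: the input is assumed to hold 21 values at 0%, 5%, ..., 100%, and the five points kept are assumed to be p5, p25, p50, p75, and p95. `CandlestickExample` and `toCandlestick` are hypothetical names.

```scala
object CandlestickExample {
  // Assumed candlestick points (in percent); the real selection may differ.
  val candlestickPoints: Seq[Int] = Seq(5, 25, 50, 75, 95)

  // Assumed layout: percentiles(i) holds the value at percentile i * 5.
  def toCandlestick(percentiles: Array[Double]): Array[Double] = {
    require(percentiles.length == 21, "expected values at 0%, 5%, ..., 100%")
    candlestickPoints.map(p => percentiles(p / 5)).toArray
  }
}
```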
def anyTransforms(column: String): Seq[MetricTransform]
Stats applied to any column
final def asInstanceOf[T0]: T0
Definition Classes
Any
def buildAggPart(m: MetricTransform): AggregationPart
def buildAggregator(metrics: Seq[MetricTransform], selectedSchema: StructType): RowAggregator
Build RowAggregator to use for computing stats on a dataframe based on metrics
def buildMetrics(fields: Seq[(String, DataType)]): Seq[MetricTransform]
Define the metrics to be aggregated for the given data schema
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @native()
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable])
val finalizedPercentilesMerged: Array[Double]
val finalizedPercentilesSeries: Array[Double]
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@native()
val ignoreColumns: Seq[String]
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
def lInfKllSketch(sketch1: AnyRef, sketch2: AnyRef, bins: Int = 128): AnyRef
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@native()
val nullRateSuffix: String
val nullSuffix: String
def numericTransforms(column: String): Seq[MetricTransform]
Stats applied to numeric columns
def regularize(doubles: Array[Double], eps: Double): Array[Double]
Given a PMF, add and subtract small values to keep a valid probability distribution without zeros
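
One way to read "add and subtract small values" is sketched below. This is an assumed interpretation, not the actual body of regularize: every zero bin receives eps, and the total mass added is subtracted evenly from the non-zero bins so the distribution still sums to 1. It also assumes eps is small enough that no non-zero bin becomes negative.

```scala
object RegularizeExample {
  def regularize(pmf: Array[Double], eps: Double): Array[Double] = {
    val zeros = pmf.count(_ == 0.0)
    val nonZeros = pmf.length - zeros
    if (zeros == 0 || nonZeros == 0) {
      // Nothing to fix, or no mass available to redistribute.
      pmf.clone()
    } else {
      // Mass given to the zero bins is taken evenly from the non-zero bins.
      val debit = eps * zeros / nonZeros
      pmf.map(v => if (v == 0.0) eps else v - debit)
    }
  }
}
```

For example, `RegularizeExample.regularize(Array(0.5, 0.5, 0.0), 1e-6)` moves a total mass of 1e-6 into the empty bin and leaves the sum unchanged.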
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
val totalColumn: String
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
object InputTransform extends Enumeration
InputTransform acts as a signal of how to process the metric.
IsNull: Check if the input is null.
Raw: Operate on the input column.
One: lit(true) in Spark. Used for row counts, which are leveraged to obtain null rate values.
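
The three transforms, and how IsNull and One combine into a null rate, can be illustrated with a small sketch that uses plain Scala collections rather than Spark. The enumeration below is a local stand-in declared for this example, and `applyTransform` and `nullRate` are hypothetical helpers, not members of StatsGenerator.

```scala
object InputTransformExample {
  // Local stand-in for the enumeration described above.
  object InputTransform extends Enumeration {
    val IsNull, Raw, One = Value
  }

  // Hypothetical: map one input cell to the value fed into the aggregation.
  def applyTransform(t: InputTransform.Value, cell: Option[Double]): Option[Double] =
    t match {
      case InputTransform.IsNull => Some(if (cell.isEmpty) 1.0 else 0.0) // 1 when null
      case InputTransform.Raw    => cell                                 // the value itself
      case InputTransform.One    => Some(1.0)                            // constant, counts rows
    }

  // Null rate = sum of IsNull values divided by sum of One values (the row count).
  def nullRate(column: Seq[Option[Double]]): Double = {
    val nulls = column.flatMap(applyTransform(InputTransform.IsNull, _)).sum
    val total = column.flatMap(applyTransform(InputTransform.One, _)).sum
    if (total == 0.0) 0.0 else nulls / total
  }
}
```

A column with two nulls out of four rows, such as `Seq(Some(1.0), None, Some(3.0), None)`, yields a null rate of 0.5.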
