OpStatistics

Type Members

case class ChiSquaredResults(cramersV: Double, chiSquaredStat: Double, pValue: Double) extends Product with Serializable

Case class for holding results of the Chi-squared statistical test we use for calculating Cramer's V
cramersV
Cramer's V value
chiSquaredStat
Actual Chi-squared statistic
pValue
P-value
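
How the chi-squared statistic and Cramer's V relate can be sketched in plain Scala. This is a minimal illustration, not the library's implementation: the object and method names are hypothetical, the table is a plain Array[Array[Double]] instead of a Matrix, cells with zero expected count are skipped by assumption, and the p-value is left out because it needs a chi-squared distribution function.

```scala
// Hypothetical sketch: names and zero-handling are assumptions, not the library's code.
object ChiSquaredSketch {

  /** Pearson chi-squared statistic for a contingency table given as rows of counts. */
  def chiSquaredStat(table: Array[Array[Double]]): Double = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    val terms = for {
      i <- table.indices
      j <- table(i).indices
      expected = rowSums(i) * colSums(j) / total
      if expected > 0.0 // assumption: skip cells whose row or column is empty
    } yield {
      val diff = table(i)(j) - expected
      diff * diff / expected
    }
    terms.sum
  }

  /** Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))). */
  def cramersV(table: Array[Array[Double]]): Double = {
    val n = table.map(_.sum).sum
    val minDim = math.min(table.length, table.head.length) - 1
    if (n <= 0.0 || minDim <= 0) Double.NaN // assumption: undefined for degenerate tables
    else math.sqrt(chiSquaredStat(table) / (n * minDim))
  }
}
```

With this sketch, a table whose rows each map to exactly one column gives a Cramer's V of 1, and a table with identical rows gives 0.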
case class ConfidenceResults(maxConfidences: Array[Double], supports: Array[Double]) extends Product with Serializable

Container for association rule confidence and supports
maxConfidences
Array of maximum confidence values, one per contingency matrix row
supports
Array of support values for each categorical value, one per contingency matrix row
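
The definitions of confidence and support are not spelled out above. A minimal sketch, assuming the usual association-rule definitions: the confidence of "feature value i implies label j" is the cell count divided by the row total, the maximum is taken over labels, and the support of row i is its share of all observations. The object name, the stand-in case class, and the handling of empty rows are hypothetical.

```scala
// Hypothetical sketch under assumed definitions; not the library's code.
object ConfidenceSketch {

  /** Stand-in mirroring the shape of ConfidenceResults. */
  final case class Confidences(maxConfidences: Array[Double], supports: Array[Double])

  /** Rows are feature values, columns are label values; rows must be non-empty arrays. */
  def confidences(table: Array[Array[Double]]): Confidences = {
    val rowSums = table.map(_.sum)
    val total = rowSums.sum
    // Largest conditional frequency P(label | feature value) in each row
    val maxConf = table.zip(rowSums).map { case (row, rowSum) =>
      if (rowSum > 0.0) row.max / rowSum else 0.0 // assumption: empty rows get 0
    }
    // Share of all observations that fall in each row
    val supports = rowSums.map(s => if (total > 0.0) s / total else 0.0)
    Confidences(maxConf, supports)
  }
}
```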
case class ContingencyStats(chiSquaredResults: ChiSquaredResults, pointwiseMutualInfo: Type, contingencyMatrix: Type, mutualInfo: Double, confidenceResults: ConfidenceResults) extends Product with Serializable

Container class for statistics calculated from contingency matrices constructed from categorical variables
chiSquaredResults
Chi-squared test results for the given contingency matrix
pointwiseMutualInfo
Map between feature name in feature vector and map of pointwise mutual information values between that feature and all values the label can take
contingencyMatrix
Actual (unfiltered) contingency matrix that the rest of the results are calculated from
mutualInfo
Mutual information between the feature and the label
confidenceResults
Association rule details (confidences + supports)
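
Pointwise mutual information and mutual information can both be read off a contingency matrix. The sketch below shows one way to do so, assuming base-2 logarithms and a plain array layout (one value per matrix cell). The library's own result layout, logarithm base, and treatment of empty cells may differ, and all names are hypothetical.

```scala
// Hypothetical sketch: log base, layout, and zero-handling are assumptions.
object MutualInfoSketch {

  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  /** PMI(i, j) = log2( p(i, j) / (p(i) * p(j)) ), one value per table cell. */
  def pointwiseMutualInfo(table: Array[Array[Double]]): Array[Array[Double]] = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    table.indices.map { i =>
      table(i).indices.map { j =>
        val joint = table(i)(j) / total
        val independent = (rowSums(i) / total) * (colSums(j) / total)
        // Cells that never occur get negative infinity in this sketch
        if (joint > 0.0) log2(joint / independent) else Double.NegativeInfinity
      }.toArray
    }.toArray
  }

  /** Mutual information = sum over non-empty cells of p(i, j) * PMI(i, j). */
  def mutualInfo(table: Array[Array[Double]]): Double = {
    val total = table.map(_.sum).sum
    val pmi = pointwiseMutualInfo(table)
    (for {
      i <- table.indices
      j <- table(i).indices
      if table(i)(j) > 0.0
    } yield table(i)(j) / total * pmi(i)(j)).sum
  }
}
```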

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
object LabelWiseValues

Two-element result tuple containing a map of labels to values, used for e.g. pointwise mutual information or the contingency matrix itself.
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def computeCorrelationsWithLabel(featuresAndLabel: RDD[Vector], colStats: MultivariateStatisticalSummary, numOfRows: Long): Array[Double]

Assumes that a MultivariateStatisticalSummary has already been computed on the RDD, so that information can be reused here. This defines an RDD aggregation that calculates all the correlations with the label. Data is assumed to be laid out in an RDD[org.apache.spark.mllib.linalg.Vector] where the label is the last element.
featuresAndLabel
Input RDD of vectors, each containing the feature values with the label as the last element
returns
Array of correlations of each feature vector element with the label
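
The aggregation idea can be illustrated without Spark. The sketch below is an assumption-laden stand-in, not the library's code: a sequence of partitions replaces the RDD, a hypothetical ColStats case class replaces MultivariateStatisticalSummary (holding means and sample variances), and the seqOp/combOp pair mirrors the shape of an RDD aggregate call. It assumes the result has one Pearson correlation per feature and excludes the label itself.

```scala
// Hypothetical sketch using plain collections in place of an RDD.
object CorrelationSketch {

  /** Stand-in for precomputed column statistics (means and sample variances). */
  final case class ColStats(mean: Array[Double], variance: Array[Double])

  /** Helper that builds the stand-in statistics; needs at least two rows. */
  def colStats(rows: Seq[Array[Double]]): ColStats = {
    val n = rows.size.toDouble
    val dim = rows.head.length
    val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
    val variance = Array.tabulate(dim) { j =>
      rows.map(r => (r(j) - mean(j)) * (r(j) - mean(j))).sum / (n - 1)
    }
    ColStats(mean, variance)
  }

  def correlationsWithLabel(
    partitions: Seq[Seq[Array[Double]]],
    stats: ColStats,
    numOfRows: Long
  ): Array[Double] = {
    val labelIdx = stats.mean.length - 1 // label is the last element
    val numFeatures = labelIdx

    // seqOp: add one row's centered cross-products with the label
    def seqOp(acc: Array[Double], row: Array[Double]): Array[Double] = {
      val labelDev = row(labelIdx) - stats.mean(labelIdx)
      Array.tabulate(numFeatures)(j => acc(j) + (row(j) - stats.mean(j)) * labelDev)
    }
    // combOp: merge the partial sums of two partitions
    def combOp(a: Array[Double], b: Array[Double]): Array[Double] =
      a.zip(b).map { case (x, y) => x + y }

    val zero = Array.fill(numFeatures)(0.0)
    val crossSums = partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

    val labelSd = math.sqrt(stats.variance(labelIdx))
    Array.tabulate(numFeatures) { j =>
      val cov = crossSums(j) / (numOfRows - 1)
      cov / (math.sqrt(stats.variance(j)) * labelSd) // NaN for constant columns
    }
  }
}
```

Reusing the precomputed means and variances is what lets the correlations be accumulated in a single pass over the data.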
def contingencyStats(contingency: Matrix): ContingencyStats

Calculates all of the statistics we use that come from contingency matrices between categorical features and categorical labels and stores them in a ContingencyStats case class.
contingency
Matrix of co-occurrences of feature values with label values. Each row represents a different feature choice, while each column represents a different label value.
returns
ContingencyStats object containing all the statistics we calculate from contingency matrices
def contingencyStatsFromMultiPickList(contingency: Matrix, labelCounts: Array[Double]): ContingencyStats

Same as contingencyStats method, but specialized to MultiPickLists. The standard contingency table stats are not technically valid for MultiPickLists because the choices are not independent from each other (multipicklists are multi-hot encoded instead of one-hot encoded).
There are several strategies to deal with this to calculate statistics similar to Cramer's V. We follow https://cran.r-project.org/web/packages/MRCV/vignettes/MRCV-vignette.pdf for inspiration, but use a slightly different scheme where we compute stats from a 2 x numLabels contingency matrix for each choice separately, and take the max of these Cramer's V values (one per choice) as the Cramer's V value for the entire MultiPickList. See BadFeatureZooTest for testing how this performs on different types of relations between MultiPickLists and the label.
contingency
Matrix of co-occurrences of feature values with label values. Each row represents a different feature choice, while each column represents a different label value.
labelCounts
Array of counts of each label, used to construct the 2 x numLabels contingency matrices for each choice
returns
ContingencyStats object containing all the statistics we calculate from contingency matrices
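
The per-choice scheme described above can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: tables are plain arrays, all names are hypothetical, the "absent" row is taken to be labelCounts minus the choice's row, and undefined (NaN) values are dropped before taking the maximum.

```scala
// Hypothetical sketch of the "max Cramer's V over choices" idea.
object MultiPickListSketch {

  /** Cramer's V of a table; cells with zero expected count are skipped (assumption). */
  private def cramersV(table: Array[Array[Double]]): Double = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val n = rowSums.sum
    val minDim = math.min(table.length, table.head.length) - 1
    val chi2 = (for {
      i <- table.indices
      j <- table(i).indices
      expected = rowSums(i) * colSums(j) / n
      if expected > 0.0
    } yield {
      val diff = table(i)(j) - expected
      diff * diff / expected
    }).sum
    if (n <= 0.0 || minDim <= 0) Double.NaN else math.sqrt(chi2 / (n * minDim))
  }

  /**
   * For each choice (row), build a 2 x numLabels table of
   * [rows with the choice, rows without the choice] and keep the largest Cramer's V.
   */
  def maxCramersV(contingency: Array[Array[Double]], labelCounts: Array[Double]): Double = {
    val perChoice = contingency.map { present =>
      val absent = labelCounts.zip(present).map { case (total, p) => total - p }
      cramersV(Array(present, absent))
    }
    // Assumption: ignore undefined values and default to 0.0 if none remain
    perChoice.filterNot(_.isNaN).foldLeft(0.0)((a, b) => math.max(a, b))
  }
}
```

Because each per-choice table has exactly two rows, min(rows - 1, cols - 1) is 1 whenever there are at least two labels, so each per-choice value reduces to sqrt(chi2 / n).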
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
