DataFrameStatFunctions

Instance Constructors

new DataFrameStatFunctions(underlying: org.apache.spark.sql.DataFrameStatFunctions)

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): TryAnalysis[BloomFilter]

Builds a Bloom filter over a specified column.
col
the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
numBits
expected number of bits of the filter.

Since
2.0.0
def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): TryAnalysis[BloomFilter]

Builds a Bloom filter over a specified column.
colName
name of the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
numBits
expected number of bits of the filter.

Since
2.0.0
def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): TryAnalysis[BloomFilter]

Builds a Bloom filter over a specified column.
col
the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
fpp
expected false positive probability of the filter.

Since
2.0.0
def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): TryAnalysis[BloomFilter]

Builds a Bloom filter over a specified column.
colName
name of the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
fpp
expected false positive probability of the filter.

Since
2.0.0
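The overloads above size the filter either directly (numBits) or from a target false positive probability (fpp). As a rough illustration of how the two quantities relate, the sketch below applies the textbook Bloom filter sizing formula m = -n ln(p) / (ln 2)^2. The names `BloomSizing` and `optimalNumBits` are hypothetical, and the formula is the standard one rather than a statement about what this class computes internally.

```scala
object BloomSizing {
  /** Textbook estimate of the number of bits needed for `expectedNumItems`
    * entries at false positive probability `fpp` (hypothetical helper). */
  def optimalNumBits(expectedNumItems: Long, fpp: Double): Long = {
    require(expectedNumItems > 0, "expectedNumItems must be positive")
    require(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)")
    val ln2 = math.log(2.0)
    // m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits
    (-expectedNumItems * math.log(fpp) / (ln2 * ln2)).toLong
  }
}
```

Under this formula, a lower fpp for the same expectedNumItems always asks for more bits, which is the trade-off the two overload families expose.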
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def corr(col1: String, col2: String): TryAnalysis[Double]

Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
col1
the name of the column
col2
the name of the column to calculate the correlation against
returns
The Pearson Correlation Coefficient as a Double.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
```
Since
1.4.0
def corr(col1: String, col2: String, method: String): TryAnalysis[Double]

Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
col1
the name of the column
col2
the name of the column to calculate the correlation against
returns
The Pearson Correlation Coefficient as a Double.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
res1: Double = 0.613...
```
Since
1.4.0
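For reference, the Pearson Correlation Coefficient is the sum of the products of the deviations from the mean, divided by the product of the square roots of the summed squared deviations. The following is a minimal local sketch over plain Scala sequences, not the distributed computation; `PearsonSketch` and `pearson` are hypothetical names.

```scala
object PearsonSketch {
  /** Pearson correlation of two equally sized samples (hypothetical helper).
    * Returns NaN when either sample has zero variance. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal size, at least 2 points")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // deviations of each observation from its column mean
    val dx = xs.map(_ - meanX)
    val dy = ys.map(_ - meanY)
    val numerator = dx.zip(dy).map { case (a, b) => a * b }.sum
    val denominator = math.sqrt(dx.map(d => d * d).sum) * math.sqrt(dy.map(d => d * d).sum)
    numerator / denominator
  }
}
```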
def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): TryAnalysis[CountMinSketch]

Builds a Count-min Sketch over a specified column.
col
the column over which the sketch is built
eps
relative error of the sketch
confidence
confidence of the sketch
seed
random seed
returns
a CountMinSketch over column col

Since
2.0.0
def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): TryAnalysis[CountMinSketch]

Builds a Count-min Sketch over a specified column.
col
the column over which the sketch is built
depth
depth of the sketch
width
width of the sketch
seed
random seed
returns
a CountMinSketch over column col

Since
2.0.0
def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): TryAnalysis[CountMinSketch]

Builds a Count-min Sketch over a specified column.
colName
name of the column over which the sketch is built
eps
relative error of the sketch
confidence
confidence of the sketch
seed
random seed
returns
a CountMinSketch over column colName

Since
2.0.0
def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): TryAnalysis[CountMinSketch]

Builds a Count-min Sketch over a specified column.
colName
name of the column over which the sketch is built
depth
depth of the sketch
width
width of the sketch
seed
random seed
returns
a CountMinSketch over column colName

Since
2.0.0
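The (eps, confidence) overloads and the (depth, width) overloads describe the same sketch in two ways. The sketch below shows one common way to derive dimensions from error bounds: width = ceil(2 / eps) and depth = ceil(-ln(1 - confidence) / ln 2). This parameterization is an assumption made for illustration and is not taken from this page; `SketchDimensions` and `fromErrorBounds` are hypothetical names.

```scala
object SketchDimensions {
  /** Returns (depth, width) for a target relative error and confidence,
    * under the assumed parameterization described above. */
  def fromErrorBounds(eps: Double, confidence: Double): (Int, Int) = {
    require(eps > 0.0, "eps must be positive")
    require(confidence > 0.0 && confidence < 1.0, "confidence must be in (0, 1)")
    // a smaller relative error needs more counters per row
    val width = math.ceil(2.0 / eps).toInt
    // a higher confidence needs more independent hash rows
    val depth = math.ceil(-math.log(1.0 - confidence) / math.log(2.0)).toInt
    (depth, width)
  }
}
```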
def cov(col1: String, col2: String): TryAnalysis[Double]

Calculate the sample covariance of two numerical columns of a DataFrame.
col1
the name of the first column
col2
the name of the second column
returns
the covariance of the two columns.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
```
Since
1.4.0
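"Sample covariance" means the sum of cross-deviations is divided by n - 1 rather than n. A minimal local sketch over plain Scala sequences follows; `CovarianceSketch` and `sampleCovariance` are hypothetical names, not part of this class.

```scala
object CovarianceSketch {
  /** Sample covariance (n - 1 denominator) of two equally sized samples. */
  def sampleCovariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal size, at least 2 points")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // sum of products of paired deviations from the means
    val crossSum = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    crossSum / (xs.size - 1)
  }
}
```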
def crosstab(col1: String, col2: String): TryAnalysis[DataFrame]

Computes a pair-wise frequency table of the given columns, also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and backticks will be dropped from elements if they exist.
col1
The name of the first column. Distinct items will make the first item of each row.
col2
The name of the second column. Distinct items will make the column names of the DataFrame.
returns
A DataFrame containing the contingency table.
```
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```
Since
1.4.0
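The counting behind a contingency table can be sketched locally: group identical (col1, col2) pairs and count each group, with unseen pairs reading as zero. This is only an illustration over an in-memory Seq, and `CrosstabSketch` and `pairCounts` are hypothetical names.

```scala
object CrosstabSketch {
  /** Counts how often each (first, second) pair occurs.
    * Pairs that never occur read as 0, mirroring the zero cells of the table. */
  def pairCounts[A, B](rows: Seq[(A, B)]): Map[(A, B), Long] =
    rows
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size.toLong }
      .withDefaultValue(0L)
}
```

Applied to the seven (key, value) rows of the example above, the pair (2, 1) counts 2 and the pair (1, 3) counts 0, matching the corresponding cells of the table.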
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def freqItems(cols: Seq[String]): TryAnalysis[DataFrame]

Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (http://dx.doi.org/10.1145/762471.762473), with a default support of 1%. This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting `DataFrame`.
cols
the names of the columns to search frequent items in.
returns
A Local DataFrame with the Array of frequent items for each column.

Since
1.4.0

def freqItems(cols: Seq[String], support: Double): TryAnalysis[DataFrame]

Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (http://dx.doi.org/10.1145/762471.762473). This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting `DataFrame`.
cols
the names of the columns to search frequent items in.
support
the minimum frequency for an item to be considered frequent.
returns
A Local DataFrame with the Array of frequent items for each column.

```
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
|  [1,-1.0]|
|   ...    |
+----------+
```

Since
1.4.0
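To show why false positives are possible, here is a minimal single-pass sketch in the spirit of the counter-based frequent-elements technique: keep a bounded set of counters, and when a new item arrives with no free counter, decrement every counter and drop those that reach zero. Every item whose frequency exceeds the support survives, but rarer items that arrive late can also remain. The names `FrequentItems` and `candidates`, and the choice of ceil(1 / support) - 1 counters, are assumptions for illustration and may differ from the underlying implementation.

```scala
import scala.collection.mutable

object FrequentItems {
  /** Returns a superset of the items whose relative frequency exceeds `support`. */
  def candidates[A](items: Iterable[A], support: Double): Set[A] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // assumed counter budget: ceil(1 / support) - 1, but at least one counter
    val capacity = math.max(1, math.ceil(1.0 / support).toInt - 1)
    val counters = mutable.Map.empty[A, Long]
    items.foreach { item =>
      if (counters.contains(item)) counters(item) += 1
      else if (counters.size < capacity) counters(item) = 1L
      else {
        // no free counter: decrement all counters and drop the ones that hit zero
        counters.keys.toList.foreach { key =>
          val next = counters(key) - 1
          if (next == 0L) counters.remove(key) else counters(key) = next
        }
      }
    }
    counters.keySet.toSet
  }
}
```

Because the result is only a candidate set, a second exact counting pass is needed if false positives must be removed.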

def get[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ U): U

Applies an action to the underlying DataFrameStatFunctions.
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def getWithAnalysis[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ U): TryAnalysis[U]

Applies an action to the underlying DataFrameStatFunctions. Use it for actions that can fail due to an AnalysisException.
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): TryAnalysis[DataFrame]

Returns a stratified sample without replacement based on the fraction given on each stratum.
T
stratum type
col
column that defines strata
fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed
random seed
returns
a new DataFrame that represents the stratified sample
```
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
```
Since
1.5.0
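The sampling rule can be sketched locally as one independent keep-or-drop decision per row, using the fraction registered for that row's stratum and zero for strata that are not listed. This is an illustration over an in-memory Seq only; `StratifiedSampleSketch` is a hypothetical name, and its random draws will not reproduce the rows shown in the example above.

```scala
import scala.util.Random

object StratifiedSampleSketch {
  /** Keeps each row with the probability registered for its key.
    * Keys missing from `fractions` are treated as fraction 0.0. */
  def sampleBy[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    // each row is considered once, so no row can be selected twice
    rows.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
  }
}
```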
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def transformation(f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.DataFrameStatFunctions): DataFrameStatFunctions

Applies a transformation to the underlying DataFrameStatFunctions.
def transformationWithAnalysis(f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.DataFrameStatFunctions): TryAnalysis[DataFrameStatFunctions]

Applies a transformation to the underlying DataFrameStatFunctions. Use it for transformations that can fail due to an AnalysisException.
val underlying: org.apache.spark.sql.DataFrameStatFunctions
def unpack[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.Dataset[U]): Dataset[U]

Unpacks the underlying DataFrameStatFunctions into a Dataset.
def unpackWithAnalysis[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.Dataset[U]): TryAnalysis[Dataset[U]]

Unpacks the underlying DataFrameStatFunctions into a Dataset. Use it for transformations that can fail due to an AnalysisException.
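The get, getWithAnalysis, and transformation members follow a common wrapper pattern: hold an underlying value and apply a caller-supplied function to it, optionally capturing failures as a value instead of throwing. The sketch below illustrates that pattern with standard-library types only. `StatWrapperSketch` is a hypothetical class, and `Either[Throwable, U]` is a stand-in for TryAnalysis, whose actual API is not shown on this page.

```scala
import scala.util.Try

// Hypothetical stand-in for the wrapper; A plays the role of the underlying type.
final case class StatWrapperSketch[A](underlying: A) {
  /** Applies an action and returns its result directly; failures propagate as exceptions. */
  def get[U](f: A => U): U = f(underlying)

  /** Applies an action and captures a non-fatal failure as a Left (stand-in for TryAnalysis). */
  def getWithAnalysis[U](f: A => U): Either[Throwable, U] = Try(f(underlying)).toEither

  /** Applies a transformation and re-wraps the result. */
  def transformation(f: A => A): StatWrapperSketch[A] = StatWrapperSketch(f(underlying))
}
```

With this shape, callers can decide at each call site whether a failure should surface as an exception or as a value to be inspected.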
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )


final case class DataFrameStatFunctions(underlying: org.apache.spark.sql.DataFrameStatFunctions) extends Product with Serializable

