Class

zio.spark.sql

DataFrameStatFunctions

Related Doc: package sql

final case class DataFrameStatFunctions(underlying: org.apache.spark.sql.DataFrameStatFunctions) extends Product with Serializable

Self Type
DataFrameStatFunctions
Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any

Instance Constructors

  1. new DataFrameStatFunctions(underlying: org.apache.spark.sql.DataFrameStatFunctions)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): TryAnalysis[BloomFilter]

    Builds a Bloom filter over a specified column.

    col

    the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    numBits

    expected number of bits of the filter.

    Since

    2.0.0

  6. def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): TryAnalysis[BloomFilter]

    Builds a Bloom filter over a specified column.

    colName

    name of the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    numBits

    expected number of bits of the filter.

    Since

    2.0.0

  7. def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): TryAnalysis[BloomFilter]

    Builds a Bloom filter over a specified column.

    col

    the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    fpp

    expected false positive probability of the filter.

    Since

    2.0.0

  8. def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): TryAnalysis[BloomFilter]

    Builds a Bloom filter over a specified column.

    colName

    name of the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    fpp

    expected false positive probability of the filter.

    Since

    2.0.0
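
    The numBits and fpp overloads describe the same trade-off from two sides. A minimal sketch of the textbook Bloom filter sizing formulas, assuming the standard relations m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2; the object and method names are hypothetical and not part of zio-spark or Spark, and the sizing Spark applies internally may round differently:

    ```scala
    object BloomSizing {
      // Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, where n is the expected
      // number of items and p is the target false positive probability.
      def optimalNumBits(expectedNumItems: Long, fpp: Double): Long = {
        require(expectedNumItems > 0, "expectedNumItems must be positive")
        require(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)")
        math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
      }

      // Matching number of hash functions: k = (m / n) * ln 2, never fewer than 1.
      def optimalNumHashes(expectedNumItems: Long, numBits: Long): Int =
        math.max(1, math.round(numBits.toDouble / expectedNumItems * math.log(2)).toInt)
    }
    ```

    Under these formulas, asking for a lower fpp with the same expectedNumItems yields a larger filter, which is the cost to weigh when choosing between the two overloads.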

  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. def corr(col1: String, col2: String): TryAnalysis[Double]

    Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.

    col1

    the name of the column

    col2

    the name of the column to calculate the correlation against

    returns

    The Pearson Correlation Coefficient as a Double.

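    // rand comes from Spark SQL functions; toDF assumes the session implicits are in scope
    import org.apache.spark.sql.functions.rand
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2")
    res1: Double = 0.613...

    Since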

    1.4.0
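
    For reference, the coefficient itself can be computed on plain collections. A minimal sketch, assuming two equally sized in-memory samples; the object and method names are hypothetical, and this is not how Spark evaluates the statistic on a distributed DataFrame:

    ```scala
    object PearsonSketch {
      // Pearson correlation: covariance of x and y divided by the product of
      // their standard deviations. The 1/n factors cancel, so sums are enough.
      def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
        require(xs.size == ys.size && xs.size > 1, "need two equally sized samples with at least 2 points")
        val n     = xs.size
        val meanX = xs.sum / n
        val meanY = ys.sum / n
        val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
        val varX  = xs.map(x => (x - meanX) * (x - meanX)).sum
        val varY  = ys.map(y => (y - meanY) * (y - meanY)).sum
        covXY / math.sqrt(varX * varY)
      }
    }
    ```

    If either sample is constant, its variance is zero and this sketch returns NaN.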

  11. def corr(col1: String, col2: String, method: String): TryAnalysis[Double]

    Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.

    col1

    the name of the column

    col2

    the name of the column to calculate the correlation against

    returns

    The Pearson Correlation Coefficient as a Double.

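    // rand comes from Spark SQL functions; toDF assumes the session implicits are in scope
    import org.apache.spark.sql.functions.rand
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2", "pearson")
    res1: Double = 0.613...

    Since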

    1.4.0

  12. def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): TryAnalysis[CountMinSketch]

    Builds a Count-min Sketch over a specified column.

    col

    the column over which the sketch is built

    eps

    relative error of the sketch

    confidence

    confidence of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column col

    Since

    2.0.0

  13. def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): TryAnalysis[CountMinSketch]

    Builds a Count-min Sketch over a specified column.

    col

    the column over which the sketch is built

    depth

    depth of the sketch

    width

    width of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column col

    Since

    2.0.0

  14. def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): TryAnalysis[CountMinSketch]

    Builds a Count-min Sketch over a specified column.

    colName

    name of the column over which the sketch is built

    eps

    relative error of the sketch

    confidence

    confidence of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column colName

    Since

    2.0.0

  15. def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): TryAnalysis[CountMinSketch]

    Builds a Count-min Sketch over a specified column.

    colName

    name of the column over which the sketch is built

    depth

    depth of the sketch

    width

    width of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column colName

    Since

    2.0.0
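
    The eps/confidence overloads and the depth/width overloads parameterise the same structure. A minimal sketch of one commonly used conversion, width = ceil(2 / eps) and depth = ceil(-ln(1 - confidence) / ln 2); that Spark's CountMinSketch applies exactly this rule is an assumption here, and the object and method names are hypothetical:

    ```scala
    object CountMinSizing {
      // Returns (depth, width) for a target relative error and confidence.
      // A smaller eps widens each row; a higher confidence adds more rows.
      def dimensions(eps: Double, confidence: Double): (Int, Int) = {
        require(eps > 0.0, "eps must be positive")
        require(confidence > 0.0 && confidence < 1.0, "confidence must be in (0, 1)")
        val width = math.ceil(2.0 / eps).toInt
        val depth = math.ceil(-math.log1p(-confidence) / math.log(2)).toInt
        (depth, width)
      }
    }
    ```

    The sketch stores depth * width counters, so tightening eps or raising confidence increases memory use accordingly.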

  16. def cov(col1: String, col2: String): TryAnalysis[Double]

    Calculates the sample covariance of two numerical columns of a DataFrame.

    col1

    the name of the first column

    col2

    the name of the second column

    returns

    the covariance of the two columns.

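    // rand comes from Spark SQL functions; toDF assumes the session implicits are in scope
    import org.apache.spark.sql.functions.rand
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.cov("rand1", "rand2")
    res1: Double = 0.065...

    Since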

    1.4.0
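
    "Sample" covariance means the sum of products of deviations is divided by n - 1 rather than n. A minimal sketch on plain collections, with hypothetical names, not the distributed computation Spark performs:

    ```scala
    object CovarianceSketch {
      // Sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1).
      def sampleCov(xs: Seq[Double], ys: Seq[Double]): Double = {
        require(xs.size == ys.size && xs.size > 1, "need two equally sized samples with at least 2 points")
        val n     = xs.size
        val meanX = xs.sum / n
        val meanY = ys.sum / n
        xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / (n - 1)
      }
    }
    ```

    For example, CovarianceSketch.sampleCov(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)) evaluates to 2.0.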

  17. def crosstab(col1: String, col2: String): TryAnalysis[DataFrame]

    Computes a pair-wise frequency table of the given columns, also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and backticks will be dropped from elements if they exist.

    col1

    The name of the first column. Distinct items will make the first item of each row.

    col2

    The name of the second column. Distinct items will make the column names of the DataFrame.

    returns

    A DataFrame containing the contingency table.

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
      .toDF("key", "value")
    val ct = df.stat.crosstab("key", "value")
    ct.show()
    +---------+---+---+---+
    |key_value|  1|  2|  3|
    +---------+---+---+---+
    |        2|  2|  0|  1|
    |        1|  1|  1|  0|
    |        3|  0|  1|  1|
    +---------+---+---+---+

    Since

    1.4.0
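
    The layout of the result (one row per distinct col1 value, one column per distinct col2 value, zero for absent pairs) can be sketched on plain collections. The names below are hypothetical and the code only mirrors the table shape, not Spark's implementation or its handling of nulls and backticks:

    ```scala
    object CrosstabSketch {
      // Returns the sorted distinct second-column values (the table header) and,
      // for each distinct first-column value, one count per header entry.
      def crosstab[A: Ordering, B: Ordering](pairs: Seq[(A, B)]): (Seq[B], Map[A, Seq[Long]]) = {
        val counts: Map[(A, B), Long] =
          pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size.toLong }
        val rowKeys = pairs.map(_._1).distinct.sorted
        val colKeys = pairs.map(_._2).distinct.sorted
        // Pairs that never occur get a zero count.
        val table = rowKeys.map(r => r -> colKeys.map(c => counts.getOrElse((r, c), 0L))).toMap
        (colKeys, table)
      }
    }
    ```

    Applied to the pairs from the example above, the row for key 2 holds the counts 2, 0, 1, matching the table shown by ct.show().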

  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  20. def freqItems(cols: Seq[String]): TryAnalysis[DataFrame]

    Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm described at http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%. This function is meant for exploratory data analysis; there is no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    Since

    1.4.0

  21. def freqItems(cols: Seq[String], support: Double): TryAnalysis[DataFrame]

    Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm described at http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou. This function is meant for exploratory data analysis; there is no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    support

    The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    |  [1,-1.0]|
    |   ...    |
    +----------+

    Since

    1.4.0
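
    The "possibly with false positives" caveat follows from how counter-based frequent-item algorithms work. A minimal single-machine sketch of the classic counter scheme in the spirit of Karp, Schenker, and Papadimitriou; the names are hypothetical and Spark's distributed implementation differs in detail:

    ```scala
    object FrequentItemsSketch {
      // Keeps at most k - 1 counters, where k = ceil(1 / support). Every item that
      // occurs more than support * n times is guaranteed to survive; items below
      // that threshold may survive as well, which is the source of false positives.
      def candidates[A](items: Seq[A], support: Double): Set[A] = {
        require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
        val maxCounters = math.max(1, math.ceil(1.0 / support).toInt - 1)
        val counters    = scala.collection.mutable.Map.empty[A, Long]
        items.foreach { item =>
          if (counters.contains(item)) counters(item) += 1
          else if (counters.size < maxCounters) counters(item) = 1L
          else {
            // No free counter: decrement every counter and drop those that reach zero.
            counters.keys.toList.foreach { key =>
              val next = counters(key) - 1
              if (next == 0) counters.remove(key) else counters(key) = next
            }
          }
        }
        counters.keySet.toSet
      }
    }
    ```

    Because the result is only a candidate set, a second exact counting pass is needed when false positives are not acceptable.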

  22. def get[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ U): U

    Applies an action to the underlying DataFrameStatFunctions.

  23. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  24. def getWithAnalysis[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ U): TryAnalysis[U]

    Applies an action to the underlying DataFrameStatFunctions. Use it for operations that can fail with an AnalysisException.

  25. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  26. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  27. final def notify(): Unit

    Definition Classes
    AnyRef
  28. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  29. def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): TryAnalysis[DataFrame]

    Returns a stratified sample without replacement based on the fraction given on each stratum.

    T

    stratum type

    col

    column that defines strata

    fractions

    sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    seed

    random seed

    returns

    a new DataFrame that represents the stratified sample

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
      (3, 3))).toDF("key", "value")
    val fractions = Map(1 -> 1.0, 3 -> 0.5)
    df.stat.sampleBy("key", fractions, 36L).show()
    +---+-----+
    |key|value|
    +---+-----+
    |  1|    1|
    |  1|    2|
    |  3|    2|
    +---+-----+

    Since

    1.5.0
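
    Sampling by fraction is probabilistic: each row is kept independently with the probability of its stratum, so the sample size per stratum is not exact. A minimal sketch on plain collections, with hypothetical names and scala.util.Random standing in for Spark's per-row random column; it will not reproduce the rows Spark selects for a given seed:

    ```scala
    import scala.util.Random

    object StratifiedSampleSketch {
      // Keeps each row independently with the probability assigned to its stratum.
      // Strata missing from `fractions` get probability 0 and are dropped.
      def sampleBy[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
        require(fractions.values.forall(f => f >= 0.0 && f <= 1.0), "fractions must be in [0, 1]")
        val rng = new Random(seed)
        // nextDouble() is in [0, 1), so a fraction of 1.0 keeps every row and 0.0 keeps none.
        rows.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
      }
    }
    ```

    With fractions Map(1 -> 1.0, 3 -> 0.5), every row with key 1 is kept, rows with key 2 are always dropped, and each row with key 3 is kept with probability one half.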

  30. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  31. def transformation(f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.DataFrameStatFunctions): DataFrameStatFunctions

    Applies a transformation to the underlying DataFrameStatFunctions.

  32. def transformationWithAnalysis(f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.DataFrameStatFunctions): TryAnalysis[DataFrameStatFunctions]

    Applies a transformation to the underlying DataFrameStatFunctions. Use it for transformations that can fail with an AnalysisException.

  33. val underlying: org.apache.spark.sql.DataFrameStatFunctions

  34. def unpack[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.Dataset[U]): Dataset[U]

    Unpacks the underlying DataFrameStatFunctions into a Dataset.

  35. def unpackWithAnalysis[U](f: (org.apache.spark.sql.DataFrameStatFunctions) ⇒ org.apache.spark.sql.Dataset[U]): TryAnalysis[Dataset[U]]

    Unpacks the underlying DataFrameStatFunctions into a Dataset. Use it for operations that can fail with an AnalysisException.

  36. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
