Class

org.apache.spark.sql

DataFrame

class DataFrame extends Serializable

:: Experimental :: A distributed collection of data organized into named columns.

A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.

val people = sqlContext.read.parquet("...")  // in Scala
DataFrame people = sqlContext.read().parquet("...")  // in Java

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use the apply method in Scala and the col method in Java.

val ageCol = people("age")  // in Scala
Column ageCol = people.col("age")  // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java

A more concrete example in Scala:

// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))

and in Java:

// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
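
The Scala pipeline above reads Parquet files that are not shown. As a minimal runnable sketch, the version below swaps them for small in-memory data sets; the Person and Department case classes, the sample rows, and the local master are assumptions made for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{avg, max}

object PeopleByDepartment {
  // Hypothetical record types standing in for the Parquet data sets.
  case class Person(name: String, age: Int, gender: String, deptId: Int, salary: Double)
  case class Department(id: Int, name: String)

  def summarize(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    val sc = sqlContext.sparkContext
    val people = sc.parallelize(Seq(
      Person("Ann", 35, "F", 1, 5000.0),
      Person("Bob", 45, "M", 1, 4000.0),
      Person("Cid", 25, "M", 2, 3000.0))).toDF()
    val department = sc.parallelize(Seq(
      Department(1, "Engineering"),
      Department(2, "Sales"))).toDF()

    // Keep people older than 30, attach their department, then aggregate
    // per (department name, gender).
    people.filter("age > 30")
      .join(department, people("deptId") === department("id"))
      .groupBy(department("name"), people("gender"))
      .agg(avg(people("salary")), max(people("age")))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("people-by-department"))
    try summarize(new SQLContext(sc)).show() finally sc.stop()
  }
}
```

With this sample data only the Engineering department survives the age filter, so the result has one row per gender in that department.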
Annotations
@Experimental()
Since

1.3.0

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new DataFrame(sqlContext: SQLContext, logicalPlan: LogicalPlan)

    A constructor that automatically analyzes the logical plan.

    Errors are reported eagerly, as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def agg(expr: Column, exprs: Column*): DataFrame

    Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(max($"age"), avg($"salary"))
    df.groupBy().agg(max($"age"), avg($"salary"))
    Annotations
    @varargs()
    Since

    1.3.0

  5. def agg(exprs: Map[String, String]): DataFrame

    (Java-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(Map("age" -> "max", "salary" -> "avg"))
    df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Since

    1.3.0

  6. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(Map("age" -> "max", "salary" -> "avg"))
    df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Since

    1.3.0

  7. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg("age" -> "max", "salary" -> "avg")
    df.groupBy().agg("age" -> "max", "salary" -> "avg")
    Since

    1.3.0
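
The agg variants above should produce the same values whether the aggregates are given as Column expressions or as (column name, function name) pairs. The following is a minimal sketch of that equivalence; the sample rows, column names, and local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, max}

object AggExample {
  // Returns (max age, average salary) computed twice: once with Column
  // expressions and once with (column name -> function name) pairs.
  def maxAgeAndAvgSalary(sqlContext: SQLContext): ((Int, Double), (Int, Double)) = {
    import sqlContext.implicits._
    // Hypothetical sample data.
    val df = sqlContext.sparkContext
      .parallelize(Seq(("Ann", 30, 2000.0), ("Bob", 40, 4000.0)))
      .toDF("name", "age", "salary")

    val byColumns = df.agg(max($"age"), avg($"salary")).head()
    val byPairs = df.agg("age" -> "max", "salary" -> "avg").head()
    ((byColumns.getInt(0), byColumns.getDouble(1)),
     (byPairs.getInt(0), byPairs.getDouble(1)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("agg-example"))
    try println(maxAgeAndAvgSalary(new SQLContext(sc))) finally sc.stop()
  }
}
```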

  8. def apply(colName: String): Column

    Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column such as a.b.

    Since

    1.3.0

  9. def as(alias: Symbol): DataFrame

    (Scala-specific) Returns a new DataFrame with an alias set.

    Since

    1.3.0

  10. def as(alias: String): DataFrame

    Returns a new DataFrame with an alias set.

    Since

    1.3.0

  11. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  12. def cache(): DataFrame.this.type

    Since

    1.3.0

  13. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  14. def coalesce(numPartitions: Int): DataFrame

    Returns a new DataFrame that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: for example, if you go from 1000 partitions to 100 partitions, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions.

    Since

    1.4.0
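
As a rough illustration of the partition counts involved, the sketch below first repartitions a small DataFrame to 8 partitions and then coalesces it down to 2, reading the counts from the underlying RDD. The data, the partition numbers, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CoalesceExample {
  // Returns the partition count after repartition(8) and after coalesce(2).
  def partitionCounts(sqlContext: SQLContext): (Int, Int) = {
    import sqlContext.implicits._
    // Hypothetical data: a single integer column named "n".
    val df = sqlContext.sparkContext.parallelize(1 to 1000).map(Tuple1(_)).toDF("n")

    val wide = df.repartition(8)   // shuffles the rows into 8 partitions
    val narrow = wide.coalesce(2)  // merges existing partitions without a shuffle
    (wide.rdd.partitions.length, narrow.rdd.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("coalesce-example"))
    try println(partitionCounts(new SQLContext(sc))) finally sc.stop()
  }
}
```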

  15. def col(colName: String): Column

    Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column such as a.b.

    Since

    1.3.0

  16. def collect(): Array[Row]

    Returns an array that contains all of the Rows in this DataFrame.

    Since

    1.3.0

  17. def collectAsList(): List[Row]

    Returns a Java list that contains all of the Rows in this DataFrame.

    Since

    1.3.0

  18. def columns: Array[String]

    Returns all column names as an array.

    Since

    1.3.0

  19. def count(): Long

    Returns the number of rows in the DataFrame.

    Since

    1.3.0

  20. def cube(col1: String, cols: String*): GroupedData

    Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns cubed by department and group.
    df.cube("department", "group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    df.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  21. def cube(cols: Column*): GroupedData

    Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns cubed by department and group.
    df.cube($"department", $"group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    df.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  22. def describe(cols: String*): DataFrame

    Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.

    df.describe("age", "height").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // max     92.0  192.0
    Annotations
    @varargs()
    Since

    1.3.1
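
A minimal runnable sketch of describe follows. It only inspects the labels in the first (summary) column, since the description above gives no compatibility guarantee for the rest of the schema; the sample rows and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DescribeExample {
  // Returns the labels found in the first column of the describe() result.
  def summaryLabels(sqlContext: SQLContext): Seq[String] = {
    import sqlContext.implicits._
    // Hypothetical sample data.
    val df = sqlContext.sparkContext
      .parallelize(Seq(("Ann", 30, 160.0), ("Bob", 40, 180.0)))
      .toDF("name", "age", "height")

    val stats = df.describe("age", "height")
    stats.show()
    stats.collect().map(_.getString(0)).toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("describe-example"))
    try println(summaryLabels(new SQLContext(sc))) finally sc.stop()
  }
}
```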

  23. def distinct(): DataFrame

    Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

    Since

    1.3.0

  24. def drop(col: Column): DataFrame

    Returns a new DataFrame with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.

    Since

    1.4.1

  25. def drop(colName: String): DataFrame

    Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.

    Since

    1.4.0

  26. def dropDuplicates(colNames: Array[String]): DataFrame

    Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

    Since

    1.4.0

  27. def dropDuplicates(colNames: Seq[String]): DataFrame

    (Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

    Since

    1.4.0

  28. def dropDuplicates(): DataFrame

    Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

    Since

    1.4.0
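
To make the difference between whole-row and subset-based deduplication concrete, here is a minimal sketch comparing distinct() with dropDuplicates on a single column. The sample rows, column names, and local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DropDuplicatesExample {
  // Returns (rows left after distinct, rows left after deduplicating on "key").
  def counts(sqlContext: SQLContext): (Long, Long) = {
    import sqlContext.implicits._
    // Hypothetical sample data with one fully duplicated row.
    val df = sqlContext.sparkContext
      .parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
      .toDF("key", "value")

    val uniqueRows = df.distinct().count()                // compares entire rows
    val uniqueKeys = df.dropDuplicates(Seq("key")).count() // compares only "key"
    (uniqueRows, uniqueKeys)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("drop-duplicates-example"))
    try println(counts(new SQLContext(sc))) finally sc.stop()
  }
}
```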

  29. def dtypes: Array[(String, String)]

    Returns all column names and their data types as an array.

    Since

    1.3.0

  30. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  31. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  32. def except(other: DataFrame): DataFrame

    Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.

    Since

    1.3.0

  33. def explain(): Unit

    Prints only the physical plan to the console for debugging purposes.

    Since

    1.3.0

  34. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Since

    1.3.0

  35. def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

    (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

    df.explode("words", "word"){words: String => words.split(" ")}
    Since

    1.3.0
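
The single-column explode can be sketched end to end as follows: each input row is repeated once per word, and the original columns stay attached to every generated value. The sample rows, the column names, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ExplodeExample {
  // Returns the (id, word) pairs produced by exploding the "words" column.
  def idWordPairs(sqlContext: SQLContext): Set[(Int, String)] = {
    import sqlContext.implicits._
    // Hypothetical sample data.
    val df = sqlContext.sparkContext
      .parallelize(Seq((1, "to be"), (2, "or not")))
      .toDF("id", "words")

    // One output row per word; "id" and "words" are carried along.
    val exploded = df.explode("words", "word") { words: String => words.split(" ").toSeq }
    exploded.select("id", "word").collect().map(r => (r.getInt(0), r.getString(1))).toSet
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("explode-example"))
    try println(idWordPairs(new SQLContext(sc))) finally sc.stop()
  }
}
```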

  36. def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

    (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

    The following example uses this function to count the number of books which contain a given word:

    case class Book(title: String, words: String)
    val df: DataFrame = ...  // a DataFrame built from Book records
    
    case class Word(word: String)
    val allWords = df.explode('words) {
      case Row(words: String) => words.split(" ").map(Word(_))
    }
    
    val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
    Since

    1.3.0

  37. def filter(conditionExpr: String): DataFrame

    Filters rows using the given SQL expression.

    peopleDf.filter("age > 15")
    Since

    1.3.0

  38. def filter(condition: Column): DataFrame

    Filters rows using the given condition.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    Since

    1.3.0

  39. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  40. def first(): Row

    Returns the first row. Alias for head().

    Since

    1.3.0

  41. def flatMap[R](f: (Row) ⇒ TraversableOnce[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

    Since

    1.3.0

  42. def foreach(f: (Row) ⇒ Unit): Unit

    Applies a function f to all rows.

    Since

    1.3.0

  43. def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

    Applies a function f to each partition of this DataFrame.

    Since

    1.3.0

  44. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  45. def groupBy(col1: String, cols: String*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns grouped by department.
    df.groupBy("department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.3.0

  46. def groupBy(cols: Column*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns grouped by department.
    df.groupBy($"department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.3.0

  47. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  48. def head(): Row

    Returns the first row.

    Since

    1.3.0

  49. def head(n: Int): Array[Row]

    Returns the first n rows.

    Since

    1.3.0

  50. def inputFiles: Array[String]

    Returns a best-effort snapshot of the files that compose this DataFrame. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

  51. def intersect(other: DataFrame): DataFrame

    Returns a new DataFrame containing only the rows that appear in both this frame and another frame. This is equivalent to INTERSECT in SQL.

    Since

    1.3.0

  52. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  53. def isLocal: Boolean

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Since

    1.3.0

  54. def javaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Since

    1.3.0

  55. def javaToPython: JavaRDD[Array[Byte]]

    Converts a JavaRDD to a PythonRDD.

    Attributes
    protected[org.apache.spark.sql]
  56. def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

    Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

    // Scala:
    import org.apache.spark.sql.functions._
    df1.join(df2, $"df1Key" === $"df2Key", "outer")
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
    right

    Right side of the join.

    joinExprs

    Join expression.

    joinType

    One of: inner, outer, left_outer, right_outer, leftsemi.

    Since

    1.3.0
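
To show how the joinType values listed above change the result, the following minimal sketch joins two small DataFrames with each type and counts the resulting rows. The sample rows, the value column names, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinTypeExample {
  // Returns the number of result rows for each join type.
  def rowCounts(sqlContext: SQLContext): Map[String, Long] = {
    import sqlContext.implicits._
    val sc = sqlContext.sparkContext
    // Hypothetical sample data: only key 2 exists on both sides.
    val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("df1Key", "leftVal")
    val df2 = sc.parallelize(Seq((2, "x"), (3, "y"))).toDF("df2Key", "rightVal")

    val joinTypes = Seq("inner", "outer", "left_outer", "right_outer", "leftsemi")
    joinTypes.map { joinType =>
      joinType -> df1.join(df2, $"df1Key" === $"df2Key", joinType).count()
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join-type-example"))
    try println(rowCounts(new SQLContext(sc))) finally sc.stop()
  }
}
```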

  57. def join(right: DataFrame, joinExprs: Column): DataFrame

    Inner join with another DataFrame, using the given join expression.

    // The following two are equivalent:
    df1.join(df2, $"df1Key" === $"df2Key")
    df1.join(df2).where($"df1Key" === $"df2Key")
    Since

    1.3.0

  58. def join(right: DataFrame, usingColumns: Seq[String]): DataFrame

    Inner equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the columns "user_id" and "user_name"
    df1.join(df2, Seq("user_id", "user_name"))

    Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. These columns must exist on both sides.

    Since

    1.4.0
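
The point that the join column appears only once in the output can be checked with a small sketch like the one below. The two DataFrames, their column names, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinUsingExample {
  // Returns the output column names and the number of joined rows.
  def joinedColumnsAndCount(sqlContext: SQLContext): (Seq[String], Long) = {
    import sqlContext.implicits._
    val sc = sqlContext.sparkContext
    // Hypothetical sample data: only user 1 exists on both sides.
    val users = sc.parallelize(Seq((1, "ann"), (2, "bob"))).toDF("user_id", "user_name")
    val scores = sc.parallelize(Seq((1, 9.5), (3, 7.0))).toDF("user_id", "score")

    val joined = users.join(scores, Seq("user_id"))
    (joined.columns.toSeq, joined.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join-using-example"))
    try println(joinedColumnsAndCount(new SQLContext(sc))) finally sc.stop()
  }
}
```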

  59. def join(right: DataFrame, usingColumn: String): DataFrame

    Inner equi-join with another DataFrame using the given column.

    Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the column "user_id"
    df1.join(df2, "user_id")

    Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

    right

    Right side of the join operation.

    usingColumn

    Name of the column to join on. This column must exist on both sides.

    Since

    1.4.0

  60. def join(right: DataFrame): DataFrame

    Cartesian join with another DataFrame.

    Note that cartesian joins are very expensive without an extra filter that can be pushed down.

    right

    Right side of the join operation.

    Since

    1.3.0

  61. def limit(n: Int): DataFrame

    Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.

    Since

    1.3.0

  62. val logicalPlan: LogicalPlan

    Attributes
    protected[org.apache.spark.sql]
  63. def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to all rows of this DataFrame.

    Since

    1.3.0

  64. def mapPartitions[R](f: (Iterator[Row]) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to each partition of this DataFrame.

    Since

    1.3.0

  65. def na: DataFrameNaFunctions

    Returns a DataFrameNaFunctions for working with missing data.

    // Dropping rows containing any null values.
    df.na.drop()
    Since

    1.3.1
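
A minimal runnable sketch of na.drop() follows. It builds a DataFrame in which one row has a null value (modelled with an Option) and counts rows before and after dropping; the sample rows, column names, and local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NaDropExample {
  // Returns (row count before, row count after) dropping rows with nulls.
  def counts(sqlContext: SQLContext): (Long, Long) = {
    import sqlContext.implicits._
    // Hypothetical sample data: None becomes a null in the "value" column.
    val rows: Seq[(String, Option[Int])] = Seq(("a", Some(1)), ("b", None))
    val df = sqlContext.sparkContext.parallelize(rows).toDF("key", "value")

    (df.count(), df.na.drop().count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("na-drop-example"))
    try println(counts(new SQLContext(sc))) finally sc.stop()
  }
}
```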

  66. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  67. final def notify(): Unit

    Definition Classes
    AnyRef
  68. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  69. def numericColumns: Seq[Expression]

    Attributes
    protected[org.apache.spark.sql]
  70. def orderBy(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
    Since

    1.3.0

  71. def orderBy(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
    Since

    1.3.0

  72. def persist(newLevel: StorageLevel): DataFrame.this.type

    Since

    1.3.0

  73. def persist(): DataFrame.this.type

    Since

    1.3.0

  74. def printSchema(): Unit

    Prints the schema to the console in a nice tree format.

    Since

    1.3.0

  75. val queryExecution: QueryExecution

  76. def randomSplit(weights: Array[Double]): Array[DataFrame]

    Randomly splits this DataFrame with the provided weights.

    weights

    Weights for the splits; they are normalized if they don't sum to 1.

    Since

    1.4.0

  77. def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]

    Randomly splits this DataFrame with the provided weights.

    weights

    Weights for the splits; they are normalized if they don't sum to 1.

    seed

    Seed for sampling.

    Since

    1.4.0
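
As a sketch of how the weights and seed are used, the example below splits 100 rows with weights 4.0 and 1.0, which do not sum to 1 and are therefore normalized. The exact split sizes depend on the seed, so only the total is checked; the data, the weights, the seed value, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RandomSplitExample {
  // Returns the row counts of the two splits.
  def splitCounts(sqlContext: SQLContext): (Long, Long) = {
    import sqlContext.implicits._
    // Hypothetical data: the integers 1 to 100 in a column named "n".
    val df = sqlContext.sparkContext.parallelize(1 to 100).map(Tuple1(_)).toDF("n")

    // Weights 4.0 and 1.0 are normalized to 0.8 and 0.2.
    val Array(train, test) = df.randomSplit(Array(4.0, 1.0), seed = 42L)
    (train.count(), test.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("random-split-example"))
    try println(splitCounts(new SQLContext(sc))) finally sc.stop()
  }
}
```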

  78. lazy val rdd: RDD[Row]

    Represents the content of the DataFrame as an RDD of Rows. Note that the RDD is memoized. Once called, it won't change even if you change any query planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).

    Since

    1.3.0

  79. def registerTempTable(tableName: String): Unit

    Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.

    Since

    1.3.0
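
Once registered, the temporary table can be queried with SQL through the same SQLContext. The following is a minimal sketch; the table name people, the sample rows, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TempTableExample {
  // Returns the names of people older than 30, selected through SQL.
  def namesOlderThan30(sqlContext: SQLContext): Seq[String] = {
    import sqlContext.implicits._
    // Hypothetical sample data.
    val df = sqlContext.sparkContext
      .parallelize(Seq(("Ann", 30), ("Bob", 40)))
      .toDF("name", "age")

    // The table is visible only through the SQLContext that created df.
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 30")
      .collect().map(_.getString(0)).toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("temp-table-example"))
    try println(namesOlderThan30(new SQLContext(sc))) finally sc.stop()
  }
}
```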

  80. def repartition(numPartitions: Int): DataFrame

    Returns a new DataFrame that has exactly numPartitions partitions.

    Since

    1.3.0

  81. def resolve(colName: String): NamedExpression

    Attributes
    protected[org.apache.spark.sql]
  82. def rollup(col1: String, cols: String*): GroupedData

    Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns rolled up by department and group.
    df.rollup("department", "group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    df.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  83. def rollup(cols: Column*): GroupedData

    Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns rolled up by department and group.
    df.rollup($"department", $"group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    df.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  84. def sample(withReplacement: Boolean, fraction: Double): DataFrame

    Returns a new DataFrame by sampling a fraction of rows, using a random seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    Since

    1.3.0

  85. def sample(withReplacement: Boolean, fraction: Double, seed: Long): DataFrame

    Returns a new DataFrame by sampling a fraction of rows.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    seed

    Seed for sampling.

    Since

    1.3.0

  86. def schema: StructType

    Returns the schema of this DataFrame.

    Since

    1.3.0

  87. def select(col: String, cols: String*): DataFrame

    Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

    // The following two are equivalent:
    df.select("colA", "colB")
    df.select($"colA", $"colB")
    Annotations
    @varargs()
    Since

    1.3.0

  88. def select(cols: Column*): DataFrame

    Selects a set of column-based expressions.

    df.select($"colA", $"colB" + 1)
    Annotations
    @varargs()
    Since

    1.3.0

  89. def selectExpr(exprs: String*): DataFrame

    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

    df.selectExpr("colA", "colB as newName", "abs(colC)")
    Annotations
    @varargs()
    Since

    1.3.0
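
The selectExpr call above can be run against a one-row DataFrame to see the alias and the SQL function take effect. This is a minimal sketch; the sample row and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SelectExprExample {
  // Returns the name of the aliased column and the value of abs(colC).
  def aliasAndAbs(sqlContext: SQLContext): (String, Int) = {
    import sqlContext.implicits._
    // Hypothetical sample data: a single row with a negative colC.
    val df = sqlContext.sparkContext
      .parallelize(Seq((1, 2, -3)))
      .toDF("colA", "colB", "colC")

    val out = df.selectExpr("colA", "colB as newName", "abs(colC)")
    (out.columns(1), out.head().getInt(2))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("select-expr-example"))
    try println(aliasAndAbs(new SQLContext(sc))) finally sc.stop()
  }
}
```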

  90. def show(numRows: Int, truncate: Boolean): Unit

    Displays the DataFrame in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters are truncated and all cells are right-aligned.

    Since

    1.5.0

  91. def show(truncate: Boolean): Unit

    Displays the top 20 rows of DataFrame in a tabular form.

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters are truncated and all cells are right-aligned.

    Since

    1.5.0

  92. def show(): Unit

    Displays the top 20 rows of DataFrame in a tabular form. Strings longer than 20 characters are truncated, and all cells are right-aligned.

    Since

    1.3.0

  93. def show(numRows: Int): Unit

    Displays the DataFrame in a tabular form. Strings longer than 20 characters are truncated, and all cells are right-aligned. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Since

    1.3.0

  94. def sort(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions. For example:

    df.sort($"col1", $"col2".desc)
    Annotations
    @varargs()
    Since

    1.3.0

  95. def sort(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the specified columns, all in ascending order.

    // The following 3 are equivalent
    df.sort("sortcol")
    df.sort($"sortcol")
    df.sort($"sortcol".asc)
    Annotations
    @varargs()
    Since

    1.3.0
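
    The two sort overloads can be contrasted with a short sketch. It assumes a local SparkContext; the object name, the helper sortByDeptThenAge, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object SortExample {
  // Sorts by dept ascending, then by age descending within each dept.
  def sortByDeptThenAge(df: DataFrame): DataFrame =
    df.sort(df("dept").asc, df("age").desc)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sort-example"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq(("eng", 30), ("ops", 41), ("eng", 52)).toDF("dept", "age")
      people.sort("dept", "age").show()   // String overload: every column ascending
      sortByDeptThenAge(people).show()    // Column overload: direction chosen per column
    } finally sc.stop()
  }
}
```

    The String overload is convenient for plain ascending sorts; the Column overload is needed as soon as any column should be sorted in descending order.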

  96. val sqlContext: SQLContext

  97. def stat: DataFrameStatFunctions

    Returns a DataFrameStatFunctions for working with statistical functions.

    // Finding frequent items in column with name 'a'.
    df.stat.freqItems(Seq("a"))
    Since

    1.4.0

  98. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  99. def take(n: Int): Array[Row]

    Returns the first n rows in the DataFrame.

    Since

    1.3.0

  100. def toDF(colNames: String*): DataFrame

    Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame with column names _1 and _2
    rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
    Annotations
    @varargs()
    Since

    1.3.0

  101. def toDF(): DataFrame

    Returns the object itself.

    Since

    1.3.0

  102. def toJSON: RDD[String]

    Returns the content of the DataFrame as an RDD of JSON strings.

    Since

    1.3.0
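
    A small sketch of toJSON, assuming a local SparkContext. The object name, the helper jsonLines, and the sample data are hypothetical, and collecting to the driver is only sensible for small DataFrames.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ToJsonExample {
  // Collects the JSON strings to the driver, one string per row.
  def jsonLines(df: DataFrame): Array[String] = df.toJSON.collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("to-json"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq((1, "Ann"), (2, "Bob")).toDF("id", "name")
      jsonLines(people).foreach(println)  // prints one JSON object per row
    } finally sc.stop()
  }
}
```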

  103. def toJavaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Since

    1.3.0

  104. def toString(): String

    Definition Classes
    DataFrame → AnyRef → Any
  105. def unionAll(other: DataFrame): DataFrame

    Returns a new DataFrame containing the union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.

    Since

    1.3.0
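
    Because unionAll follows UNION ALL semantics, duplicate rows are kept. The sketch below illustrates this and shows one way to drop them afterwards with dropDuplicates. It assumes a local SparkContext; the object name, the helper unionCounts, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnionAllExample {
  // Returns (row count after unionAll, row count after removing duplicates).
  def unionCounts(sqlContext: SQLContext): (Long, Long) = {
    import sqlContext.implicits._
    val a = Seq((1, "x"), (2, "y")).toDF("id", "tag")
    val b = Seq((2, "y"), (3, "z")).toDF("id", "tag")
    val all = a.unionAll(b)              // keeps the duplicate (2, "y") row
    val deduped = all.dropDuplicates()   // removes it, comparable to SQL UNION
    (all.count(), deduped.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("union-all"))
    try {
      val (withDups, withoutDups) = unionCounts(new SQLContext(sc))
      println(s"unionAll: $withDups rows, after dropDuplicates: $withoutDups rows")
    } finally sc.stop()
  }
}
```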

  106. def unpersist(): DataFrame.this.type

    Since

    1.3.0

  107. def unpersist(blocking: Boolean): DataFrame.this.type

    Since

    1.3.0

  108. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  109. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  110. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  111. def where(conditionExpr: String): DataFrame

    Filters rows using the given SQL expression.

    peopleDf.where("age > 15")
    Since

    1.5.0

  112. def where(condition: Column): DataFrame

    Filters rows using the given condition. This is an alias for filter.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    Since

    1.3.0

  113. def withColumn(colName: String, col: Column): DataFrame

    Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

    Since

    1.3.0
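
    Both behaviours, adding and replacing, can be sketched in a few lines. The example assumes a local SparkContext; the object name, the helper addAndReplace, and the column names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object WithColumnExample {
  // First adds a derived column, then replaces the existing "age" column.
  def addAndReplace(df: DataFrame): DataFrame = {
    val withExtra = df.withColumn("agePlus10", df("age") + 10)  // new name: column is added
    withExtra.withColumn("age", withExtra("age") * 2)           // existing name: column is replaced
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("with-column"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq(("Ann", 30)).toDF("name", "age")
      addAndReplace(people).show()
    } finally sc.stop()
  }
}
```

    The original DataFrame is left unchanged; each call returns a new DataFrame.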

  114. def withColumnRenamed(existingName: String, newName: String): DataFrame

    Returns a new DataFrame with a column renamed. This is a no-op if the schema doesn't contain existingName.

    Since

    1.3.0
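
    The no-op behaviour for a missing column can be illustrated with a short sketch. It assumes a local SparkContext; the object name, the helper renameTwice, and the column names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object RenameExample {
  // The first rename applies; the second refers to a column that does not exist.
  def renameTwice(df: DataFrame): DataFrame =
    df.withColumnRenamed("id", "personId")
      .withColumnRenamed("missing", "ignored")  // no-op: "missing" is not in the schema

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rename"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq((1, "Ann")).toDF("id", "name")
      println(renameTwice(people).columns.mkString(", "))
    } finally sc.stop()
  }
}
```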

  115. def write: DataFrameWriter

    :: Experimental :: Interface for saving the content of the DataFrame out into external storage.

    Annotations
    @Experimental()
    Since

    1.4.0
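
    A minimal round trip through the writer interface might look like the sketch below. It assumes a local SparkContext and a writable temporary directory; the object name, the helper roundTrip, and the path are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}

object WriteExample {
  // Writes df as Parquet to path, then reads it back through the same SQLContext.
  def roundTrip(sqlContext: SQLContext, df: DataFrame, path: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).parquet(path)
    sqlContext.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("write-example"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq((1, "Ann"), (2, "Bob")).toDF("id", "name")
      // Hypothetical output location inside a fresh temporary directory.
      val path = Files.createTempDirectory("df-write").resolve("people").toString
      roundTrip(sqlContext, people, path).show()
    } finally sc.stop()
  }
}
```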

Deprecated Value Members

  1. def createJDBCTable(url: String, table: String, allowExisting: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table. This will run a CREATE TABLE and a series of INSERT INTO statements. If you pass true for allowExisting, it will drop any table with the given name; if you pass false, it will throw an exception if the table already exists.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.jdbc()

  2. def insertInto(tableName: String): Unit

    Adds the rows from this DataFrame to the specified table. Throws an exception if the table already exists.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(SaveMode.Append).saveAsTable(tableName)

  3. def insertInto(tableName: String, overwrite: Boolean): Unit

    Adds the rows from this DataFrame to the specified table, optionally overwriting the existing data.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(SaveMode.Append|SaveMode.Overwrite).saveAsTable(tableName)

  4. def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table. Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs.

    The table must already exist on the database. It must have a schema that is compatible with the schema of this DataFrame; inserting the rows of the DataFrame in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.jdbc()

  5. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    (Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).options(options).save()

  6. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).options(options).save()

  7. def save(path: String, source: String, mode: SaveMode): Unit

    Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).save(path)

  8. def save(path: String, source: String): Unit

    Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).save(path)

  9. def save(path: String, mode: SaveMode): Unit

    Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(mode).save(path)

  10. def save(path: String): Unit

    Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.save(path)
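
    The replacements named in the deprecation notes for the save overloads can be sketched as follows. The example assumes a local SparkContext and a writable temporary directory; the object name, the helper saveBothWays, and the directory layout are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}

object SaveMigrationExample {
  // Writes df twice under dir, using the DataFrameWriter calls that replace the deprecated save overloads.
  def saveBothWays(df: DataFrame, dir: String): Unit = {
    // Deprecated form: df.save(dir + "/json", "json", SaveMode.Overwrite)
    df.write.format("json").mode(SaveMode.Overwrite).save(dir + "/json")
    // Deprecated form: df.save(dir + "/default")
    // Uses the default data source and fails if the path already exists.
    df.write.save(dir + "/default")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("save-migration"))
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val people = Seq((1, "Ann"), (2, "Bob")).toDF("id", "name")
      val dir = Files.createTempDirectory("df-save").toString
      saveBothWays(people, dir)
      println(sqlContext.read.format("json").load(dir + "/json").count())
    } finally sc.stop()
  }
}
```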

  11. def saveAsParquetFile(path: String): Unit

    Saves the contents of this DataFrame as a Parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.parquet(path)

  12. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    (Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).options(options).saveAsTable(tableName)

  13. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).options(options).saveAsTable(tableName)

  14. def saveAsTable(tableName: String, source: String, mode: SaveMode): Unit

    :: Experimental :: Creates a table from the contents of this DataFrame based on a given data source and the SaveMode specified by mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).mode(mode).saveAsTable(tableName)

  15. def saveAsTable(tableName: String, source: String): Unit

    Creates a table from the contents of this DataFrame based on a given data source, using SaveMode.ErrorIfExists as the save mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).saveAsTable(tableName)

  16. def saveAsTable(tableName: String, mode: SaveMode): Unit

    Creates a table from the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and the SaveMode specified by mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(mode).saveAsTable(tableName)

  17. def saveAsTable(tableName: String): Unit

    Creates a table from the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.saveAsTable(tableName)

  18. def toSchemaRDD: DataFrame

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) use toDF
