Class

org.apache.spark.sql

Dataset

class Dataset[T] extends Queryable with Serializable with Logging

:: Experimental :: A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.

A Dataset differs from an RDD in the following ways:

• Internally, a Dataset is represented by a Catalyst logical plan, and the data is stored in an encoded binary form. This representation allows many operations (sorting, shuffling, etc.) to be performed without deserializing the data back into objects.
• The creation of a Dataset requires the presence of an explicit Encoder that can serialize each object into the binary format. Encoders also map the schema of a given object type to the Spark SQL type system, whereas RDDs rely on runtime reflection-based serialization.

A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset into a generic DataFrame by calling ds.toDF().

COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.

Annotations
@Experimental()
Since

1.6.0

Linear Supertypes
Logging, Serializable, Serializable, Queryable, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def as(alias: String): Dataset[T]

    Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.
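
    For example (an illustrative sketch; assumes two Datasets ds1 and ds2 that share an id column, with sqlContext.implicits._ in scope):

    val joined = ds1.as("a").joinWith(ds2.as("b"), $"a.id" === $"b.id")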

    Since

    1.6.0

  5. def as[U](implicit arg0: Encoder[U]): Dataset[U]

    Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:

    • When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
    • When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
    • When U is a primitive type (i.e. String, Int, etc.), then the first column of the DataFrame will be used.

    If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
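
    For example (a minimal sketch; assumes a DataFrame df with columns name and age, and sqlContext.implicits._ in scope):

    case class Person(name: String, age: Long)
    val people: Dataset[Person] = df.as[Person]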

    Since

    1.6.0

  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def cache(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Since

    1.6.0

  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. def coalesce(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
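
    For example:

    val merged = ds.coalesce(10)  // narrow dependency: no shuffle is performed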

    Since

    1.6.0

  10. def collect(): Array[T]

    Returns an array that contains all the elements in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

    For Java API, use collectAsList.

    Since

    1.6.0

  11. def collectAsList(): List[T]

    Returns a Java List that contains all the elements in this Dataset.

    Running collectAsList requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  12. def count(): Long

    Returns the number of elements in the Dataset.

    Since

    1.6.0

  13. def distinct: Dataset[T]

    Returns a new Dataset that contains only the unique elements of this Dataset.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  14. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  16. def explain(): Unit

    Prints the physical plan to the console for debugging purposes.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  17. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  18. def filter(func: FilterFunction[T]): Dataset[T]

    (Java-specific) Returns a new Dataset that only contains elements where func returns true.

    Since

    1.6.0

  19. def filter(func: (T) ⇒ Boolean): Dataset[T]

    (Scala-specific) Returns a new Dataset that only contains elements where func returns true.
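
    For example (assuming sqlContext.implicits._ is in scope):

    val evens = Seq(1, 2, 3, 4).toDS().filter(_ % 2 == 0)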

    Since

    1.6.0

  20. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. def first(): T

    Returns the first element in this Dataset.

    Since

    1.6.0

  22. def flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Since

    1.6.0

  23. def flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
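
    For example (a sketch; assumes lines is a Dataset[String] and sqlContext.implicits._ is in scope):

    val words = lines.flatMap(_.split(" "))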

    Since

    1.6.0

  24. def foreach(func: ForeachFunction[T]): Unit

    (Java-specific) Runs func on each element of this Dataset.

    Since

    1.6.0

  25. def foreach(func: (T) ⇒ Unit): Unit

    (Scala-specific) Runs func on each element of this Dataset.

    Since

    1.6.0

  26. def foreachPartition(func: ForeachPartitionFunction[T]): Unit

    (Java-specific) Runs func on each partition of this Dataset.

    Since

    1.6.0

  27. def foreachPartition(func: (Iterator[T]) ⇒ Unit): Unit

    (Scala-specific) Runs func on each partition of this Dataset.
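
    This is often used to amortize setup cost across a partition, e.g. opening one connection per partition (an illustrative sketch; createConnection and conn.send are hypothetical):

    ds.foreachPartition { records =>
      val conn = createConnection()   // hypothetical: one connection per partition
      try records.foreach(r => conn.send(r))
      finally conn.close()
    }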

    Since

    1.6.0

  28. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  29. def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T]

    (Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.

    Since

    1.6.0

  30. def groupBy(cols: Column*): GroupedDataset[Row, T]

    Returns a GroupedDataset where the data is grouped by the given Column expressions.

    Annotations
    @varargs()
    Since

    1.6.0

  31. def groupBy[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): GroupedDataset[K, T]

    (Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.
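
    For example (a minimal sketch; assumes sqlContext.implicits._ is in scope):

    val byParity = Seq(1, 2, 3, 4, 5).toDS().groupBy(_ % 2)   // GroupedDataset[Int, Int]
    val counts = byParity.count()                              // Dataset[(Int, Long)]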

    Since

    1.6.0

  32. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  33. def intersect(other: Dataset[T]): Dataset[T]

    Returns a new Dataset that contains only the elements of this Dataset that are also present in other.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  34. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  35. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  36. def joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]

    Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
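
    For example (an illustrative sketch; people and depts are hypothetical Datasets, aliased to disambiguate column names, with sqlContext.implicits._ in scope):

    val pairs = people.as("p").joinWith(depts.as("d"), $"p.deptId" === $"d.id")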

    other

    Right side of the join.

    condition

    Join expression.

    Since

    1.6.0

  37. def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

    Joins this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.

    This is similar to the relational join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

    This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

    other

    Right side of the join.

    condition

    Join expression.

    joinType

    One of: inner, outer, left_outer, right_outer, leftsemi.

    Since

    1.6.0

  38. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  39. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  40. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  41. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  42. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  43. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  44. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  45. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  46. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  47. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  48. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  49. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  50. def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

    Since

    1.6.0

  51. def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
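
    For example (assuming sqlContext.implicits._ is in scope):

    val doubled = Seq(1, 2, 3).toDS().map(_ * 2)   // Dataset[Int]: 2, 4, 6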

    Since

    1.6.0

  52. def mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying func to each partition.

    Since

    1.6.0

  53. def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
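
    This is useful when a per-element operation has expensive setup that can be shared across a partition (a sketch; Parser is a hypothetical class, and lines a Dataset[String]):

    val parsed = lines.mapPartitions { iter =>
      val parser = new Parser()   // hypothetical: built once per partition
      iter.map(parser.parse)
    }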

    Since

    1.6.0

  54. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  55. final def notify(): Unit

    Definition Classes
    AnyRef
  56. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  57. def persist(newLevel: StorageLevel): Dataset.this.type

    Persist this Dataset with the given storage level.
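
    For example:

    import org.apache.spark.storage.StorageLevel
    ds.persist(StorageLevel.MEMORY_ONLY_SER)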

    newLevel

    One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    Since

    1.6.0

  58. def persist(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Since

    1.6.0

  59. def printSchema(): Unit

    Prints the schema of the underlying Dataset to the console in a nice tree format.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  60. val queryExecution: QueryExecution

    Definition Classes
    Dataset → Queryable
  61. def rdd: RDD[T]

    Converts this Dataset to an RDD.

    Since

    1.6.0

  62. def reduce(func: ReduceFunction[T]): T

    (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative, or the result may be non-deterministic.

    Since

    1.6.0

  63. def reduce(func: (T, T) ⇒ T): T

    (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative, or the result may be non-deterministic.
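
    For example (assuming sqlContext.implicits._ is in scope):

    val sum = Seq(1, 2, 3).toDS().reduce(_ + _)   // 6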

    Since

    1.6.0

  64. def repartition(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions. Unlike coalesce, this operation performs a full shuffle of the data.

    Since

    1.6.0

  65. def sample(withReplacement: Boolean, fraction: Double): Dataset[T]

    Returns a new Dataset by sampling a fraction of records, using a random seed.

    Since

    1.6.0

  66. def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]

    Returns a new Dataset by sampling a fraction of records.
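
    For example:

    val tenPercent = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)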

    Since

    1.6.0

  67. def schema: StructType

    Returns the schema of the encoded form of the objects in this Dataset.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  68. def select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  69. def select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  70. def select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  71. def select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  72. def select[U1](c1: TypedColumn[T, U1])(implicit arg0: Encoder[U1]): Dataset[U1]

    Returns a new Dataset by computing the given Column expression for each element.

    import org.apache.spark.sql.functions.expr
    val ds = Seq(1, 2, 3).toDS()
    val newDS = ds.select(expr("value + 1").as[Int])
    Since

    1.6.0

  73. def select(cols: Column*): DataFrame

    Returns a new DataFrame by selecting a set of column-based expressions.

    df.select($"colA", $"colB" + 1)
    Attributes
    protected
    Annotations
    @varargs()
    Since

    1.6.0

  74. def selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

    Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

    Attributes
    protected
  75. def show(numRows: Int, truncate: Boolean): Unit

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.6.0

  76. def show(truncate: Boolean): Unit

    Displays the top 20 rows of the Dataset in a tabular form.

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.6.0

  77. def show(): Unit

    Displays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.

    Since

    1.6.0

  78. def show(numRows: Int): Unit

    Displays the content of this Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Since

    1.6.0

  79. val sqlContext: SQLContext

    Definition Classes
    Dataset → Queryable
  80. def subtract(other: Dataset[T]): Dataset[T]

    Returns a new Dataset where any elements present in other have been removed.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  81. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  82. def take(num: Int): Array[T]

    Returns the first num elements of this Dataset as an array.

    Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  83. def takeAsList(num: Int): List[T]

    Returns the first num elements of this Dataset as a Java List.

    Running takeAsList requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  84. def toDF(): DataFrame

    Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.

  85. def toDS(): Dataset[T]

    Returns this Dataset.

    Since

    1.6.0

  86. def toString(): String

    Definition Classes
    Queryable → AnyRef → Any
  87. def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]

    Concise syntax for chaining custom transformations.

    def featurize(ds: Dataset[T]) = ...
    
    dataset
      .transform(featurize)
      .transform(...)
    Since

    1.6.0

  88. def union(other: Dataset[T]): Dataset[T]

    Returns a new Dataset that contains the elements of both this and the other Dataset combined.

    Note that this function is not a typical set union operation, in that it does not eliminate duplicate items. As such, it is analogous to UNION ALL in SQL.

    Since

    1.6.0

  89. def unpersist(): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    Since

    1.6.0

  90. def unpersist(blocking: Boolean): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    blocking

    Whether to block until all blocks are deleted.

    Since

    1.6.0

  91. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  92. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  93. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from Logging

Inherited from Serializable

Inherited from Serializable

Inherited from Queryable

Inherited from AnyRef

Inherited from Any
