sql

package sql

Package Members

package api
package catalog
package catalyst
package connector
package execution
package expressions
package internal
package streaming
package types
package util

Type Members

class AnalysisException extends Exception with SparkThrowable with Serializable with WithOrigin
Thrown when a query fails to analyze, usually because the query itself is invalid.
Annotations
@Stable()
Since
1.3.0

class Column extends Logging

A column that will be computed based on the data in a DataFrame.

A new column can be constructed based on the input columns present in a DataFrame:

df("columnName")            // On a specific `df` DataFrame.
col("columnName")           // A generic column not yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala shorthand for a named column.

Column objects can be composed to form complex expressions:

$"a" + 1
$"a" === $"b"

Annotations: @Stable()
Since: 1.3.0
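
As a minimal sketch of how composed columns behave, the example below builds two expressions and evaluates them against a small DataFrame. The object name `ColumnCompositionSketch`, the column names `a` and `b`, and the sample data are assumptions made for illustration; a `SparkSession` is assumed to be supplied by the caller.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper for illustration only.
object ColumnCompositionSketch {
  def run(spark: SparkSession): Seq[(Int, Boolean)] = {
    import spark.implicits._

    // Columns are expressions; nothing is computed until an action runs.
    val plusOne: Column = col("a") + 1
    val same: Column = col("a") === col("b")

    val df = Seq((1, 1), (2, 5)).toDF("a", "b")
    df.select(plusOne.as("a_plus_1"), same.as("a_eq_b"))
      .collect()
      .map(r => (r.getInt(0), r.getBoolean(1)))
      .toSeq
  }
}
```

Because `===` builds an expression instead of comparing references, the result of `same` is itself a Column that can be selected or used as a filter condition.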

class ColumnName extends Column
A convenient class used for constructing schema.
Annotations
@Stable()
Since
1.3.0
trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]]
Trait to restrict calls to create and replace operations.
Since
3.0.0
abstract class DataFrameWriter[T] extends AnyRef
Interface used to write an org.apache.spark.sql.api.Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
Annotations
@Stable()
Since
1.4.0
abstract class DataFrameWriterV2[T] extends CreateTableWriter[T]
Interface used to write an org.apache.spark.sql.api.Dataset to external storage using the v2 API.
Annotations
@Experimental()
Since
3.0.0
trait Encoder[T] extends Serializable
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
Scala
Encoders are generally created automatically through implicits from a SparkSession, or can be explicitly created by calling static methods on Encoders.
```
import spark.implicits._

val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
```
Java
Encoders are specified by calling static methods on Encoders.
```
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());
```
Encoders can be composed into tuples:
```
Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2<>(1, "a"));
Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
```
Or constructed from Java Beans:
```
Encoders.bean(MyClass.class);
```
Implementation
- Encoders should be thread-safe.
Annotations
@implicitNotFound()
Since
1.6.0
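
The Scala notes above mention that encoders can also be created explicitly through static methods on Encoders. A minimal sketch of that path follows; the case class `Person`, the object name `EncoderSketch`, and the sample values are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical case class used only for this illustration.
case class Person(name: String, age: Int)

object EncoderSketch {
  def run(spark: SparkSession): (Seq[String], Seq[Person]) = {
    // Explicit encoders, created without importing spark.implicits._
    val stringEncoder: Encoder[String] = Encoders.STRING
    val personEncoder: Encoder[Person] = Encoders.product[Person]

    // createDataset takes the encoder as its (normally implicit) second argument list.
    val names: Dataset[String] = spark.createDataset(Seq("abc", "xyz"))(stringEncoder)
    val people: Dataset[Person] = spark.createDataset(Seq(Person("Ann", 30)))(personEncoder)

    (names.collect().toSeq, people.collect().toSeq)
  }
}
```

Passing the encoder explicitly can be useful in generic code where no implicit encoder is in scope.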
abstract class ForeachWriter[T] extends Serializable
The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.
- A single instance of this class is responsible for all the data generated by a single task in a query. In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.
- Any implementation of this class must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
- The lifecycle of the methods is as follows.
```
For each partition with `partitionId`:
  For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
    Method `open(partitionId, epochId)` is called.
    If `open` returns true:
      For each row in the partition and batch/epoch, method `process(row)` is called.
    Method `close(errorOrNull)` is called with error (if any) seen while processing rows.
```
Important points to note:
- Spark doesn't guarantee the same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). For example, the source may provide a different number of partitions for some reason, or a Spark optimization may change the number of partitions. Refer to SPARK-28650 for more details. If you need deduplication on output, try foreachBatch instead.
- The close() method will be called if the open() method returns successfully (irrespective of the return value), except if the JVM crashes in the middle.
Scala example:
```
datasetOfString.writeStream.foreach(new ForeachWriter[String] {

  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
    true // returning true tells Spark to call process() for this partition
  }

  def process(record: String): Unit = {
    // write string to connection
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
```
Java example:
```
datasetOfString.writeStream().foreach(new ForeachWriter<String>() {

  @Override
  public boolean open(long partitionId, long version) {
    // open connection
    return true; // returning true tells Spark to call process() for this partition
  }

  @Override
  public void process(String value) {
    // write string to connection
  }

  @Override
  public void close(Throwable errorOrNull) {
    // close the connection
  }
});
```
Since
2.0.0
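
The lifecycle described above can be sketched without starting a streaming query by driving a writer by hand. In the example below, `LoggingWriter` and `ForeachWriterLifecycle.drive` are hypothetical names introduced for illustration; `drive` only imitates the order of calls documented above for one partition of one epoch and is not Spark's actual implementation.

```scala
import org.apache.spark.sql.ForeachWriter
import scala.collection.mutable.ArrayBuffer

// Hypothetical writer that records each lifecycle call it receives.
class LoggingWriter extends ForeachWriter[String] {
  // Initialized in open(), not in the constructor, because each task
  // works on a freshly deserialized copy of the writer.
  @transient private var log: ArrayBuffer[String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    log = ArrayBuffer(s"open($partitionId, $epochId)")
    true
  }

  override def process(value: String): Unit = log += s"process($value)"

  override def close(errorOrNull: Throwable): Unit =
    log += (if (errorOrNull == null) "close(ok)" else s"close(${errorOrNull.getMessage})")

  def events: Seq[String] = log.toSeq
}

object ForeachWriterLifecycle {
  // Imitates the documented call order for a single (partitionId, epochId).
  def drive(writer: ForeachWriter[String], partitionId: Long, epochId: Long,
            rows: Seq[String]): Unit = {
    var error: Throwable = null
    val opened = writer.open(partitionId, epochId)
    try {
      if (opened) rows.foreach(writer.process)
    } catch {
      case t: Throwable => error = t; throw t
    } finally {
      writer.close(error) // called whether open returned true or false
    }
  }
}
```

Keeping connection-like state out of the constructor, as `log` is here, follows the recommendation above to initialize resources only once `open` has been called.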
abstract class MergeIntoWriter[T] extends AnyRef
MergeIntoWriter provides methods to define and execute merge actions based on specified conditions.
Please note that schema evolution is disabled by default.
T
the type of data in the Dataset.
Annotations
@Experimental()
Since
4.0.0
class Observation extends AnyRef
Helper class to simplify usage of Dataset.observe(String, Column, Column*):
```
// Observe row count (rows) and highest id (maxid) in the Dataset while writing it
val observation = Observation("my metrics")
val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
observed_ds.write.parquet("ds.parquet")
val metrics = observation.get
```
This collects the metrics while the first action is executed on the observed dataset. Subsequent actions do not modify the metrics returned by get. Retrieval of the metric via get blocks until the first action has finished and metrics become available.
This class does not support streaming datasets.
Since
3.3.0
trait Row extends Serializable
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access.
It is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
```
import org.apache.spark.sql._

// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
```
A value of a row can be accessed through both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. An example of generic access by ordinal:
```
import org.apache.spark.sql._

val row = Row(1, true, "a string", null)
// row: Row = [1,true,a string,null]
val firstValue = row(0)
// firstValue: Any = 1
val fourthValue = row(3)
// fourthValue: Any = null
```
For native primitive access, it is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null. An example of native primitive access:
```
// using the row from the previous example.
val firstValue = row.getInt(0)
// firstValue: Int = 1
val isNull = row.isNullAt(3)
// isNull: Boolean = true
```
In Scala, fields in a Row object can be extracted in a pattern match. Example:
```
import org.apache.spark.sql._

val pairs = sql("SELECT key, value FROM src").rdd.map {
  case Row(key: Int, value: String) =>
    key -> value
}
```
Annotations
@Stable()
Since
1.3.0
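
One possible way to follow the isNullAt rule consistently is to wrap primitive access in a small helper that returns an Option. The helper below is a sketch; `RowAccessSketch` and `intAt` are hypothetical names and are not part of the Row API.

```scala
import org.apache.spark.sql.Row

// Hypothetical helper for illustration only.
object RowAccessSketch {
  // Checks isNullAt first, so getInt is never called on a null field.
  def intAt(row: Row, i: Int): Option[Int] =
    if (row.isNullAt(i)) None else Some(row.getInt(i))
}
```

With `val row = Row(1, null)`, `RowAccessSketch.intAt(row, 0)` yields `Some(1)` and `RowAccessSketch.intAt(row, 1)` yields `None`.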
class RowFactory extends AnyRef
A factory class used to construct Row objects.
Annotations
@Stable()
Since
1.3.0
sealed final class SaveMode extends Enum[SaveMode]
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
Annotations
@Stable()
Since
1.3.0
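
As a rough illustration of how the save mode changes the outcome of a write, the sketch below writes the same three-row DataFrame twice to a temporary Parquet directory, first with SaveMode.Overwrite and then with SaveMode.Append. The object name `SaveModeSketch`, the temporary path, and the sample data are assumptions made for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical helper for illustration only.
object SaveModeSketch {
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("id")

    // Temporary output location chosen for this illustration.
    val path = Files.createTempDirectory("save-mode-sketch").resolve("out").toString

    // Overwrite replaces whatever data already exists at the path.
    df.write.mode(SaveMode.Overwrite).parquet(path)
    val afterOverwrite = spark.read.parquet(path).count()

    // Append adds the new rows to the existing data.
    df.write.mode(SaveMode.Append).parquet(path)
    val afterAppend = spark.read.parquet(path).count()

    (afterOverwrite, afterAppend)
  }
}
```

The remaining modes, SaveMode.ErrorIfExists (the default) and SaveMode.Ignore, respectively fail or silently skip the write when data already exists.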
class TypedColumn[-T, U] extends Column
A Column where an Encoder has been given for the expected input and return type. To create a TypedColumn, use the as function on a Column.
T
The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
U
The output type of this column.
Annotations
@Stable()
Since
1.6.0
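
A minimal sketch of creating a TypedColumn with the as function and using it in a typed select follows. The object name `TypedColumnSketch`, the column names, and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession, TypedColumn}
import org.apache.spark.sql.functions.col

// Hypothetical helper for illustration only.
object TypedColumnSketch {
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

    // as[String] attaches an Encoder[String]; the input type is Any because
    // the expression is checked by the analyzer, not by the compiler.
    val name: TypedColumn[Any, String] = col("name").as[String]

    // Selecting a TypedColumn yields a Dataset[String] instead of a DataFrame.
    val names: Dataset[String] = df.select(name)
    names.collect().toSeq
  }
}
```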
case class WhenMatched[T] extends Product with Serializable
A class for defining actions to be taken when matching rows in a DataFrame during a merge operation.
T
The type of data in the MergeIntoWriter.
case class WhenNotMatched[T] extends Product with Serializable
A class for defining actions to be taken when no matching rows are found in a DataFrame during a merge operation.
T
The type of data in the MergeIntoWriter.
case class WhenNotMatchedBySource[T] extends Product with Serializable
A class for defining actions to be performed when there is no match by source during a merge operation in a MergeIntoWriter.
T
the type parameter for the MergeIntoWriter.
trait WriteConfigMethods[R] extends AnyRef
Configuration methods common to create/replace operations and insert/overwrite operations.
R
builder type to return
Since
3.0.0

Value Members

object Observation
(Scala-specific) Create instances of Observation via Scala apply.
Since
3.3.0
object Row extends Serializable
Annotations
@Stable()
Since
1.3.0
object functions
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
You can call the functions defined here in two ways: _FUNC_(...) and functions.expr("_FUNC_(...)").
As an example, regr_count is a function that is defined here. You can use regr_count(col("yCol"), col("xCol")) to invoke the regr_count function. This way the programming language's compiler ensures regr_count exists and is of the proper form. You can also use the expr("regr_count(yCol, xCol)") function to invoke the same function. In this case, Spark itself will ensure regr_count exists when it analyzes the query.
You can find the entire list of functions in the SQL API documentation of your Spark version; see also the latest list.
These function APIs usually have methods with a Column signature only, because a Column signature can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
Annotations
@Stable()
Since
1.3.0
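
The two invocation styles described above can be compared side by side. The sketch below assumes a Spark version in which regr_count is available in functions (3.5.0 or later); the object name `FunctionsSketch` and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, regr_count}

// Hypothetical helper for illustration only.
object FunctionsSketch {
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq((1.0, 2.0), (2.0, 3.0), (3.0, 4.0)).toDF("yCol", "xCol")

    val row = df.agg(
      // Checked by the Scala compiler: regr_count must exist with this signature.
      regr_count(col("yCol"), col("xCol")).as("compiled"),
      // Checked by Spark when it analyzes the query.
      expr("regr_count(yCol, xCol)").as("parsed")
    ).head()

    (row.getLong(0), row.getLong(1))
  }
}
```

Both columns count the rows in which yCol and xCol are non-null, so the two values should agree.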
