package sql
Type Members
- class Column extends Logging
A column that will be computed based on the data in a DataFrame.
A new column can be constructed based on the input columns present in a DataFrame:
df("columnName") // On a specific `df` DataFrame. col("columnName") // A generic column not yet associated with a DataFrame. col("columnName.field") // Extracting a struct field col("`a.column.with.dots`") // Escape `.` in column names. $"columnName" // Scala short hand for a named column.
Column objects can be composed to form complex expressions:
$"a" + 1
- Since
3.4.0
- class ColumnName extends Column
A convenient class used for constructing schema.
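For instance, the $-interpolated columns provided by spark.implicits are ColumnNames, whose typed helpers (such as .long and .string) yield StructFields for schema construction. A minimal sketch, assuming a SparkSession named spark is in scope and illustrative column names:
import org.apache.spark.sql.types.StructType
// Assumes an existing SparkSession `spark`; spark.implicits._ provides the $"..." syntax.
import spark.implicits._
// Each $"..." is a ColumnName; its typed helpers return StructFields.
val schema = new StructType()
  .add($"id".long)
  .add($"name".string)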
- Since
3.4.0
- trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]]
Trait to restrict calls to create and replace operations.
- Since
3.4.0
- type DataFrame = Dataset[Row]
- final class DataFrameNaFunctions extends AnyRef
Functionality for working with missing data in DataFrames.
- Since
3.4.0
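For reference, DataFrameNaFunctions is reached through Dataset.na. A minimal sketch, assuming a DataFrame df with illustrative columns "age" and "name":
val dropped  = df.na.drop()                                   // drop rows containing any null or NaN
val filled   = df.na.fill(0L, Seq("age"))                     // replace nulls in "age" with 0
val replaced = df.na.replace("name", Map("N/A" -> "unknown")) // substitute placeholder values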
- class DataFrameReader extends Logging
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this.
- Annotations
- @Stable()
- Since
3.4.0
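A minimal sketch of the reader API, assuming a SparkSession named spark and illustrative paths:
// Generic form: choose a format, set options, then load.
val users = spark.read
  .format("json")
  .option("multiLine", "true")
  .load("/data/users.json")
// Format-specific shorthands are also available.
val events = spark.read.parquet("/data/events")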
- final class DataFrameStatFunctions extends AnyRef
Statistic functions for DataFrames.
- Since
3.4.0
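For reference, DataFrameStatFunctions is reached through Dataset.stat. A minimal sketch, assuming a DataFrame df with numeric columns "x" and "y" (both illustrative):
val correlation = df.stat.corr("x", "y")                                    // Pearson correlation
val quantiles   = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01) // approximate quartiles
val crossTab    = df.stat.crosstab("x", "y")                                // contingency table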
- final class DataFrameWriter[T] extends AnyRef
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
- Annotations
- @Stable()
- Since
3.4.0
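A minimal sketch of the writer API, assuming a DataFrame df with a "date" column and an illustrative output path:
df.write
  .format("parquet")
  .mode("overwrite")        // other modes: append, ignore, errorifexists
  .partitionBy("date")
  .save("/data/output")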
- final class DataFrameWriterV2[T] extends CreateTableWriter[T]
Interface used to write an org.apache.spark.sql.Dataset to external storage using the v2 API.
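A minimal sketch of the v2 write path, assuming a DataFrame df, an illustrative catalog table name, and a "date" column:
import org.apache.spark.sql.functions.col
df.writeTo("catalog.db.events")
  .using("parquet")
  .partitionedBy(col("date"))
  .createOrReplace()        // create/replace calls are restricted via CreateTableWriter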
- Annotations
- @Experimental()
- Since
3.4.0
- class Dataset[T] extends Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person]                                     // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING);
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
val ageCol = people("age")          // in Scala
Column ageCol = people.col("age");  // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10          // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
- Since
3.4.0
- case class DatasetHolder[T] extends Product with Serializable
A container for a Dataset, used for implicit conversions in Scala.
To use this, import implicit conversions in SQL:
val spark: SparkSession = ...
import spark.implicits._
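With those implicits in scope, local collections convert to Datasets and DataFrames through DatasetHolder; a minimal sketch with illustrative data:
// toDS() and toDF() are provided by DatasetHolder via the implicit conversion.
val ds = Seq(1, 2, 3).toDS()                                  // Dataset[Int]
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")  // DataFrame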
- Since
3.4.0
- abstract class ForeachWriter[T] extends Serializable
The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.
- A single instance of this class is responsible for all the data generated by a single task in a query. In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.
- Any implementation of this class must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
- The lifecycle of the methods is as follows.
For each partition with `partitionId`:
    For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
        Method `open(partitionId, epochId)` is called.
        If `open` returns true:
            For each row in the partition and batch/epoch, method `process(row)` is called.
        Method `close(errorOrNull)` is called with error (if any) seen while processing rows.
Important points to note:
- Spark doesn't guarantee the same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId), e.g. the source provides a different number of partitions for some reason, a Spark optimization changes the number of partitions, etc. Refer to SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
- The close() method will be called if the open() method returns successfully (irrespective of the return value), except if the JVM crashes in the middle.
Scala example:
datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
  }
  def process(record: String) = {
    // write string to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
Java example:
datasetOfString.writeStream().foreach(new ForeachWriter<String>() {
  @Override public boolean open(long partitionId, long version) {
    // open connection
  }
  @Override public void process(String value) {
    // write string to connection
  }
  @Override public void close(Throwable errorOrNull) {
    // close the connection
  }
});
- Since
3.5.0
- class KeyValueGroupedDataset[K, V] extends Serializable
A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
- Since
3.5.0
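A minimal sketch, assuming a case class Person(name: String, age: Int), a Dataset[Person] named people, and spark.implicits._ in scope (all illustrative):
val byName = people.groupByKey(_.name)          // KeyValueGroupedDataset[String, Person]
val counts = byName.count()                     // Dataset[(String, Long)]
val oldest = byName.mapGroups((name, ps) => (name, ps.map(_.age).max))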
- trait LowPrioritySQLImplicits extends AnyRef
Lower priority implicit methods for converting Scala objects into Datasets. Conflicting implicits are placed here to disambiguate resolution.
Reasons for including specific implicits: newProductEncoder - to disambiguate for Lists, which are both Seq and Product.
- class Observation extends ObservationBase
- class RelationalGroupedDataset extends AnyRef
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
- Since
3.4.0
- Note
This class was named GroupedData in Spark 1.x.
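A minimal sketch, assuming a DataFrame named employees with illustrative dept, gender, salary and age columns:
import org.apache.spark.sql.functions.{avg, max}
val grouped = employees.groupBy("dept", "gender")   // RelationalGroupedDataset
val summary = grouped.agg(avg("salary"), max("age"))
val pivoted = employees.groupBy("dept").pivot("gender").count()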
- class RuntimeConfig extends Logging
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
- Since
3.4.0
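A minimal sketch, assuming a SparkSession named spark:
spark.conf.set("spark.sql.shuffle.partitions", "64")            // set a runtime option
val partitions = spark.conf.get("spark.sql.shuffle.partitions") // read it back
val ansiOption = spark.conf.getOption("spark.sql.ansi.enabled") // Option[String] if unset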
- abstract class SQLImplicits extends LowPrioritySQLImplicits
A collection of implicit methods for converting names and Symbols into Columns, and for converting common Scala objects into Datasets.
- class SparkSession extends Serializable with Closeable with Logging
The entry point to programming Spark with the Dataset and DataFrame API.
The entry point to programming Spark with the Dataset and DataFrame API.
In environments where this has been created upfront (e.g. REPL, notebooks), use the builder to get an existing session:
SparkSession.builder().getOrCreate()
The builder can also be used to create a new session:
SparkSession.builder
  .remote("sc://localhost:15001/myapp")
  .getOrCreate()
- class TypedColumn[-T, U] extends Column
A Column where an Encoder has been given for the expected input and return type. To create a TypedColumn, use the as function on a Column.
- T
The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
- U
The output type of this column.
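A minimal sketch, assuming a Dataset named sales with a Long column "amount" and spark.implicits._ in scope (both illustrative):
import org.apache.spark.sql.functions.sum
val total  = sum($"amount").as[Long]  // TypedColumn[Any, Long]
val result = sales.select(total)      // Dataset[Long]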
- Since
3.4.0
- class UDFRegistration extends Logging
Functions for registering user-defined functions. Use SparkSession.udf to access this: spark.udf
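A minimal sketch, assuming a SparkSession named spark; the UDF name and logic are illustrative:
spark.udf.register("plusOne", (x: Int) => x + 1)
spark.sql("SELECT plusOne(id) FROM range(5)").show()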
- Since
3.5.0
- trait WriteConfigMethods[R] extends AnyRef
Configuration methods common to create/replace operations and insert/overwrite operations.
- R
builder type to return
- Since
3.4.0
Value Members
- object Column
- object Encoders
Methods for creating an Encoder.
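A few common ways to obtain an Encoder; the Person case class is illustrative:
import org.apache.spark.sql.{Encoder, Encoders}
case class Person(name: String, age: Int)
val stringEnc: Encoder[String]        = Encoders.STRING
val pairEnc:   Encoder[(String, Int)] = Encoders.tuple(Encoders.STRING, Encoders.scalaInt)
val personEnc: Encoder[Person]        = Encoders.product[Person]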
- Since
3.5.0
- object Observation
(Scala-specific) Create instances of Observation via Scala apply.
- Since
4.0.0
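A minimal sketch of named observed metrics, assuming a DataFrame df; the metric name is illustrative, and get blocks until an action has materialized the metrics:
import org.apache.spark.sql.functions.{count, lit}
val observation = Observation("row_stats")
val observed    = df.observe(observation, count(lit(1)).as("rows"))
observed.collect()            // an action must run before the metrics are available
val metrics = observation.get // Map[String, Any], e.g. Map("rows" -> ...)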
- object SparkSession extends Logging with Serializable
- object functions
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the functions.expr() API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation of your Spark version; see also the latest list at https://spark.apache.org/docs/latest/api/sql/index.html.
As an example, isnan is a function that is defined here. You can use isnan(col("myCol")) to invoke the isnan function. This way the programming language's compiler ensures isnan exists and is of the proper form. You can also use the expr("isnan(myCol)") function to invoke the same function. In this case, Spark itself will ensure isnan exists when it analyzes the query.
regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
These function APIs usually have methods with a Column signature only, because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
- Since
3.4.0
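A minimal sketch contrasting the two invocation styles described above, assuming a DataFrame df with a numeric column "myCol":
import org.apache.spark.sql.functions.{col, expr, isnan}
val compileChecked  = df.select(isnan(col("myCol")))  // the compiler verifies isnan exists
val analysisChecked = df.select(expr("isnan(myCol)")) // Spark verifies isnan during query analysis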