package sql

Allows the execution of relational queries, including those expressed in SQL using Spark.

Linear Supertypes

AnyRef, Any

Package Members

package artifact
package catalog
package catalyst
package columnar
package connector
package execution
The physical execution component of Spark SQL. Note that this is a private package. All classes in this package are considered an internal API to Spark SQL and are subject to change between minor releases.
package expressions
package internal
All classes in this package are considered an internal API to Spark and are subject to change between minor releases.
package jdbc
package scripting
package sources
A set of APIs for adding data sources to Spark SQL.
package streaming
package util

Type Members

type DataFrame = Dataset[Row]
final class DataFrameNaFunctions extends sql.api.DataFrameNaFunctions[Dataset]
Functionality for working with missing data in DataFrames.
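As a rough illustration of what handling missing data can look like, the sketch below drops or fills null values through the na accessor on a DataFrame. The sample data, the column names, and the object name NaFunctionsSketch are assumptions made for this example only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaFunctionsSketch {
  // Hypothetical sample data: one row has a null name, another a null age.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Option[String], Option[Int])](
      (Some("Alice"), Some(29)),
      (None, Some(41)),
      (Some("Carol"), None)
    ).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("na-sketch").getOrCreate()
    val df = sample(spark)

    // Drop every row that contains at least one null value.
    df.na.drop().show()

    // Replace nulls column by column: a string for "name", a number for "age".
    df.na.fill("unknown", Seq("name")).na.fill(0L, Seq("age")).show()

    spark.stop()
  }
}
```

Chaining fill calls per column keeps the replacement value for each column explicit.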
Annotations
@Stable()
Since
1.3.1
class DataFrameReader extends sql.api.DataFrameReader[Dataset]
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc.). Use SparkSession.read to access this.
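A minimal sketch of a typical read, assuming a small CSV file with a header row. The temporary file, its contents, and the names ReaderSketch and readPeople are hypothetical and exist only to make the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReaderSketch {
  // Configure the reader with a format and options, then load from a path.
  def readPeople(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark guess the column types
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("reader-sketch").getOrCreate()

    // Hypothetical input file, written here so the sketch has something to read.
    val file = Files.createTempFile("people", ".csv")
    Files.write(file, "name,age\nAlice,29\nBob,41\n".getBytes("UTF-8"))

    val df = readPeople(spark, file.toString)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```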
Annotations
@Stable()
Since
1.4.0
final class DataFrameStatFunctions extends sql.api.DataFrameStatFunctions[Dataset]
Statistic functions for DataFrames.
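The sketch below shows two of the statistic functions reachable through the stat accessor, corr and approxQuantile, on a small made-up DataFrame. The data, the column names, and the object name StatFunctionsSketch are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StatFunctionsSketch {
  // Hypothetical sample data in which y is exactly twice x.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)).toDF("x", "y")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("stat-sketch").getOrCreate()
    val df = sample(spark)

    // Pearson correlation coefficient between two numeric columns.
    val r = df.stat.corr("x", "y")
    println(s"corr(x, y) = $r")

    // Approximate quantiles; a relative error of 0.0 requests exact results.
    val quantiles = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.0)
    println(quantiles.mkString(", "))

    spark.stop()
  }
}
```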
Annotations
@Stable()
Since
1.4.0
class DataSourceRegistration extends Logging
Functions for registering user-defined data sources. Use SparkSession.dataSource to access this.
Annotations
@Evolving()
class Dataset[T] extends sql.api.Dataset[T, Dataset]
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
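To make the laziness concrete, the following sketch builds a transformation chain, prints its plans with explain, and only then triggers execution with an action. The sample data and the names LazySketch and adults are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LazySketch {
  // Only transformations here: this describes a plan and runs no job.
  def adults(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Alice", 29), ("Bob", 41))
      .toDF("name", "age")
      .filter($"age" > 30)
      .select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("lazy-sketch").getOrCreate()
    val df = adults(spark)

    // Print the logical plans and the physical plan without executing the query.
    df.explain(true)

    // count is an action, so this is where the computation actually happens.
    println(df.count())

    spark.stop()
  }
}
```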
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficient data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
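As a small sketch of the encoder idea, the example below derives an encoder for a Person case class with Encoders.product and inspects the schema it maps to. The Person definition and the name EncoderSketch are assumptions that mirror the description above. In everyday code the encoder is usually supplied implicitly through spark.implicits._.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical domain class matching the Person described in the text.
case class Person(name: String, age: Int)

object EncoderSketch {
  // Derive the encoder explicitly instead of relying on spark.implicits._.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("encoder-sketch").getOrCreate()

    // The schema lists the columns that Person's fields are mapped to.
    personEncoder.schema.printTreeString()

    // Pass the encoder explicitly when creating the Dataset.
    val people = spark.createDataset(Seq(Person("Alice", 29), Person("Bob", 41)))(personEncoder)
    people.show()

    spark.stop()
  }
}
```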
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
```
val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
```
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map to the existing one:
```
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
  (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
```
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
```
val ageCol = people("age")  // in Scala
Column ageCol = people.col("age"); // in Java
```
Note that the Column type can also be manipulated through its various functions.
```
// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java
```
A more concrete example in Scala:
```
import org.apache.spark.sql.functions.{avg, max}

// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))
```
and in Java:
```
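import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.max;

// To create Dataset<Row> using SparkSession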
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
```
Annotations
@Stable()
Since
1.6.0
case class DatasetHolder[T] extends Product with Serializable
A container for a Dataset, used for implicit conversions in Scala.
To use this, import implicit conversions in SQL:
```
val spark: SparkSession = ...
import spark.implicits._
```
Annotations
@Stable()
Since
1.6.0
class ExperimentalMethods extends AnyRef
:: Experimental :: Holder for experimental methods for the bravest. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
```
spark.experimental.extraStrategies += ...
```
Annotations
@Experimental() @Unstable()
Since
1.3.0
trait ExtendedExplainGenerator extends AnyRef
A trait for a session extension to implement that provides additional explain plan information.
Annotations
@DeveloperApi() @Since("4.0.0")
class KeyValueGroupedDataset[K, V] extends sql.api.KeyValueGroupedDataset[K, V, Dataset]
A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
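A minimal sketch of obtaining a KeyValueGroupedDataset through groupByKey and reducing each group with mapGroups. The Sale case class, the sample data, and the names GroupByKeySketch and totals are hypothetical.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, SparkSession}

// Hypothetical record type used only for this sketch.
case class Sale(region: String, amount: Double)

object GroupByKeySketch {
  // Sum the amounts per region using the typed grouping API.
  def totals(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val sales = Seq(Sale("eu", 10.0), Sale("us", 5.0), Sale("eu", 2.5)).toDS()

    // groupByKey returns a KeyValueGroupedDataset keyed by the function's result.
    val byRegion: KeyValueGroupedDataset[String, Sale] = sales.groupByKey(_.region)

    // mapGroups sees each key together with an iterator over that key's rows.
    byRegion
      .mapGroups((region, rows) => (region, rows.map(_.amount).sum))
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("group-by-key-sketch").getOrCreate()
    println(totals(spark))
    spark.stop()
  }
}
```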
Since
2.0.0
trait LowPrioritySQLImplicits extends AnyRef
Lower priority implicit methods for converting Scala objects into Datasets. Conflicting implicits are placed here to disambiguate resolution.
Reasons for including specific implicits: newProductEncoder - to disambiguate for Lists which are both Seq and Product
class RelationalGroupedDataset extends sql.api.RelationalGroupedDataset[Dataset]
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. For convenience, this class also contains some first-order statistics such as mean and sum.
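The sketch below shows one variant of agg next to the mean convenience method on a grouped DataFrame. The sample data, the column names, and the names GroupedSketch and summarize are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.{avg, max}

object GroupedSketch {
  // Hypothetical sample data: department, salary, age.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("eng", 100.0, 30), ("eng", 200.0, 40), ("ops", 50.0, 25))
      .toDF("dept", "salary", "age")
  }

  // groupBy yields a RelationalGroupedDataset; agg turns it back into a DataFrame.
  def summarize(df: DataFrame): DataFrame = {
    val grouped: RelationalGroupedDataset = df.groupBy("dept")
    grouped.agg(avg("salary").as("avg_salary"), max("age").as("max_age"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("grouped-sketch").getOrCreate()
    val df = sample(spark)

    summarize(df).show()

    // Convenience statistic: the average of a numeric column per group.
    df.groupBy("dept").mean("salary").show()

    spark.stop()
  }
}
```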
Annotations
@Stable()
Since
2.0.0
Note
This class was named GroupedData in Spark 1.x.
class RuntimeConfig extends AnyRef
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
Options set here are automatically propagated to the Hadoop configuration during I/O.
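A short sketch of reading and writing runtime options through SparkSession.conf. The key spark.example.unset.key is a made-up name used to show the behaviour for a missing option, and the names RuntimeConfigSketch and shufflePartitions are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RuntimeConfigSketch {
  // Set an option at runtime and read it back as a string.
  def shufflePartitions(spark: SparkSession): String = {
    spark.conf.set("spark.sql.shuffle.partitions", "8")
    spark.conf.get("spark.sql.shuffle.partitions")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("conf-sketch").getOrCreate()

    println(shufflePartitions(spark))

    // getOption avoids an exception when a key has not been set.
    println(spark.conf.getOption("spark.example.unset.key"))

    // Not every option can be changed on a running session.
    println(spark.conf.isModifiable("spark.sql.shuffle.partitions"))

    spark.stop()
  }
}
```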
Annotations
@Stable()
Since
2.0.0
class SQLContext extends Logging with Serializable
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
Annotations
@Stable()
Since
1.0.0
abstract class SQLImplicits extends LowPrioritySQLImplicits
A collection of implicit methods for converting common Scala objects into Datasets.
Since
1.6.0
class SparkSession extends sql.api.SparkSession[Dataset] with Logging
The entry point to programming Spark with the Dataset and DataFrame API.
In environments where a session has already been created upfront (e.g. REPL, notebooks), use the builder to get the existing session:
```
SparkSession.builder().getOrCreate()
```
The builder can also be used to create a new session:
```
SparkSession.builder
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
```
Annotations
@Stable()
class SparkSessionExtensions extends AnyRef
:: Experimental :: Holder for injection points to the SparkSession. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
This currently provides the following extension points:
- Analyzer Rules.
- Check Analysis Rules.
- Cache Plan Normalization Rules.
- Optimizer Rules.
- Pre CBO Rules.
- Planning Strategies.
- Customized Parser.
- (External) Catalog listeners.
- Columnar Rules.
- Adaptive Query Post Planner Strategy Rules.
- Adaptive Query Stage Preparation Rules.
- Adaptive Query Execution Runtime Optimizer Rules.
- Adaptive Query Stage Optimizer Rules.
The extensions can be used by calling withExtensions on the SparkSession.Builder, for example:
```
SparkSession.builder()
  .master("...")
  .config("...", true)
  .withExtensions { extensions =>
    extensions.injectResolutionRule { session =>
      ...
    }
    extensions.injectParser { (session, parser) =>
      ...
    }
  }
  .getOrCreate()
```
The extensions can also be used by setting the Spark SQL configuration property spark.sql.extensions. Multiple extensions can be set using a comma-separated list. For example:
```
SparkSession.builder()
  .master("...")
  .config("spark.sql.extensions", "org.example.MyExtensions,org.example.YourExtensions")
  .getOrCreate()

class MyExtensions extends Function1[SparkSessionExtensions, Unit] {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectResolutionRule { session =>
      ...
    }
    extensions.injectParser { (session, parser) =>
      ...
    }
  }
}

class YourExtensions extends SparkSessionExtensionsProvider {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectResolutionRule { session =>
      ...
    }
    extensions.injectFunction(...)
  }
}
```
Note that none of the injected builders should assume that the SparkSession is fully initialized and should not touch the session's internals (e.g. the SessionState).
Annotations
@DeveloperApi() @Experimental() @Unstable()

trait SparkSessionExtensionsProvider extends (SparkSessionExtensions) => Unit
:: Unstable :: Base trait for implementations used by SparkSessionExtensions.
For example, suppose we have an external function named Age that we want to register as an extension for SparkSession:

```
package org.apache.spark.examples.extensions

import org.apache.spark.sql.catalyst.expressions.{CurrentDate, Expression, RuntimeReplaceable, SubtractDates}

case class Age(birthday: Expression, child: Expression) extends RuntimeReplaceable {
  def this(birthday: Expression) = this(birthday, SubtractDates(CurrentDate(), birthday))
  override def exprsReplaced: Seq[Expression] = Seq(birthday)
  override protected def withNewChildInternal(newChild: Expression): Expression = copy(newChild)
}
```

We then need to create an extension that inherits from SparkSessionExtensionsProvider. For example:

```
package org.apache.spark.examples.extensions

import org.apache.spark.sql.{SparkSessionExtensions, SparkSessionExtensionsProvider}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}

class MyExtensions extends SparkSessionExtensionsProvider {
  override def apply(v1: SparkSessionExtensions): Unit = {
    v1.injectFunction(
      (new FunctionIdentifier("age"),
        new ExpressionInfo(classOf[Age].getName, "age"),
        (children: Seq[Expression]) => new Age(children.head)))
  }
}
```

Then, we can inject MyExtensions in three ways:
- withExtensions of SparkSession.Builder
- Config - spark.sql.extensions
- java.util.ServiceLoader - Add to src/main/resources/META-INF/services/org.apache.spark.sql.SparkSessionExtensionsProvider

Annotations
@DeveloperApi() @Unstable() @Since("3.2.0")
Since
3.2.0
See also
SparkSessionExtensions
SparkSession.Builder
java.util.ServiceLoader

type Strategy = SparkStrategy
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
Annotations
@DeveloperApi() @Unstable()
class UDFRegistration extends sql.api.UDFRegistration with Logging
Functions for registering user-defined functions. Use SparkSession.udf to access this:
```
spark.udf
```
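As a rough sketch of what registration can look like, the example below registers a Scala function under a SQL name and calls it from a query. The function name plus_one and the names UdfSketch and answer are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object UdfSketch {
  // Register a one-argument function, then invoke it by name from SQL.
  def answer(spark: SparkSession): Int = {
    spark.udf.register("plus_one", (x: Int) => x + 1)
    spark.sql("SELECT plus_one(41) AS answer").head().getInt(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("udf-sketch").getOrCreate()
    println(answer(spark))
    spark.stop()
  }
}
```

Once registered, the function is available to SQL queries run on the session it was registered with.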
Annotations
@Stable()
Since
1.3.0
class UDTFRegistration extends Logging
Functions for registering user-defined table functions. Use SparkSession.udtf to access this.
Annotations
@Evolving()
Since
3.5.0

Value Members

object SQLContext extends Serializable
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the created SQLContext instance.
It also provides utility functions to support per-thread preferences in multi-session scenarios: setActive sets a SQLContext for the current thread, which getOrCreate then returns instead of the global one.
object SparkSession extends Logging with Serializable
Annotations
@Stable()
