:: Experimental :: A column that will be computed based on the data in a DataFrame.
A new column is constructed based on the input columns present in a DataFrame:
df("columnName") // On a specific DataFrame. col("columnName") // A generic column no yet associated with a DataFrame. col("columnName.field") // Extracting a struct field col("`a.column.with.dots`") // Escape `.` in column names. $"columnName" // Scala short hand for a named column. expr("a + 1") // A column that is constructed from a parsed SQL Expression. lit("abc") // A column that produces a literal (constant) value.
Column objects can be composed to form complex expressions:
$"a" + 1 $"a" === $"b"
1.3.0
:: Experimental :: A convenient class used for constructing schema.
1.3.0
:: Experimental :: A handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe.
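For illustration, a minimal sketch of working with such a handle in Scala (assuming a query that was started elsewhere; only isActive and awaitTermination are used):

import org.apache.spark.sql.ContinuousQuery

// A minimal sketch, assuming `query` is a handle to an already-started streaming
// query: block the current thread until the query terminates (via stop() or an error).
def waitFor(query: ContinuousQuery): Unit = {
  if (query.isActive) {
    query.awaitTermination()
  }
}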
2.0.0
:: Experimental :: Exception that stopped a ContinuousQuery.
2.0.0
:: Experimental :: A class to manage all the ContinuousQueries active on a SparkSession.
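For illustration, a hedged sketch (assuming the manager is reached through SparkSession.streams, and that active, name, and awaitAnyTermination are available as in later Spark releases):

import org.apache.spark.sql.SparkSession

// A minimal sketch: list the continuous queries running in this session and
// block until any one of them terminates.
def listAndWait(spark: SparkSession): Unit = {
  val manager = spark.streams
  manager.active.foreach(q => println(s"active query: ${q.name}"))
  manager.awaitAnyTermination()
}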
2.0.0
:: Experimental :: Functionality for working with missing data in DataFrames.
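For illustration, a minimal sketch in Scala (the "age" column name is hypothetical; the functions are reached through Dataset.na):

import org.apache.spark.sql.DataFrame

// A minimal sketch: fill nulls in a hypothetical numeric "age" column with 0.0,
// then drop any rows that still contain nulls in other columns.
def handleMissing(df: DataFrame): DataFrame = {
  df.na.fill(0.0, Seq("age"))
    .na.drop()
}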
1.3.1
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc) or data streams. Use SparkSession.read to access this.
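For example, a minimal sketch in Scala (the path is hypothetical):

import org.apache.spark.sql.{DataFrame, SparkSession}

// A minimal sketch: describe the source with format(...), then load it into a
// DataFrame. Shortcuts such as spark.read.json(...) and spark.read.parquet(...)
// also exist.
def loadInput(spark: SparkSession): DataFrame = {
  spark.read.format("json").load("/path/to/input.json")  // hypothetical path
}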
1.4.0
:: Experimental :: Statistic functions for DataFrames.
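For illustration, a minimal sketch with hypothetical column names (the functions are reached through Dataset.stat):

import org.apache.spark.sql.DataFrame

// A minimal sketch: Pearson correlation of two hypothetical numeric columns,
// plus the frequent items of one of them.
def describeStats(df: DataFrame): Unit = {
  val corr = df.stat.corr("x", "y")
  println(s"corr(x, y) = $corr")
  df.stat.freqItems(Seq("x")).show()
}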
1.4.0
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc) or data streams. Use Dataset.write to access this.
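For example, a minimal sketch in Scala (the output path is hypothetical):

import org.apache.spark.sql.{DataFrame, SaveMode}

// A minimal sketch: describe the save with mode/format, then write the data out.
def saveOutput(df: DataFrame): Unit = {
  df.write
    .mode(SaveMode.Overwrite)   // replace any existing data at the target
    .format("parquet")
    .save("/path/to/output")    // hypothetical path
}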
1.4.0
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
a Dataset represents a logical plan that describes the computation required to produce the data.
When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
physical plan for efficient execution in a parallel and distributed manner. To explore the
logical plan as well as optimized physical plan, use the explain
function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING()); // in Java 8
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
val ageCol = people("age") // in Scala Column ageCol = people.col("age") // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10           // in Scala
people.col("age").plus(10);  // in Java
A more concrete example in Scala:
// To create a Dataset[Row] using a SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create a Dataset<Row> using a SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
1.6.0
A container for a Dataset, used for implicit conversions in Scala.
To use this, import implicit conversions in SQL:
import sqlContext.implicits._
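A minimal sketch of what the conversions enable (Person is a hypothetical case class; a SparkSession's implicits are used here, but the sqlContext.implicits._ import above works the same way):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)   // hypothetical domain class

// With the implicits imported, a local Seq is wrapped in a DatasetHolder,
// which supplies toDS() and toDF().
def samplePeople(spark: SparkSession): Dataset[Person] = {
  import spark.implicits._
  Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
}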
1.6.0
:: Experimental :: Holder for experimental methods for the bravest. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
spark.experimental.extraStrategies += ...
1.3.0
:: Experimental :: A Dataset has been logically grouped by a user specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
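For example, a minimal sketch (Person is a hypothetical case class; the session's implicits supply the required encoders):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)   // hypothetical domain class

// A minimal sketch: group a typed Dataset by a key derived from each element,
// then count the elements per key.
def countByName(spark: SparkSession, people: Dataset[Person]): Dataset[(String, Long)] = {
  import spark.implicits._
  people.groupByKey(_.name).count()
}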
2.0.0
:: Experimental :: A trigger that runs a query periodically based on the processing time. If interval is 0, the query will run as fast as possible.
Scala Example:
df.write.trigger(ProcessingTime("10 seconds")) import scala.concurrent.duration._ df.write.trigger(ProcessingTime(10.seconds))
Java Example:
df.write.trigger(ProcessingTime.create("10 seconds"));

import java.util.concurrent.TimeUnit;
df.write.trigger(ProcessingTime.create(10, TimeUnit.SECONDS));
A set of methods for aggregations on a DataFrame, created by Dataset.groupBy.
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
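For example, a minimal sketch with hypothetical column names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, max}

// A minimal sketch: Dataset.groupBy returns a RelationalGroupedDataset, and agg
// computes the requested aggregates for each group.
def salaryStats(employees: DataFrame): DataFrame = {
  employees.groupBy("deptId")
    .agg(avg("salary"), max("age"))
}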
2.0.0
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
Options set here are automatically propagated to the Hadoop configuration during I/O.
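For example, a minimal sketch (the option shown is the standard shuffle-partition setting):

import org.apache.spark.sql.SparkSession

// A minimal sketch: set and read back a SQL option through SparkSession.conf.
def tuneShuffle(spark: SparkSession): Unit = {
  spark.conf.set("spark.sql.shuffle.partitions", "64")
  println(spark.conf.get("spark.sql.shuffle.partitions"))
}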
2.0.0
The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
1.0.0
A collection of implicit methods for converting common Scala objects into DataFrames.
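For example, a minimal sketch (column and value names are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// A minimal sketch: with a session's implicits imported, local collections gain
// toDF/toDS and the $"col" syntax becomes available.
def wordsFrame(spark: SparkSession): DataFrame = {
  import spark.implicits._
  val words = Seq("spark", "sql").toDF("word")   // DataFrame with one "word" column
  words.select($"word")
}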
1.6.0
:: Experimental :: Status and metrics of a streaming Sink.
2.0.0
:: Experimental :: Status and metrics of a streaming Source.
2.0.0
The entry point to programming Spark with the Dataset and DataFrame API.
To create a SparkSession, use the following builder pattern:
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
:: Experimental :: Used to indicate how often results should be produced by a ContinuousQuery.
A Column where an Encoder has been given for the expected input and return type.
To create a TypedColumn, use the as function on a Column.
T: the input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
U: the output type of this column.
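For example, a minimal sketch (column names are hypothetical; the session's implicits supply the encoders):

import org.apache.spark.sql.{Dataset, SparkSession}

// A minimal sketch: as[...] turns a Column into a TypedColumn, and selecting
// typed columns yields a typed Dataset instead of a DataFrame.
def namesAndAges(spark: SparkSession): Dataset[(String, Int)] = {
  import spark.implicits._
  val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
  df.select($"name".as[String], $"age".as[Int])
}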
1.6.0
Functions for registering user-defined functions. Use SQLContext.udf to access this.
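For example, a minimal sketch (the UDF name "strLen" is illustrative):

import org.apache.spark.sql.SQLContext

// A minimal sketch: register a Scala function as a UDF through SQLContext.udf
// and call it from SQL.
def registerAndUse(sqlContext: SQLContext): Unit = {
  sqlContext.udf.register("strLen", (s: String) => s.length)
  sqlContext.sql("SELECT strLen('hello')").show()
}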
1.3.0
:: Experimental :: Used to create ProcessingTime triggers for ContinuousQuerys.
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the created SQLContext instance.
It also provides utility functions to support a preferred SQLContext per thread in multi-session scenarios: setActive sets a SQLContext for the current thread, which getOrCreate will then return instead of the global one.
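For example, a minimal sketch of the singleton accessor (assuming an existing SparkContext):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// A minimal sketch: getOrCreate returns the SQLContext set for the current
// thread via setActive if there is one, otherwise the global singleton
// (creating it if needed).
def singletonContext(sc: SparkContext): SQLContext = {
  SQLContext.getOrCreate(sc)
}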
Contains API classes that are specific to a single language (i.e. Java).
The physical execution component of Spark SQL. Note that this is a private package. All classes in catalyst are considered an internal API to Spark SQL and are subject to change between minor releases.
:: Experimental :: Functions available for DataFrame.
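For example, a minimal sketch with a hypothetical "name" column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, upper}

// A minimal sketch: column expressions built from the functions object, used in
// an untyped select.
def shout(df: DataFrame): DataFrame = {
  df.select(upper(col("name")), lit(1).as("one"))
}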
1.3.0
All classes in this package are considered an internal API to Spark and are subject to change between minor releases.
A set of APIs for adding data sources to Spark SQL.
Allows the execution of relational queries, including those expressed in SQL using Spark.