package connect
Type Members
- class Catalog extends catalog.Catalog
- trait ConnectConversions extends AnyRef
Conversions from sql interfaces to the Connect-specific implementation.
This class is mainly used by the implementation. It is also meant to be used by extension developers.
We provide both a trait and an object. The trait is useful when an extension developer needs to use these conversions in a project that covers multiple Spark versions: they can create a shim for these conversions in which the Spark 4+ version of the shim implements this trait and the shims for older versions do not.
- Annotations
- @DeveloperApi()
- type DataFrame = Dataset[Row]
- final class DataFrameNaFunctions extends sql.DataFrameNaFunctions
Functionality for working with missing data in DataFrames.
- Since
3.4.0
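As a minimal sketch of how the missing-data helpers are typically reached through `na` on a DataFrame: the server address, column names, and sample rows below are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Hypothetical sample data with one missing age.
val df = Seq[(String, Option[Int])](("alice", Some(30)), ("bob", None))
  .toDF("name", "age")

// Replace nulls in the "age" column with 0.
val filled = df.na.fill(0L, Seq("age"))

// Alternatively, drop every row that contains a null value.
val dropped = df.na.drop()

filled.show()
dropped.show()
```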
- class DataFrameReader extends sql.DataFrameReader
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this.
- Annotations
- @Stable()
- Since
3.4.0
- final class DataFrameStatFunctions extends sql.DataFrameStatFunctions
Statistic functions for DataFrames.
- Since
3.4.0
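The statistic functions are usually reached through `stat` on a DataFrame. The following sketch assumes a placeholder server address and made-up sample data; it computes a Pearson correlation and exact quartiles.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq((1, 2.0), (2, 4.1), (3, 6.2), (4, 7.9)).toDF("x", "y")

// Pearson correlation coefficient between two numeric columns.
val r: Double = df.stat.corr("x", "y")

// Quartiles of "y"; a relative error of 0.0 requests exact quantiles.
val quartiles: Array[Double] =
  df.stat.approxQuantile("y", Array(0.25, 0.5, 0.75), 0.0)

println(s"corr = $r, quartiles = ${quartiles.mkString(", ")}")
```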
- final class DataFrameWriter[T] extends sql.DataFrameWriter[T]
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
- Annotations
- @Stable()
- Since
3.4.0
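A minimal round-trip sketch that uses DataFrameWriter and DataFrameReader together. The server address and the output path are hypothetical; with Spark Connect the path is resolved by the server, not by the client machine.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Hypothetical output location, resolved on the server side.
val path = "/tmp/people_parquet"

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Dataset.write returns a DataFrameWriter; overwrite any previous output.
people.write.mode("overwrite").parquet(path)

// SparkSession.read returns a DataFrameReader.
val loaded = spark.read.parquet(path)
loaded.printSchema()
loaded.show()
```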
- final class DataFrameWriterV2[T] extends sql.DataFrameWriterV2[T]
Interface used to write an org.apache.spark.sql.Dataset to external storage using the v2 API.
- Annotations
- @Experimental()
- Since
3.4.0
- final class DataStreamReader extends streaming.DataStreamReader
Interface used to load a streaming Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.readStream to access this.
- Annotations
- @Evolving()
- Since
3.5.0
- final class DataStreamWriter[T] extends streaming.DataStreamWriter[T]
Interface used to write a streaming Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.writeStream to access this.
- Annotations
- @Evolving()
- Since
3.5.0
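The streaming reader and writer can be sketched together with Spark's built-in `rate` test source and `console` sink. The server address, query name, and five-second run time are assumptions chosen for illustration; console output appears wherever the query executes, which is the server in a Connect deployment.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// SparkSession.readStream returns a DataStreamReader.
// The "rate" source generates rows with a timestamp and a counter value.
val ticks = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Dataset.writeStream returns a DataStreamWriter; start() launches the query.
val query = ticks.writeStream
  .format("console")
  .queryName("rate_demo") // hypothetical query name
  .start()

// Let the query run for about five seconds, then stop it.
query.awaitTermination(5000)
query.stop()
```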
- class Dataset[T] extends sql.Dataset[T]
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
    val people = spark.read.parquet("...").as[Person]  // Scala
    Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map on the existing one:
    val names = people.map(_.name)  // in Scala; names is a Dataset[String]
    Dataset<String> names = people.map((MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // in Java
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
    val ageCol = people("age")  // in Scala
    Column ageCol = people.col("age"); // in Java
Note that the Column type can also be manipulated through its various functions.
    // The following creates a new column that increases everybody's age by 10.
    people("age") + 10  // in Scala
    people.col("age").plus(10);  // in Java
A more concrete example in Scala:
    // To create Dataset[Row] using SparkSession
    val people = spark.read.parquet("...")
    val department = spark.read.parquet("...")

    people.filter("age > 30")
      .join(department, people("deptId") === department("id"))
      .groupBy(department("name"), people("gender"))
      .agg(avg(people("salary")), max(people("age")))
and in Java:
    // To create Dataset<Row> using SparkSession
    Dataset<Row> people = spark.read().parquet("...");
    Dataset<Row> department = spark.read().parquet("...");

    people.filter(people.col("age").gt(30))
      .join(department, people.col("deptId").equalTo(department.col("id")))
      .groupBy(department.col("name"), people.col("gender"))
      .agg(avg(people.col("salary")), max(people.col("age")));
- Since
3.4.0
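The lazy-evaluation behaviour of Dataset can be sketched as follows. The server address is a placeholder; the example relies only on `range`, `filter`, `explain`, `printSchema`, and `count`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// Transformations only build up a logical plan; nothing is computed yet.
val evens = spark.range(0, 100).filter("id % 2 = 0")

// Print the logical plans and the physical plan chosen by the optimizer.
evens.explain(true)

// Print the schema that describes the Dataset's columns.
evens.printSchema()

// An action triggers execution of the plan.
val n: Long = evens.count() // 50 even ids in [0, 100)
println(n)
```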
- class DatasetHolder[U] extends sql.DatasetHolder[U]
- class KeyValueGroupedDataset[K, V] extends sql.KeyValueGroupedDataset[K, V]
A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
- Since
3.5.0
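A minimal sketch of obtaining a KeyValueGroupedDataset through `groupByKey`. The server address and sample words are assumptions. Because the grouping function is a Scala lambda, a Connect deployment may additionally need the client's compiled classes to be made available to the server; that setup is not shown here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val words = Seq("spark", "scala", "connect", "sql").toDS()

// groupByKey returns a KeyValueGroupedDataset[Int, String] keyed by word length.
val byLength = words.groupByKey(_.length)

// count() yields a Dataset[(Int, Long)] of (key, number of elements).
val counts = byLength.count()
counts.show()
```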
- class MergeIntoWriter[T] extends sql.MergeIntoWriter[T]
MergeIntoWriter provides methods to define and execute merge actions based on specified conditions.
- T
the type of data in the Dataset.
- Annotations
- @Experimental()
- Since
4.0.0
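A hedged sketch of a merge built with MergeIntoWriter, obtained through `Dataset.mergeInto`. The table names `source` and `target`, their `id` columns, and the server address are hypothetical, and the target table is assumed to live in a catalog that supports row-level merge operations.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// Hypothetical tables "source" and "target", both with an "id" column.
spark.table("source")
  .mergeInto("target", col("source.id") === col("target.id"))
  .whenMatched()      // rows present in both tables
  .updateAll()        // overwrite the target row with the source row
  .whenNotMatched()   // rows present only in the source
  .insertAll()        // insert them into the target
  .merge()            // execute the merge
```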
- case class ProtoColumnNode(expr: Expression, origin: Origin = CurrentOrigin.get) extends ColumnNode with Product with Serializable
- class RelationalGroupedDataset extends sql.RelationalGroupedDataset
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean, sum for convenience.
- Since
3.4.0
- Note
This class was named GroupedData in Spark 1.x.
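The aggregation entry points on RelationalGroupedDataset can be sketched as below. The server address, column names, and sample rows are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val staff = Seq(
  ("eng", "f", 120.0, 34),
  ("eng", "m", 110.0, 41),
  ("ops", "f", 95.0, 29)
).toDF("dept", "gender", "salary", "age")

// groupBy returns a RelationalGroupedDataset; agg accepts several aggregates.
staff.groupBy("dept").agg(avg("salary"), max("age")).show()

// Convenience statistics such as mean are available directly.
staff.groupBy("dept").mean("salary").show()

// pivot turns the distinct values of a column into output columns.
staff.groupBy("dept").pivot("gender").sum("salary").show()
```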
- class RemoteStreamingQuery extends StreamingQuery
- class RuntimeConfig extends sql.RuntimeConfig with Logging
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
- Since
3.4.0
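A short sketch of reading and changing runtime configuration through `SparkSession.conf`. The server address and the key `my.hypothetical.key` are assumptions; `spark.sql.shuffle.partitions` is a standard Spark SQL setting.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// Set a runtime SQL configuration value for this session.
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Read it back as a string.
val partitions: String = spark.conf.get("spark.sql.shuffle.partitions")
println(partitions)

// getOption avoids an exception when a key has no value.
val missing: Option[String] = spark.conf.getOption("my.hypothetical.key")
println(missing)

// Restore the default for the key that was changed above.
spark.conf.unset("spark.sql.shuffle.partitions")
```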
- class SQLContext extends sql.SQLContext
- Annotations
- @Stable()
- abstract class SQLImplicits extends sql.SQLImplicits
- class SparkSession extends sql.SparkSession with Logging
The entry point to programming Spark with the Dataset and DataFrame API.
In environments where this has been created upfront (e.g. REPL, notebooks), use the builder to get an existing session:
    SparkSession.builder().getOrCreate()
The builder can also be used to create a new session:
    SparkSession.builder
      .remote("sc://localhost:15001/myapp")
      .getOrCreate()
- trait StreamingQuery extends streaming.StreamingQuery
- class StreamingQueryListenerBus extends Logging
- class StreamingQueryManager extends streaming.StreamingQueryManager with Logging
A class to manage all the StreamingQuery active in a SparkSession.
- Annotations
- @Evolving()
- Since
3.5.0
- case class SubqueryExpressionNode(relation: Relation, subqueryType: SubqueryType, origin: Origin = CurrentOrigin.get) extends ColumnNode with Product with Serializable
- sealed trait SubqueryType extends AnyRef
- class TableValuedFunction extends sql.TableValuedFunction
- class UDFRegistration extends sql.UDFRegistration
Functions for registering user-defined functions. Use SparkSession.udf to access this:
    spark.udf
- Since
3.5.0
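A minimal sketch of registering a Scala function through UDFRegistration and calling it from SQL. The server address and the function name `plusOne` are assumptions. As with other Scala lambdas, a Connect deployment may need the client's compiled classes to be available to the server, which is not shown here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a Spark Connect server is reachable at this placeholder address.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()

// Register a Scala function under a hypothetical SQL-visible name.
val plusOne = spark.udf.register("plusOne", (x: Int) => x + 1)

// The function can now be referenced by name in SQL text.
spark.sql("SELECT plusOne(41) AS answer").show()

// The returned UserDefinedFunction also works in the DataFrame API.
spark.range(3).select(plusOne(org.apache.spark.sql.functions.col("id").cast("int"))).show()
```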
Value Members
- object ColumnNodeToProtoConverter extends (ColumnNode) => Expression
Converter for ColumnNode to proto.Expression conversions.
- object ConnectConversions extends ConnectConversions
- object ConnectProtoUtils
Utility functions for parsing Spark Connect protocol buffers with a recursion limit. This is intended to be used by plugins, as they cannot use ProtoUtils.parseWithRecursionLimit due to the shading of the com.google.protobuf package.
- object RemoteStreamingQuery
- object SQLContext extends SQLContextCompanion with Serializable
- object SparkSession extends SparkSessionCompanion with Logging with Serializable
- object SubqueryType