Applies an action to the underlying Dataset.
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
2.0.0
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
2.0.0
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg("age" -> "max", "salary" -> "avg")
ds.groupBy().agg("age" -> "max", "salary" -> "avg")
2.0.0
Returns a new Dataset with an alias set. Same as the as method.
2.0.0
Returns a new Dataset with an alias set. Same as the as method.
2.0.0
:: Experimental :: Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
When U is a class, fields of the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc.), the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
1.6.0
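The three mapping rules can be sketched as follows. This is a minimal illustration on a plain Spark Dataset, assuming a local SparkSession; the Person class, the sample rows, and the object name AsSketch are hypothetical and not part of the API described above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Long)

object AsSketch {
  // Returns the same rows viewed as a class, as a tuple, and as a primitive.
  def views(spark: SparkSession): (List[Person], List[(String, Long)], List[String]) = {
    import spark.implicits._
    val df = Seq(("Ann", 30L), ("Bob", 41L)).toDF("name", "age")

    // Class: fields are matched to columns by name.
    val people: Dataset[Person] = df.as[Person]
    // Tuple: columns are matched by ordinal (_1 = first column).
    val pairs: Dataset[(String, Long)] = df.as[(String, Long)]
    // Primitive: a single column is selected and read as String.
    val names: Dataset[String] = df.select("name").as[String]

    (people.collect().toList, pairs.collect().toList, names.collect().toList)
  }

  // When column names do not match the class, rename with select and as first.
  def renamed(spark: SparkSession): List[Person] = {
    import spark.implicits._
    val raw = Seq(("Cid", 25L)).toDF("n", "a")
    raw.select($"n".as("name"), $"a".as("age")).as[Person].collect().toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("as-sketch").getOrCreate()
    try { println(views(spark)); println(renamed(spark)) } finally spark.stop()
  }
}
```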
Returns a new Dataset with an alias set.
2.0.0
Returns a new Dataset with an alias set.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
2.1.0
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
2.1.0
Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
1.6.0
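The difference between coalesce and repartition can be sketched on a plain Spark Dataset. This assumes a local SparkSession; the object name CoalesceSketch and the starting count of 8 partitions are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  // Partition counts after coalesce(2), coalesce(16) and repartition(1),
  // starting from a Dataset that has 8 partitions.
  def partitionCounts(spark: SparkSession): Seq[Int] = {
    val ds = spark.range(0L, 1000L).repartition(8)
    Seq(
      ds.coalesce(2).rdd.getNumPartitions,   // fewer requested: narrow dependency, no shuffle
      ds.coalesce(16).rdd.getNumPartitions,  // more requested: stays at the current number
      ds.repartition(1).rdd.getNumPartitions // adds a shuffle; upstream partitions still run in parallel
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("coalesce-sketch").getOrCreate()
    try println(partitionCounts(spark)) finally spark.stop()
  }
}
```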
Selects column based on the column name and returns it as a Column.
2.0.0
The column name can also refer to a nested column like a.b.
Selects column based on the column name specified as a regex and returns it as Column.
2.3.0
Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns all column names as an array.
1.6.0
Returns the number of rows in the Dataset.
1.6.0
Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
2.1.0
Throws AnalysisException if the view name is invalid or already exists.
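A minimal sketch of the cross-session behaviour on a plain Spark Dataset, assuming a local SparkSession. The view name view1 follows the text above; the object name and sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object GlobalTempViewSketch {
  // Registers a global temp view and reads it back from a different session.
  def countFromOtherSession(spark: SparkSession): Long = {
    import spark.implicits._
    val ds = Seq(1, 2, 3).toDS()
    // Throws AnalysisException if "view1" already exists.
    ds.createGlobalTempView("view1")
    try {
      // A new session shares the application, so the view is visible,
      // but only through the qualified name global_temp.view1.
      spark.newSession().sql("SELECT * FROM global_temp.view1").count()
    } finally {
      // Clean up so the sketch can be run again in the same application.
      spark.catalog.dropGlobalTempView("view1")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("global-view-sketch").getOrCreate()
    try println(countFromOtherSession(spark)) finally spark.stop()
  }
}
```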
Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
2.2.0
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
2.0.0
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.
2.0.0
Throws AnalysisException if the view name is invalid or already exists.
Explicit cartesian join with another DataFrame.
Right side of the join operation.
2.1.0
Cartesian joins are very expensive without an extra filter that can be pushed down.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()

// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
2.0.0
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()

// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
2.0.0
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.describe("age", "height").show()

// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
Use summary for expanded statistics and control over which statistics to compute.
Columns to compute statistics on.
1.6.0
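For the programmatic case mentioned above, a sketch using agg on a plain Spark DataFrame might look like the following. The age column, the sample values, and the object name AggStatsSketch are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

object AggStatsSketch {
  // Returns (min, max, mean) of a hypothetical integer "age" column.
  def ageStats(spark: SparkSession): (Int, Int, Double) = {
    import spark.implicits._
    val df = Seq(18, 30, 42).toDF("age")
    // agg returns a single-row DataFrame whose columns follow the order
    // of the aggregate expressions, so values can be read by position.
    val row = df.agg(min($"age"), max($"age"), avg($"age")).head()
    (row.getInt(0), row.getInt(1), row.getDouble(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("agg-stats-sketch").getOrCreate()
    try println(ageStats(spark)) finally spark.stop()
  }
}
```

Unlike describe, the result schema here is fixed by the expressions you pass, so it is safer to depend on in code.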
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
2.0.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
2.0.0
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).
This method can only be used to drop top level columns. The colName string is treated literally without further interpretation.
2.0.0
Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.
This method can only be used to drop top level columns. The colName string is treated literally without further interpretation.
2.0.0
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
2.0.0
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
2.0.0
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
2.0.0
Returns all column names and their data types as an array.
1.6.0
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT DISTINCT in SQL.
2.0.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL.
2.4.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).
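The difference between EXCEPT DISTINCT and EXCEPT ALL semantics can be sketched on plain Spark Datasets as follows. This assumes a local SparkSession and Spark 2.4 or later for exceptAll; the sample values and the object name are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ExceptSketch {
  // Returns (except result, exceptAll result), each sorted for stable comparison.
  def compare(spark: SparkSession): (List[Int], List[Int]) = {
    import spark.implicits._
    val left = Seq(1, 1, 2, 3).toDS()
    val right = Seq(1).toDS()

    // EXCEPT DISTINCT: every 1 is removed and the result is de-duplicated.
    val distinctRows = left.except(right).collect().toList.sorted
    // EXCEPT ALL: only one occurrence of 1 is removed, so one 1 survives.
    val allRows = left.exceptAll(right).collect().toList.sorted

    (distinctRows, allRows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("except-sketch").getOrCreate()
    try println(compare(spark)) finally spark.stop()
  }
}
```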
Prints the physical plan to the console for debugging purposes.
1.6.0
Prints the plans (logical and physical) to the console for debugging purposes.
1.6.0
Filters rows using the given SQL expression.
peopleDs.filter("age > 15")
1.6.0
Filters rows using the given condition.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
1.6.0
:: Experimental :: (Scala-specific) Returns a new Dataset that only contains elements where func returns true.
1.6.0
Returns the first row. Alias for head().
1.6.0
Alias for headOption.
:: Experimental :: (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
1.6.0
Applies a function f to all rows.
1.6.0
Applies a function f to each partition of this Dataset.
1.6.0
Applies an action to the underlying Dataset. It is used for transformations that can fail due to an AnalysisException.
Transforms the Dataset into a RelationalGroupedDataset.
Groups the Dataset using the specified columns, so we can run aggregations on them.
See UnderlyingDataset.groupBy for more information.
:: Experimental :: (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
2.0.0
Returns the first row.
1.6.0
Returns the first n rows.
1.6.0
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Takes the first element of a dataset or None.
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plans can be broadcast:
df1.join(df2.hint("broadcast"))
2.2.0
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
2.0.0
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
1.6.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to INTERSECT ALL in SQL.
2.4.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).
Returns true if the Dataset is empty.
2.4.0
Returns true if the collect and take methods can be run locally (without any Spark executors).
1.6.0
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
2.0.0
Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
Right side of the join.
Join expression.
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
2.0.0
Inner join with another DataFrame, using the given join expression.
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
2.0.0
Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Right side of the join operation.
Names of the columns to join on. These columns must exist on both sides.
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
2.0.0
If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
Right side of the join operation.
Names of the columns to join on. These columns must exist on both sides.
2.0.0
If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
Right side of the join operation.
Name of the column to join on. This column must exist on both sides.
2.0.0
If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Join with another DataFrame.
Behaves as an INNER JOIN and requires a subsequent join predicate.
Right side of the join operation.
2.0.0
:: Experimental :: Using inner equi-join to join this Dataset returning a Tuple2 for each pair where condition evaluates to true.
Right side of the join.
Join expression.
1.6.0
:: Experimental :: Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relation join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Right side of the join.
Join expression.
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer.
1.6.0
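The nested result schema can be sketched on plain Spark Datasets as follows. This assumes a local SparkSession; the User and Order classes, their fields, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, used only for this illustration.
case class User(id: Int, name: String)
case class Order(userId: Int, amount: Double)

object JoinWithSketch {
  // Joins users to orders and keeps both sides as typed objects.
  def joined(spark: SparkSession): Dataset[(User, Order)] = {
    import spark.implicits._
    val users = Seq(User(1, "Ann"), User(2, "Bob")).toDS()
    val orders = Seq(Order(1, 9.5), Order(1, 3.0)).toDS()
    // Each result row is a Tuple2, exposed as the columns _1 and _2.
    users.joinWith(orders, users("id") === orders("userId"), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("joinwith-sketch").getOrCreate()
    try {
      val ds = joined(spark)
      println(ds.columns.toList)
      ds.collect().foreach(println)
    } finally spark.stop()
  }
}
```

Because both sides stay nested, the two classes may share field names without any column ambiguity after the join.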
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
2.0.0
Locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and, despite being potentially faster, they are unreliable and may compromise job completion.
2.3.0
Eagerly locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and, despite being potentially faster, they are unreliable and may compromise job completion.
2.3.0
:: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
1.6.0
:: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
1.6.0
Returns a DataFrameNaFunctions for working with missing data.
// Dropping rows containing any null values.
ds.na.drop()
1.6.0
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
2.0.0
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
2.0.0
Persist this Dataset with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Prints the schema to the console in a nice tree format.
1.6.0
Transform the dataset into an RDD.
See UnderlyingDataset.rdd for more information.
:: Experimental :: (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
1.6.0
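A short sketch of reduce on a plain Spark Dataset, using two functions that are both commutative and associative. It assumes a local SparkSession; the sample values and the object name are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ReduceSketch {
  // Returns (sum, max) of a small Dataset spread over two partitions.
  def sumAndMax(spark: SparkSession): (Int, Int) = {
    import spark.implicits._
    val ds = Seq(1, 2, 3, 4).toDS().repartition(2)
    // Addition and max give the same answer regardless of how the
    // elements are split across partitions or in what order they combine.
    val sum = ds.reduce(_ + _)
    val largest = ds.reduce((a, b) => math.max(a, b))
    (sum, largest)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("reduce-sketch").getOrCreate()
    try println(sumAndMax(spark)) finally spark.stop()
  }
}
```

A function such as subtraction is neither commutative nor associative, so its result would depend on partitioning and evaluation order.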
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset that has exactly numPartitions partitions.
1.6.0
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
2.3.0
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
2.3.0
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
2.0.0
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns rolled up by department and group.
ds.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
2.0.0
Returns a new Dataset by sampling a fraction of rows, using a random seed.
Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
Sample with replacement or not.
Fraction of rows to generate, range [0.0, 1.0].
Seed for sampling.
1.6.0
This is NOT guaranteed to provide exactly the fraction of the count of the given Dataset.
Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.
Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.
Returns the schema of this Dataset.
1.6.0
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
2.0.0
Selects a set of column based expressions.
ds.select($"colA", $"colB" + 1)
2.0.0
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
1.6.0
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
1.6.0
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
1.6.0
:: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.
1.6.0
:: Experimental :: Returns a new Dataset by computing the given Column expression for each element.
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
1.6.0
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
2.0.0
Displays the top rows of Dataset in a tabular form.
See UnderlyingDataset.show for more information.
Displays the top 20 rows of Dataset in a tabular form.
See UnderlyingDataset.show for more information.
Displays the top 20 rows of Dataset in a tabular form. Strings with more than 20 characters will be truncated.
See UnderlyingDataset.show for more information.
Displays the top rows of Dataset in a tabular form. Strings with more than 20 characters will be truncated.
See UnderlyingDataset.show for more information.
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
2.0.0
Returns a new Dataset sorted by the specified column, all in ascending order.
Returns a new Dataset sorted by the specified column, all in ascending order.
// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
2.0.0
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
2.0.0
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
2.0.0
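The per-partition ordering described above can be sketched as follows. This is a minimal illustration written against the plain Spark Dataset API (`org.apache.spark.sql`), not against the wrapper documented here; the local SparkSession, the object name, and the sample values are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsSketch {
  // Assumption: a local session with two worker threads; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[2]").appName("sort-within-partitions-sketch").getOrCreate()
  import spark.implicits._

  val ds = Seq(5, 3, 8, 1, 9, 2).toDS()

  // Each of the two partitions is sorted on its own; no global order is implied.
  val sorted = ds.repartition(2).sortWithinPartitions($"value")

  // glom() gathers every partition into an array so its internal order can be inspected.
  val partitions: Array[Array[Int]] = sorted.rdd.glom().collect()

  def main(args: Array[String]): Unit = {
    partitions.foreach(p => println(p.mkString(", ")))
    spark.stop()
  }
}
```

Compared with `sort`, this avoids the range-partitioning shuffle needed to establish a total order, which is usually the reason to prefer it when only a local order matters.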
Returns a DataFrameStatFunctions that provides support for statistical functions.
Returns a DataFrameStatFunctions that provides support for statistical functions.
// Finding frequent items in column with name 'a'.
ds.stat.freqItems(Seq("a"))
1.6.0
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
2.1.0
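As a rough illustration of how the storage level changes over a persist/unpersist cycle, the sketch below uses the plain Spark API. The session setup and object name are assumptions, and the wrapper documented here may expose these calls with different return types.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("storage-level-sketch").getOrCreate()
  import spark.implicits._

  val ds = Seq(1, 2, 3).toDS()

  // Not persisted yet, so the level is StorageLevel.NONE.
  val before: StorageLevel = ds.storageLevel

  // persist() registers the Dataset with the cache manager.
  ds.persist(StorageLevel.MEMORY_ONLY)
  val whilePersisted: StorageLevel = ds.storageLevel

  // Blocking unpersist: returns once the cached blocks have been removed.
  ds.unpersist(blocking = true)
  val after: StorageLevel = ds.storageLevel

  def main(args: Array[String]): Unit = {
    println(s"before=$before, persisted=$whilePersisted, after=$after")
    spark.stop()
  }
}
```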
Computes specified statistics for numeric and string columns.
Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%).
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.summary().show()
// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// 25%     24.0  176.0
// 50%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0
ds.summary("count", "min", "25%", "75%", "max").show()
// output:
// summary age   height
// count   10.0  10.0
// min     18.0  163.0
// 25%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
See also describe for basic statistics.
Statistics from above list to be computed.
2.3.0
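For the programmatic route recommended above, a small sketch with `agg` might look like this. It uses the plain Spark API; the session, the object name, and the sample `people` data are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, max, min}

object AggSummarySketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("agg-summary-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data.
  val people = Seq(("Ann", 18), ("Bob", 24), ("Cid", 32)).toDF("name", "age")

  // One row with typed values, in the order the aggregates are listed.
  private val row = people.agg(count($"age"), min($"age"), max($"age"), avg($"age")).head()

  val n: Long = row.getLong(0)
  val youngest: Int = row.getInt(1)
  val oldest: Int = row.getInt(2)
  val mean: Double = row.getDouble(3)

  def main(args: Array[String]): Unit = {
    println(s"count=$n min=$youngest max=$oldest mean=$mean")
    spark.stop()
  }
}
```

Unlike `summary`, whose result holds every statistic as a string column, the aggregate row keeps numeric types, so the values can be used directly in further computation.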
Computes specified statistics for numeric and string columns.
Computes specified statistics for numeric and string columns.
See UnderlyingDataset.summary for more information.
Returns the first n rows in the Dataset.
Returns the first n rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
1.6.0
Converts this strongly typed collection of data to generic DataFrame with columns renamed.
Converts this strongly typed collection of data to generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2`
rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
2.0.0
Converts this strongly typed collection of data to a generic DataFrame.
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
1.6.0
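The difference between typed objects and generic Row objects can be sketched as below, using the plain Spark API. `SketchPerson`, the object name, and the session setup are hypothetical names introduced for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical record type; declared at top level so Spark can derive an encoder for it.
case class SketchPerson(name: String, age: Int)

object ToDFSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("to-df-sketch").getOrCreate()
  import spark.implicits._

  val ds = Seq(SketchPerson("Ann", 18)).toDS()

  // Typed access: fields of the case class.
  val typedName: String = ds.head().name

  // Untyped access: a generic Row, read by ordinal or by column name.
  val row: Row = ds.toDF().head()
  val byOrdinal: String = row.getString(0)
  val byName: Int = row.getAs[Int]("age")

  def main(args: Array[String]): Unit = {
    println(s"$typedName $byOrdinal $byName")
    spark.stop()
  }
}
```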
Returns the content of the Dataset as a Dataset of JSON strings.
Returns the content of the Dataset as a Dataset of JSON strings.
2.0.0
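A short sketch of the JSON conversion, written against the plain Spark API with a hypothetical one-row DataFrame; the session setup and object name are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ToJsonSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("to-json-sketch").getOrCreate()
  import spark.implicits._

  // Each row becomes one JSON object string keyed by column name.
  val json: Array[String] = Seq(("Ann", 18)).toDF("name", "age").toJSON.collect()

  def main(args: Array[String]): Unit = {
    json.foreach(println)
    spark.stop()
  }
}
```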
Returns an iterator that contains all rows in this Dataset.
Returns an iterator that contains all rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
2.0.0
Note: this results in multiple Spark jobs. If the input Dataset is the result of a wide transformation (e.g. a join with different partitioners), cache the input Dataset first to avoid recomputing it.
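A minimal sketch of caching before iterating locally, using the plain Spark API (where `toLocalIterator` returns a `java.util.Iterator`). The session, the object name, and the repartitioning step that stands in for a wide transformation are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LocalIteratorSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[2]").appName("local-iterator-sketch").getOrCreate()
  import spark.implicits._

  // repartition stands in for a wide transformation; cache() avoids recomputing it
  // for each of the jobs that the iterator triggers.
  val ds = Seq(3, 1, 2).toDS().repartition(3).cache()

  val total: Int = {
    val it = ds.toLocalIterator() // partitions are fetched to the driver one at a time
    var sum = 0
    while (it.hasNext) sum += it.next()
    sum
  }

  def main(args: Array[String]): Unit = {
    println(total)
    ds.unpersist()
    spark.stop()
  }
}
```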
Chains custom transformations.
Chains custom transformations.
See UnderlyingDataset.transform for more information.
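Chaining might look like the sketch below, shown with the plain Spark `transform` method. The helper functions `withDoubled` and `onlyLarge` are hypothetical, as are the session setup and the sample values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TransformSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("transform-sketch").getOrCreate()
  import spark.implicits._

  // Hypothetical reusable steps, each a plain function from DataFrame to DataFrame.
  def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", col("value") * 2)
  def onlyLarge(df: DataFrame): DataFrame = df.filter(col("doubled") > 2)

  // transform lets the steps read top to bottom instead of nesting calls.
  val result: DataFrame = Seq(1, 2, 3).toDF("value").transform(withDoubled).transform(onlyLarge)

  val doubled: Array[Int] = result.select("doubled").as[Int].collect().sorted

  def main(args: Array[String]): Unit = {
    println(doubled.mkString(", "))
    spark.stop()
  }
}
```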
Applies a transformation to the underlying Dataset.
Applies a transformation to the underlying Dataset. It is used for transformations that can fail due to an AnalysisException.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+
Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.
2.0.0
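The UNION ALL behaviour and the "union followed by distinct" idiom can be compared in a small sketch. It uses the plain Spark API; the session, the object name, and the sample values are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object UnionSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()
  import spark.implicits._

  val a = Seq(1, 2, 3).toDS()
  val b = Seq(3, 4).toDS()

  // UNION ALL semantics: the shared value 3 appears twice.
  val all: Array[Int] = a.union(b).collect().sorted

  // SQL-style set union: duplicates removed by distinct.
  val deduplicated: Array[Int] = a.union(b).distinct().collect().sorted

  def main(args: Array[String]): Unit = {
    println(all.mkString(", "))
    println(deduplicated.mkString(", "))
    spark.stop()
  }
}
```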
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+
2.3.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
1.6.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
Whether to block until all blocks are deleted.
1.6.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk in a blocking way.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk in a blocking way.
See UnderlyingDataset.unpersist for more information.
Filters rows using the given SQL expression.
Filters rows using the given SQL expression.
peopleDs.where("age > 15")
1.6.0
Filters rows using the given condition.
Filters rows using the given condition. This is an alias for filter.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
1.6.0
Alias for filter.
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
column's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.
2.0.0
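Both behaviours, adding a column and replacing one with the same name, are shown in the sketch below. It is written against the plain Spark API; the session, the object name, the sample data, and the column name `ageNextYear` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object WithColumnSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("with-column-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(("Ann", 18)).toDF("name", "age")

  // New name: the column is appended to the schema.
  val added = df.withColumn("ageNextYear", $"age" + 1)

  // Existing name: the column is replaced in place, so the schema keeps two columns.
  val replaced = df.withColumn("age", $"age" * 2)

  def main(args: Array[String]): Unit = {
    added.show()
    replaced.show()
    spark.stop()
  }
}
```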
Returns a new Dataset with a column renamed.
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
2.0.0
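The no-op case is easy to miss, so here is a brief sketch using the plain Spark API. The session, the object name, and the column names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RenameSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("rename-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(("Ann", 18)).toDF("name", "age")

  // The existing column "age" is renamed to "years".
  val renamed = df.withColumnRenamed("age", "years")

  // "missing" is not in the schema, so the result has the same columns as df.
  val unchanged = df.withColumnRenamed("missing", "years")

  def main(args: Array[String]): Unit = {
    println(renamed.columns.mkString(", "))
    println(unchanged.columns.mkString(", "))
    spark.stop()
  }
}
```

Because a misspelled `existingName` fails silently, checking `columns` after the call can be a cheap safeguard.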
Defines an event time watermark for this Dataset.
Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.
Spark will use this watermark for several purposes: to know when a given time window aggregation can be finalized and thus emitted when using output modes that do not allow updates, and to minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators.
The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.
the name of the column that contains the event time of the row.
the minimum delay to wait for data to arrive late, relative to the latest record that has been processed, in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
2.1.0
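A windowed aggregation with a watermark could be declared as in the sketch below. It uses the plain Spark structured streaming API with the built-in `rate` source, which emits `timestamp` and `value` columns; the 10 minute threshold, the 5 minute window, the object name, and the session setup are assumptions. The query is only defined here, not started.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[2]").appName("watermark-sketch").getOrCreate()
  import spark.implicits._

  // Test source producing one row per second with `timestamp` and `value` columns.
  val events = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

  // Rows arriving more than 10 minutes behind the latest seen event time may be dropped,
  // which lets Spark finalize old windows and discard their state.
  val counts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window($"timestamp", "5 minutes"))
    .count()

  def main(args: Array[String]): Unit = {
    counts.printSchema()
    spark.stop()
  }
}
```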
Create a DataFrameWriter from this dataset.
Create a DataStreamWriter from this dataset.
Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function.
Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():
ds.select(explode(split('words, " ")).as("word"))
or flatMap():
ds.flatMap(_.words.split(" "))
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
2.0.0
Returns a new Dataset where each row has been expanded to zero or more rows by the provided function.
Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode() or flatMap(). The following example uses these alternatives to count the number of books that contain a given word:
case class Book(title: String, words: String)
val ds: Dataset[Book]
val allWords = ds.select('title, explode(split('words, " ")).as("word"))
val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
Using flatMap() this can similarly be exploded as:
ds.flatMap(_.words.split(" "))
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
2.0.0
Registers this Dataset as a temporary table using the given name.
Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.
(Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.
1.6.0
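The recommended replacement can be sketched as follows, using the plain Spark API. The view name `people`, the object name, the sample rows, and the session setup are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Assumption: a local session; the app name is arbitrary.
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("temp-view-sketch").getOrCreate()
  import spark.implicits._

  val people = Seq(("Ann", 18), ("Bob", 12)).toDF("name", "age")

  // Replaces any existing temporary view with the same name in this session.
  people.createOrReplaceTempView("people")

  // The view can then be queried with SQL through the same session.
  val adults: Array[String] =
    spark.sql("SELECT name FROM people WHERE age > 15").as[String].collect()

  def main(args: Array[String]): Unit = {
    println(adults.mkString(", "))
    spark.stop()
  }
}
```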
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
(Since version 2.0.0) use union()
2.0.0