Persist this TypedDataset with the default storage level (MEMORY_AND_DISK).
apache/spark
Returns a new TypedDataset that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.
apache/spark
Returns an Array that contains all column names in this TypedDataset.
Methods on TypedDataset[T] that go through a full serialization and deserialization of T, and execute outside of the Catalyst runtime.
The correct way to do a projection on a single column is to use the select method as follows:
ds: TypedDataset[(String, String, String)] -> ds.select(ds('_2)).run()
Spark provides an alternative way to obtain the same resulting Dataset, using the map method:
ds: TypedDataset[(String, String, String)] -> ds.deserialized.map(_._2).run()
This second approach is, however, substantially slower than the first one, and should be avoided when possible. Indeed, under the hood this map will deserialize the entire Tuple3 to a full JVM object, call the apply method of the _._2 closure on it, and serialize the resulting String back to its Catalyst representation.
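The contrast above can be sketched end to end. This is a minimal, hedged example: it assumes an implicit SparkSession and the frameless encoders/syntax imports are in scope, as in the frameless quick-start; the data values are made up for illustration.

```scala
import frameless.TypedDataset
import frameless.syntax._

// Assumes: implicit SparkSession in scope, frameless injected encoders available.
val ds: TypedDataset[(String, String, String)] =
  TypedDataset.create(Seq(("a", "b", "c"), ("d", "e", "f")))

// Fast path: stays inside Catalyst; only column _2 is ever decoded.
val viaSelect: Seq[String] = ds.select(ds('_2)).collect().run()

// Slow path: every Tuple3 is fully deserialized to a JVM object first,
// _._2 is applied, and the String is re-encoded.
val viaMap: Seq[String] = ds.deserialized.map(_._2).collect().run()
```

Both produce the same values; only the execution strategy (and its cost) differs.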
Returns a new TypedDataset that contains only the unique elements of this TypedDataset.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
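The encoded-equality caveat can be made concrete with a small sketch. The Tag class and its case-insensitive equals are hypothetical, purely for illustration; an implicit SparkSession and frameless encoders are assumed to be in scope.

```scala
import frameless.TypedDataset
import frameless.syntax._

// Hypothetical case class with case-insensitive JVM equality.
case class Tag(name: String) {
  override def equals(other: Any): Boolean = other match {
    case Tag(n) => n.equalsIgnoreCase(name)
    case _      => false
  }
}

val tags = TypedDataset.create(Seq(Tag("spark"), Tag("SPARK")))

// Tag("spark") == Tag("SPARK") on the JVM, but their encoded rows differ,
// so distinct compares the Catalyst encodings and keeps both elements.
val uniques: Seq[Tag] = tags.distinct.collect().run()
```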
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT in SQL.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
Prints the plans (logical and physical) to the console for debugging purposes.
apache/spark
Returns a best-effort snapshot of the files that compose this TypedDataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
apache/spark
Returns a new TypedDataset that contains only the elements of this TypedDataset that are also present in other.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
Returns true if the collect and take methods can be run locally (without any Spark executors).
apache/spark
Returns true if this TypedDataset contains one or more sources that continuously return data as it arrives. A TypedDataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
apache/spark
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
apache/spark
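The transformation-versus-action distinction can be sketched as follows. This is a minimal example assuming an implicit SparkSession and frameless encoders in scope; the data is illustrative.

```scala
import frameless.TypedDataset
import frameless.syntax._

val ds: TypedDataset[Long] = TypedDataset.create((1L to 100L).toSeq)

// limit is a transformation: it returns a new TypedDataset lazily,
// no Spark job runs at this point.
val firstTen: TypedDataset[Long] = ds.limit(10)

// Execution only happens when a job on the limited dataset is run.
val materialized: Seq[Long] = firstTen.collect().run()

// By contrast, take describes an action: running its Job triggers execution
// and yields a local collection directly.
val taken: Seq[Long] = ds.take(10).run()
```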
Persist this TypedDataset with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
apache/spark
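Picking an explicit level instead of the MEMORY_AND_DISK default looks like this. A minimal sketch, assuming an implicit SparkSession and frameless encoders in scope:

```scala
import frameless.TypedDataset
import frameless.syntax._
import org.apache.spark.storage.StorageLevel

val ds: TypedDataset[Int] = TypedDataset.create(Seq(1, 2, 3))

// Cache the serialized form in memory only (spills are recomputed, not spilled to disk).
val cached: TypedDataset[Int] = ds.persist(StorageLevel.MEMORY_ONLY_SER)

// ... run several jobs against `cached` ...

// Release the cached blocks when done.
cached.unpersist()
```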
Prints the schema of the underlying Dataset to the console in a nice tree format.
apache/spark
Returns a QueryExecution from this TypedDataset.
It is the primary workflow for executing relational queries using Spark. Designed to allow easy access to the intermediate phases of query execution for developers.
apache/spark
Randomly splits this TypedDataset with the provided weights. Weights for splits will be normalized if they don't sum to 1.
apache/spark
Randomly splits this TypedDataset with the provided weights. Weights for splits will be normalized if they don't sum to 1.
apache/spark
Returns a Java list that contains randomly split TypedDataset with the provided weights. Weights for splits will be normalized if they don't sum to 1.
apache/spark
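The weight normalization can be sketched with weights that deliberately don't sum to 1. Assumes an implicit SparkSession and frameless encoders in scope; the row count is illustrative.

```scala
import frameless.TypedDataset
import frameless.syntax._

val ds: TypedDataset[Long] = TypedDataset.create((1L to 1000L).toSeq)

// Weights 3.0 and 1.0 sum to 4.0, so they are normalized to 0.75 and 0.25:
// roughly three quarters of the rows land in the first split.
val Array(train, test) = ds.randomSplit(Array(3.0, 1.0), seed = 42L)
```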
Converts this TypedDataset to an RDD.
apache/spark
Returns a new TypedDataset that has exactly numPartitions partitions.
apache/spark
Returns a new TypedDataset by sampling a fraction of records.
apache/spark
Returns the schema of this Dataset.
apache/spark
Returns a SparkSession from this TypedDataset.
Returns a SQLContext from this TypedDataset.
Get the TypedDataset's current storage level, or StorageLevel.NONE if not persisted.
apache/spark
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
apache/spark
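Dropping to the untyped world looks like this. A minimal sketch: the Person case class and its values are made up for illustration, and an implicit SparkSession with frameless encoders is assumed.

```scala
import frameless.TypedDataset

// Hypothetical record type for illustration.
case class Person(name: String, age: Int)

val people: TypedDataset[Person] =
  TypedDataset.create(Seq(Person("Ada", 36), Person("Alan", 41)))

// toDF() yields an org.apache.spark.sql.DataFrame of generic Rows.
val df = people.toDF()

df.select("name").show()              // fields addressed by name again
val firstRow = df.head()              // a generic Row
val age = firstRow.getAs[Int]("age")  // ...whose fields are read by name or ordinal
```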
Returns the content of the TypedDataset as a Dataset of JSON strings.
apache/spark
Concise syntax for chaining custom transformations.
apache/spark
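The chaining style this enables can be sketched as follows. The User type and the onlyAdults helper are hypothetical; an implicit SparkSession and frameless encoders are assumed in scope.

```scala
import frameless.TypedDataset
import frameless.syntax._

// Hypothetical record type and pipeline stage for illustration.
case class User(name: String, age: Int)

def onlyAdults(ds: TypedDataset[User]): TypedDataset[User] =
  ds.filter(ds('age) >= 18)

val users: TypedDataset[User] =
  TypedDataset.create(Seq(User("Ada", 36), User("Kid", 9)))

// transform(f) is equivalent to f(users) but reads left-to-right,
// which keeps longer pipelines of reusable stages legible.
val adults: TypedDataset[User] = users.transform(onlyAdults)
```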
Mark the TypedDataset as non-persistent, and remove all blocks for it from memory and disk.
The blocking parameter controls whether to block until all blocks are deleted. apache/spark
Interface for saving the content of the non-streaming TypedDataset out into external storage.
apache/spark
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead
This trait implements TypedDataset methods that have the same signature as their Dataset equivalents. Each method simply forwards the call to the underlying Dataset.
Documentation marked "apache/spark" is thanks to apache/spark Contributors at https://github.com/apache/spark, licensed under Apache v2.0 available at http://www.apache.org/licenses/LICENSE-2.0