trait TypedDatasetForwarded[T] extends AnyRef
This trait implements the TypedDataset methods that have the same signature as their Dataset equivalents. Each method simply forwards the call to the underlying Dataset.
Documentation marked "apache/spark" is thanks to the apache/spark contributors at https://github.com/apache/spark, licensed under Apache v2.0, available at http://www.apache.org/licenses/LICENSE-2.0.
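The short usage sketches added throughout this page share the following minimal setup. It is illustrative only: Person, spark, sqlContext, and ds are hypothetical names, not part of this trait, and the implicits frameless expects can differ slightly between versions.

    import org.apache.spark.sql.{SparkSession, SQLContext}
    import frameless.TypedDataset
    import frameless.syntax._  // provides .run() on frameless jobs

    case class Person(name: String, age: Int)

    val spark: SparkSession =
      SparkSession.builder().master("local[*]").appName("forwarded-sketches").getOrCreate()
    // Some frameless versions resolve an implicit SparkSession here instead of SQLContext.
    implicit val sqlContext: SQLContext = spark.sqlContext

    val ds: TypedDataset[Person] =
      TypedDataset.create(Seq(Person("Ada", 36), Person("Grace", 45), Person("Ada", 36)))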
- Self Type
- TypedDataset[T]
- Inherited
- TypedDatasetForwarded
- AnyRef
- Any
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def cache(): TypedDataset[T]
Persist this TypedDataset with the default storage level (MEMORY_AND_DISK).
apache/spark
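A minimal sketch, reusing the hypothetical ds from the setup above:

    val cached: TypedDataset[Person] = ds.cache()
    cached.collect().run()  // the first action materializes and caches the rows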
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def coalesce(numPartitions: Int): TypedDataset[T]
Returns a new TypedDataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
apache/spark
- def columns: Array[String]
Returns an Array that contains all column names in this TypedDataset.
- def distinct: TypedDataset[T]
Returns a new TypedDataset that contains only the unique elements of this TypedDataset.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
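For example, with the hypothetical ds above, which contains a duplicate row:

    val deduped: TypedDataset[Person] = ds.distinct
    deduped.collect().run()  // two rows remain; the duplicate Person("Ada", 36) is dropped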
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def except(other: TypedDataset[T]): TypedDataset[T]
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT in SQL.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
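A sketch using the hypothetical Person dataset from the setup above:

    val toRemove: TypedDataset[Person] = TypedDataset.create(Seq(Person("Ada", 36)))
    val remaining: TypedDataset[Person] = ds.except(toRemove)  // rows of ds absent from toRemove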
- def explain(extended: Boolean = false): Unit
Prints the plans (logical and physical) to the console for debugging purposes.
apache/spark
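For instance, with the hypothetical ds from the setup above:

    ds.explain()                 // physical plan only
    ds.explain(extended = true)  // parsed, analyzed, optimized, and physical plans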
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def inputFiles: Array[String]
Returns a best-effort snapshot of the files that compose this TypedDataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
apache/spark
- def intersect(other: TypedDataset[T]): TypedDataset[T]
Returns a new TypedDataset that contains only the elements of this TypedDataset that are also present in other.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isLocal: Boolean
Returns true if the collect and take methods can be run locally (without any Spark executors).
apache/spark
- def isStreaming: Boolean
Returns true if this TypedDataset contains one or more sources that continuously return data as it arrives. A TypedDataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
apache/spark
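A hedged sketch of how the flag is typically used to pick a sink; the format and output path are illustrative:

    if (ds.isStreaming)
      ds.writeStream.format("console").start()          // must run as a StreamingQuery
    else
      ds.write.mode("overwrite").parquet("/tmp/people") // ordinary batch write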
- def limit(n: Int): TypedDataset[T]
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
apache/spark
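A sketch with the hypothetical ds from the setup above:

    val firstTwo: TypedDataset[Person] = ds.limit(2)  // lazy: no Spark job runs yet
    firstTwo.collect().run()                          // triggers execution, returns at most 2 rows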
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def persist(newLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): TypedDataset[T]
Persist this TypedDataset with the given storage level.
- newLevel
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
apache/spark
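For example, to keep the hypothetical ds on heap only:

    import org.apache.spark.storage.StorageLevel
    val inMemory: TypedDataset[Person] = ds.persist(StorageLevel.MEMORY_ONLY)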
- def printSchema(): Unit
Prints the schema of the underlying Dataset to the console in a nice tree format.
apache/spark
- def queryExecution: QueryExecution
Returns a QueryExecution from this TypedDataset.
It is the primary workflow for executing relational queries using Spark, designed to allow easy access to the intermediate phases of query execution for developers.
apache/spark
- def randomSplit(weights: Array[Double], seed: Long): Array[TypedDataset[T]]
Randomly splits this TypedDataset with the provided weights. The weights will be normalized if they don't sum to 1.
apache/spark
- def randomSplit(weights: Array[Double]): Array[TypedDataset[T]]
Randomly splits this TypedDataset with the provided weights. The weights will be normalized if they don't sum to 1.
apache/spark
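A sketch with the hypothetical ds from the setup above; Array(8.0, 2.0) would behave identically, since the weights are normalized:

    val Array(train, test) = ds.randomSplit(Array(0.8, 0.2), seed = 42L)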
- def randomSplitAsList(weights: Array[Double], seed: Long): List[TypedDataset[T]]
Returns a Java list that contains the randomly split TypedDatasets with the provided weights. The weights will be normalized if they don't sum to 1.
apache/spark
- def rdd: RDD[T]
Converts this TypedDataset to an RDD.
apache/spark
- def repartition(numPartitions: Int): TypedDataset[T]
Returns a new TypedDataset that has exactly numPartitions partitions.
apache/spark
- def sample(withReplacement: Boolean, fraction: Double, seed: Long = Random.nextLong()): TypedDataset[T]
Returns a new TypedDataset by sampling a fraction of records.
apache/spark
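For instance, to draw roughly ten percent of the hypothetical ds without replacement:

    val tenPercent: TypedDataset[Person] = ds.sample(withReplacement = false, fraction = 0.1, seed = 7L)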
- def schema: StructType
Returns the schema of this Dataset.
apache/spark
- def sparkSession: SparkSession
Returns a SparkSession from this TypedDataset.
- def sqlContext: SQLContext
Returns a SQLContext from this TypedDataset.
- def storageLevel(): StorageLevel
Get the TypedDataset's current storage level, or StorageLevel.NONE if not persisted.
apache/spark
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toDF(): DataFrame
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
apache/spark
- def toJSON: TypedDataset[String]
Returns the content of the TypedDataset as a Dataset of JSON strings.
apache/spark
- def toString(): String
- Definition Classes
- TypedDatasetForwarded → AnyRef → Any
- def transform[U](t: (TypedDataset[T]) => TypedDataset[U]): TypedDataset[U]
Concise syntax for chaining custom transformations.
apache/spark
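A sketch of chaining two hypothetical transformations over the ds from the setup above:

    def dedupe(d: TypedDataset[Person]): TypedDataset[Person] = d.distinct
    def firstTwo(d: TypedDataset[Person]): TypedDataset[Person] = d.limit(2)

    val cleaned: TypedDataset[Person] = ds.transform(dedupe).transform(firstTwo)  // reads left to right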
- def unpersist(blocking: Boolean = false): TypedDataset[T]
Mark the TypedDataset as non-persistent, and remove all blocks for it from memory and disk.
- blocking
Whether to block until all blocks are deleted.
apache/spark
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def write: DataFrameWriter[T]
Interface for saving the content of the non-streaming TypedDataset out into external storage.
apache/spark
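A sketch with the hypothetical ds; the output path is illustrative:

    ds.write.mode("overwrite").json("/tmp/people-json")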
- def writeStream: DataStreamWriter[T]
Interface for saving the content of the streaming Dataset out into external storage.
apache/spark
- object deserialized
Methods on TypedDataset[T] that go through a full serialization and deserialization of T, and execute outside of the Catalyst runtime.
The correct way to do a projection on a single column is to use the select method as follows:

    ds: TypedDataset[(String, String, String)] -> ds.select(ds('_2)).run()

Spark provides an alternative way to obtain the same resulting Dataset, using the map method:

    ds: TypedDataset[(String, String, String)] -> ds.deserialized.map(_._2).run()

This second approach is however substantially slower than the first one, and should be avoided when possible. Indeed, under the hood this map will deserialize the entire Tuple3 to a full JVM object, call the apply method of the _._2 closure on it, and serialize the resulting String back to its Catalyst representation.
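A hedged side-by-side sketch with the hypothetical Person dataset from the setup above; both lines extract the name column, but only the first stays inside Catalyst:

    // Column projection inside Catalyst: only `name` is decoded.
    val namesFast: Seq[String] = ds.select(ds('name)).collect().run()

    // Deserializes every full Person object before projecting, then re-encodes.
    val namesSlow: Seq[String] = ds.deserialized.map(_.name).collect().run()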
Deprecated Value Members
- def filter(func: (T) => Boolean): TypedDataset[T]
- Annotations
- @deprecated
- Deprecated
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead
- def flatMap[U](func: (T) => TraversableOnce[U])(implicit arg0: TypedEncoder[U]): TypedDataset[U]
- Annotations
- @deprecated
- Deprecated
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead
- def map[U](func: (T) => U)(implicit arg0: TypedEncoder[U]): TypedDataset[U]
- Annotations
- @deprecated
- Deprecated
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead
- def mapPartitions[U](func: (Iterator[T]) => Iterator[U])(implicit arg0: TypedEncoder[U]): TypedDataset[U]
- Annotations
- @deprecated
- Deprecated
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead
- def reduceOption[F[_]](func: (T, T) => T)(implicit arg0: SparkDelay[F]): F[Option[T]]
- Annotations
- @deprecated
- Deprecated
(Since version 0.4.0) deserialized methods have moved to a separate section to highlight their runtime overhead