Packages

  • package root
    Definition Classes
    root
  • package org
    Definition Classes
    root
  • package apache
    Definition Classes
    org
  • package spark

    Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

    In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

    Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

    Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

    Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.
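
    For example, a minimal local-mode sketch of the Scala API described above (the application name and sample data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    
    // SparkContext is the entry point; "local[*]" runs Spark in-process.
    val conf = new SparkConf().setAppName("example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    
    // groupByKey comes from PairRDDFunctions, made available on RDD[(String, Int)]
    // through the implicit conversions mentioned above.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    println(pairs.groupByKey().collect().mkString(", "))
    
    sc.stop()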

    Definition Classes
    apache
  • package api
    Definition Classes
    spark
  • package broadcast

    Spark's broadcast variables, used to broadcast immutable datasets to all nodes.
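
    A minimal sketch of typical usage, assuming a SparkContext named sc is already in scope (the lookup table is illustrative):

    // Ship a read-only lookup table to every node once, rather than with every task.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    
    val labelled = sc.parallelize(Seq(1, 2, 3)).map { n =>
      // Executors read the shared value through .value.
      (n, lookup.value.getOrElse(n, "unknown"))
    }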

    Definition Classes
    spark
  • package deploy
    Definition Classes
    spark
  • package executor

    Executor components used with various cluster managers. See org.apache.spark.executor.Executor.

    Definition Classes
    spark
  • package input
    Definition Classes
    spark
  • package internal
    Definition Classes
    spark
  • package io

    IO codecs used for compression. See org.apache.spark.io.CompressionCodec.
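
    The codec is usually selected through configuration rather than instantiated directly; a sketch (the zstd choice is illustrative):

    import org.apache.spark.SparkConf
    
    // Codec used for internally compressed data such as shuffle outputs and broadcast blocks.
    // Other built-in short names include "lz4", "lzf" and "snappy".
    val conf = new SparkConf().set("spark.io.compression.codec", "zstd")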

    Definition Classes
    spark
  • package mapred
    Definition Classes
    spark
  • package memory

    This package implements Spark's memory management system. This system consists of two main components, a JVM-wide memory manager and a per-task manager:

    • org.apache.spark.memory.MemoryManager manages Spark's overall memory usage within a JVM. This component implements the policies for dividing the available memory across tasks and for allocating memory between storage (memory used for caching and data transfer) and execution (memory used by computations, such as shuffles, joins, sorts, and aggregations).
    • org.apache.spark.memory.TaskMemoryManager manages the memory allocated by individual tasks. Tasks interact with TaskMemoryManager and never directly interact with the JVM-wide MemoryManager.

    Internally, each of these components has additional abstractions for memory bookkeeping:

    • org.apache.spark.memory.MemoryConsumers are clients of the TaskMemoryManager and correspond to individual operators and data structures within a task. The TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to consumers in order to trigger spilling when running low on memory.
    • org.apache.spark.memory.MemoryPools are a bookkeeping abstraction used by the MemoryManager to track the division of memory between storage and execution.

    Diagrammatically:

                                                           +---------------------------+
    +-------------+                                        |       MemoryManager       |
    | MemConsumer |----+                                   |                           |
    +-------------+    |    +-------------------+          |  +---------------------+  |
                       +--->| TaskMemoryManager |----+     |  |OnHeapStorageMemPool |  |
    +-------------+    |    +-------------------+    |     |  +---------------------+  |
    | MemConsumer |----+                             |     |                           |
    +-------------+         +-------------------+    |     |  +---------------------+  |
                            | TaskMemoryManager |----+     |  |OffHeapStorageMemPool|  |
                            +-------------------+    |     |  +---------------------+  |
                                                     +---->|                           |
                                     *               |     |  +---------------------+  |
                                     *               |     |  |OnHeapExecMemPool    |  |
    +-------------+                  *               |     |  +---------------------+  |
    | MemConsumer |----+                             |     |                           |
    +-------------+    |    +-------------------+    |     |  +---------------------+  |
                       +--->| TaskMemoryManager |----+     |  |OffHeapExecMemPool   |  |
                            +-------------------+          |  +---------------------+  |
                                                           |                           |
                                                           +---------------------------+

    There is one implementation of org.apache.spark.memory.MemoryManager:

    • org.apache.spark.memory.UnifiedMemoryManager enforces soft boundaries between storage and execution memory, allowing requests for memory in one region to be fulfilled by borrowing memory from the other.
    Definition Classes
    spark
  • package metrics
    Definition Classes
    spark
  • package network
    Definition Classes
    spark
  • package partial

    Support for approximate results. This provides a convenient API and implementation for approximate calculations.
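
    For example, org.apache.spark.rdd.RDD.countApprox returns an org.apache.spark.partial.PartialResult whose estimate tightens as more tasks finish (a sketch, assuming a SparkContext named sc is in scope):

    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    
    // Return whatever estimate is available after at most 500 ms.
    val approx = rdd.countApprox(timeout = 500, confidence = 0.95)
    println(approx.initialValue)   // a BoundedDouble: mean with a [low, high] confidence interval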

    Definition Classes
    spark
    See also

    org.apache.spark.rdd.RDD.countApprox

  • package rdd

    Provides several RDD implementations. See org.apache.spark.rdd.RDD.

    Definition Classes
    spark
  • AsyncRDDActions
  • CoGroupedRDD
  • DoubleRDDFunctions
  • HadoopRDD
  • JdbcRDD
  • NewHadoopRDD
  • OrderedRDDFunctions
  • PairRDDFunctions
  • PartitionCoalescer
  • PartitionGroup
  • PartitionPruningRDD
  • RDD
  • RDDBarrier
  • SequenceFileRDDFunctions
  • ShuffledRDD
  • UnionRDD
  • package resource
    Definition Classes
    spark
  • package scheduler

    Spark's scheduling components. This includes the org.apache.spark.scheduler.DAGScheduler and the lower-level org.apache.spark.scheduler.TaskScheduler.

    Definition Classes
    spark
  • package security
    Definition Classes
    spark
  • package serializer

    Pluggable serializers for RDD and shuffle data.
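
    The serializer is chosen through configuration; for example, switching to Kryo and registering an application class (the Point class is illustrative):

    import org.apache.spark.SparkConf
    
    case class Point(x: Double, y: Double)   // illustrative application class
    
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as compact numeric IDs instead of full class names.
      .registerKryoClasses(Array(classOf[Point]))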

    Definition Classes
    spark
    See also

    org.apache.spark.serializer.Serializer

  • package shuffle
    Definition Classes
    spark
  • package status
    Definition Classes
    spark
  • package storage
    Definition Classes
    spark
  • package unsafe
    Definition Classes
    spark
  • package util

    Spark utilities.

    Definition Classes
    spark

package rdd

Provides several RDD implementations. See org.apache.spark.rdd.RDD.

Linear Supertypes
AnyRef, Any

Type Members

  1. class AsyncRDDActions[T] extends Serializable with Logging

    A set of asynchronous RDD actions available through an implicit conversion.
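
    For example, countAsync submits the job immediately and returns a FutureAction (a sketch, assuming a SparkContext named sc is in scope):

    import scala.concurrent.Await
    import scala.concurrent.duration._
    
    val rdd = sc.parallelize(1 to 1000)
    
    // countAsync comes from AsyncRDDActions via the implicit conversion; the job
    // starts running in the background as soon as it is called.
    val futureCount = rdd.countAsync()
    
    // FutureAction extends scala.concurrent.Future, so the usual combinators apply.
    val count = Await.result(futureCount, 30.seconds)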

  2. class CoGroupedRDD[K] extends RDD[(K, Array[Iterable[_]])]

    :: DeveloperApi :: An RDD that cogroups its parents. For each key k in parent RDDs, the resulting RDD contains a tuple with the list of values for that key.

    Annotations
    @DeveloperApi()
    Note

    This is an internal API. We recommend users use RDD.cogroup(...) instead of instantiating this directly.
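
    In line with the note above, cogrouping is normally done through RDD.cogroup rather than by constructing this class (a sketch, assuming a SparkContext named sc is in scope):

    val scores = sc.parallelize(Seq(("alice", 10), ("bob", 7), ("alice", 3)))
    val cities = sc.parallelize(Seq(("alice", "Oslo"), ("carol", "Lima")))
    
    // For each key, one Iterable of values from each parent RDD.
    val grouped = scores.cogroup(cities)   // RDD[(String, (Iterable[Int], Iterable[String]))]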

  3. class DoubleRDDFunctions extends Logging with Serializable

    Extra functions available on RDDs of Doubles through an implicit conversion.
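
    For example (a sketch, assuming a SparkContext named sc is in scope):

    val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    
    // sum, mean and stdev are added to RDD[Double] by the implicit conversion.
    println(xs.sum())     // 10.0
    println(xs.mean())    // 2.5
    println(xs.stdev())   // population standard deviation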

  4. class HadoopRDD[K, V] extends RDD[(K, V)] with Logging

    :: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

    Annotations
    @DeveloperApi()
    Note

    Instantiating this class directly is not recommended; please use org.apache.spark.SparkContext.hadoopRDD() instead.
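
    Per the note above, a HadoopRDD is normally obtained through SparkContext, e.g. via hadoopFile with an old-API input format (a sketch; the path is illustrative and sc is an existing SparkContext):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    
    // Each record is (byte offset, line), read with the org.apache.hadoop.mapred API.
    val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input.txt")
      .map { case (_, text) => text.toString }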

  5. class JdbcRDD[T] extends RDD[T] with Logging

    An RDD that executes a SQL query on a JDBC connection and reads results. For a usage example, see the test case JdbcRDDSuite.
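
    A sketch of typical construction (the connection URL, query, and bounds are illustrative); the query must contain two '?' placeholders, which Spark binds to each partition's lower and upper bound:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD
    
    val users = new JdbcRDD(
      sc,                                                        // an existing SparkContext
      () => DriverManager.getConnection("jdbc:h2:mem:testdb"),   // opened once per partition
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",    // the two '?' bounds
      lowerBound = 1, upperBound = 10000, numPartitions = 4,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))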

  6. class NewHadoopRDD[K, V] extends RDD[(K, V)] with Logging

    :: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).

    Annotations
    @DeveloperApi()
    Note

    Instantiating this class directly is not recommended; please use org.apache.spark.SparkContext.newAPIHadoopRDD() instead.
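
    As with HadoopRDD, this is normally obtained through SparkContext, here with a new-API input format (a sketch; the path is illustrative and sc is an existing SparkContext):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    
    // Same shape of records as hadoopFile, but using the org.apache.hadoop.mapreduce API.
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input.txt")
      .map { case (_, text) => text.toString }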

  7. class OrderedRDDFunctions[K, V, P <: Product2[K, V]] extends Logging with Serializable

    Extra functions available on RDDs of (key, value) pairs where the key is sortable through an implicit conversion. They will work with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or to override the default ordering. The implicit ordering that is in the closest scope will be used.

    import java.util.Locale
    
    import org.apache.spark.SparkContext._
    
    val rdd: RDD[(String, Int)] = ...
    implicit val caseInsensitiveOrdering: Ordering[String] = new Ordering[String] {
      override def compare(a: String, b: String) =
        a.toLowerCase(Locale.ROOT).compare(b.toLowerCase(Locale.ROOT))
    }
    
    // Sort by key, using the above case insensitive ordering.
    rdd.sortByKey()
  8. class PairRDDFunctions[K, V] extends Logging with Serializable

    Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
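
    For example (a sketch, assuming a SparkContext named sc is in scope):

    val sales = sc.parallelize(Seq(("books", 12), ("music", 5), ("books", 3)))
    val rates = sc.parallelize(Seq(("books", 0.1), ("music", 0.2)))
    
    // reduceByKey and join come from PairRDDFunctions via the implicit conversion.
    val totals = sales.reduceByKey(_ + _)   // RDD[(String, Int)]
    val joined = totals.join(rates)         // RDD[(String, (Int, Double))]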

  9. trait PartitionCoalescer extends AnyRef

    ::DeveloperApi:: A PartitionCoalescer defines how to coalesce the partitions of a given RDD.

    Annotations
    @DeveloperApi()
  10. class PartitionGroup extends AnyRef

    ::DeveloperApi:: A group of Partitions

    Annotations
    @DeveloperApi()
  11. class PartitionPruningRDD[T] extends RDD[T]

    :: DeveloperApi :: An RDD used to prune RDD partitions so we can avoid launching tasks on all partitions. An example use case: If we know the RDD is partitioned by range, and the execution DAG has a filter on the key, we can avoid launching tasks on partitions that don't have the range covering the key.

    Annotations
    @DeveloperApi()
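
    The companion object's create method is the usual way to build one; for example, computing only the first few partitions (a sketch, assuming an existing RDD named rdd):

    import org.apache.spark.rdd.PartitionPruningRDD
    
    // Tasks are launched only for partitions whose index passes the predicate.
    val pruned = PartitionPruningRDD.create(rdd, partitionIndex => partitionIndex < 4)
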
  12. abstract class RDD[T] extends Serializable with Logging

    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

    Internally, each RDD is characterized by five main properties:

    • A list of partitions
    • A function for computing each split
    • A list of dependencies on other RDDs
    • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

    All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Please refer to the Spark paper for more details on RDD internals.
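
    For example, the basic operations and persist in use (a sketch, assuming a SparkContext named sc is in scope):

    import org.apache.spark.storage.StorageLevel
    
    val nums = sc.parallelize(1 to 100)
    
    // Transformations are lazy; nothing runs until an action such as count is called.
    val evensSquared = nums.filter(_ % 2 == 0).map(n => n * n)
    
    // Cache the result so later actions reuse it instead of recomputing the lineage.
    evensSquared.persist(StorageLevel.MEMORY_ONLY)
    println(evensSquared.count())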

  13. class RDDBarrier[T] extends AnyRef

    :: Experimental :: Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together. org.apache.spark.rdd.RDDBarrier instances are created by org.apache.spark.rdd.RDD#barrier.

    Annotations
    @Experimental() @Since( "2.4.0" )
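
    For example, all tasks of a barrier stage start together and can synchronize through BarrierTaskContext (a sketch, assuming a SparkContext named sc is in scope):

    import org.apache.spark.BarrierTaskContext
    
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    
    // The four tasks of this stage are launched together; barrier() blocks until
    // every task in the stage has reached it.
    val synced = rdd.barrier().mapPartitions { iter =>
      BarrierTaskContext.get().barrier()
      iter
    }
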
  14. class SequenceFileRDDFunctions[K, V] extends Logging with Serializable

    Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.

    Note

    This can't be part of PairRDDFunctions because we need more implicit parameters to convert our keys and values to Writable.
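
    For example (a sketch; the output path is illustrative, and Int and String keys/values are convertible to Writables):

    val pairs = sc.parallelize(Seq((1, "one"), (2, "two")))
    
    // saveAsSequenceFile is added to RDDs of Writable-convertible (key, value) pairs.
    pairs.saveAsSequenceFile("hdfs:///tmp/pairs.seq")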

  15. class ShuffledRDD[K, V, C] extends RDD[(K, C)]

    :: DeveloperApi :: The resulting RDD from a shuffle (e.g. repartitioning of data).

    K: the key class.
    V: the value class.
    C: the combiner class.

    Annotations
    @DeveloperApi()
  16. class UnionRDD[T] extends RDD[T]
    Annotations
    @DeveloperApi()

Value Members

  1. object JdbcRDD extends Serializable
  2. object PartitionPruningRDD extends Serializable
    Annotations
    @DeveloperApi()
  3. object RDD extends Serializable

    Defines implicit functions that provide extra functionalities on RDDs of specific types.

    For example, RDD.rddToPairRDDFunctions converts an RDD into a PairRDDFunctions for key-value-pair RDDs, enabling extra functionality such as PairRDDFunctions.reduceByKey.

  4. object UnionRDD extends Serializable
