Exposes a .size method to RDDs that mimics RDD.count, but caches the result.
It also exposes .sizes and .total on Seq[RDD]s and 2- to 4-tuples of RDDs, which compute the constituent RDDs'
sizes (per above) in one Spark job.
Additionally, all the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes
(again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's
total.
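For example, a minimal sketch of the UnionRDD optimization (assuming a SparkContext sc is in scope; a, b, and
union are illustrative names, not part of the API):
import org.hammerlab.magic.rdd.size._
val a = sc.parallelize(0 until 10)
val b = sc.parallelize(0 until 20)
val union = a.union(b) // a UnionRDD, since neither input has a partitioner
union.size // 30; one job also computes and caches a's and b's sizes
a.size // 10; runs no jobs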
Cached-size info is keyed by an org.apache.spark.SparkContext as well as each RDD's ID, for robustness in apps
that stop their org.apache.spark.SparkContext and then resume with a new one; this is especially useful for
testing.
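A sketch of the restart scenario (sc and conf are an assumed running org.apache.spark.SparkContext and its
SparkConf; the point is that entries cached under the old context can't collide with the new one):
sc.stop()
val sc2 = new org.apache.spark.SparkContext(conf)
val rdd = sc2.parallelize(0 until 4)
rdd.size // 4; runs one job under the new context, even if this RDD's ID repeats an old one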
Usage:
import org.hammerlab.magic.rdd.size._
val rdd1 = sc.parallelize(0 until 4)
val rdd2 = sc.parallelize("a" :: "b" :: Nil)
rdd1.size // 4; runs one job
(rdd1, rdd2).sizes // (4, 2); runs one job
(rdd1, rdd2).total // 6; runs no jobs
rdd2.size // 2; runs no jobs
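Seq[RDD]s should work analogously (a sketch, assuming the Seq forms mirror the tuple forms above; both sizes
are already cached at this point):
Seq(rdd1, rdd2).sizes // Seq(4, 2); runs no jobs
Seq(rdd1, rdd2).total // 6; runs no jobs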