Count an RDD and cache the result, keyed by RDD id.
Compute multiple RDDs' sizes with a single Spark job; cache and return these sizes separately.
Along the way, this method will construct a UnionRDD from all non-cached, non-Union RDDs.
CachedCountRegistry adds a `.size` method to RDDs that mimics `RDD.count` but caches its result. It also exposes `.sizes` and `.total` on `Seq[RDD]`s, `Tuple2[RDD]`s, and `Tuple3[RDD]`s, which compute the constituent RDDs' sizes (per above) in one Spark job. Additionally, all of the above APIs optimize computations on `UnionRDD`s by computing their component RDDs' sizes and caching those as well as the `UnionRDD`'s total.
Cached `size` info is keyed by SparkContext for robustness in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing!

Usage: