Package

org.hammerlab.magic.rdd

size

package size

Adds a .size method to RDDs that behaves like RDD.count but caches the result.

It also exposes .sizes and .total on Seq[RDD]s and on 2- to 4-tuples of RDDs; these compute the constituent RDDs' sizes (as above) in a single Spark job.

Additionally, all the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes (again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's total.
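The UnionRDD behavior described above can be modeled without Spark. The following is only a minimal sketch under stated assumptions: Node, Leaf, Union, and UnionSizeSketch are hypothetical stand-ins, not types from this package or from Spark, and the sketch makes no attempt to reproduce the single-job batching. It shows only the caching pattern, where sizing a union also records the size of each component.

```scala
import scala.collection.mutable

// Hypothetical model types; these are NOT Spark's RDD or UnionRDD.
sealed trait Node { def id: Int }
final case class Leaf(id: Int, data: Seq[Any]) extends Node
final case class Union(id: Int, parts: Seq[Node]) extends Node

object UnionSizeSketch {
  // Cached sizes, keyed by node ID.
  val cache = mutable.Map.empty[Int, Long]
  // Number of leaves that were actually counted (a stand-in for real work).
  var leafCounts = 0

  def size(node: Node): Long = cache.get(node.id) match {
    case Some(n) => n // already known: no work needed
    case None =>
      val n = node match {
        case Leaf(_, data) =>
          leafCounts += 1
          data.size.toLong
        // A union's size is the sum of its parts; each part is cached on the way.
        case Union(_, parts) => parts.map(size).sum
      }
      cache(node.id) = n
      n
  }
}
```

In this model, sizing a union of two leaves populates three cache entries (both leaves and the union), so a later request for either leaf's size does no further counting.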

Cached-size info is keyed by an org.apache.spark.SparkContext, as well as by each RDD's ID, for robustness in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing.
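To illustrate why the context is part of the key, here is a minimal sketch of a two-level cache. It assumes that RDD IDs can repeat across contexts (for example, when a new context starts numbering again), so an ID alone would be ambiguous. Ctx, FakeRDD, and SizeCacheSketch are hypothetical stand-ins so the sketch runs without Spark; this is not the package's actual implementation.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for SparkContext and RDD; not Spark types.
final class Ctx // compared by reference, so each instance is a distinct key
final case class FakeRDD(ctx: Ctx, id: Int, data: Seq[Any])

object SizeCacheSketch {
  // Outer key: the context. Inner key: the RDD's ID within that context.
  private val caches = mutable.Map.empty[Ctx, mutable.Map[Int, Long]]
  // Counts how often a size was actually computed (a stand-in for Spark jobs).
  var computations = 0

  def size(rdd: FakeRDD): Long = {
    val cache = caches.getOrElseUpdate(rdd.ctx, mutable.Map.empty[Int, Long])
    cache.getOrElseUpdate(rdd.id, {
      computations += 1
      rdd.data.size.toLong
    })
  }
}
```

With this layout, two stand-in RDDs that share ID 0 but belong to different contexts get separate cache entries, so a stale size from a stopped context is never returned for an RDD in a new one.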

Usage:

import org.hammerlab.magic.rdd.size._
val rdd1 = sc.parallelize(0 until 4)
val rdd2 = sc.parallelize("a" :: "b" :: Nil)
rdd1.size           // 4; runs one job
(rdd1, rdd2).sizes  // (4, 2); runs one job
(rdd1, rdd2).total  // 6; runs no jobs
rdd2.size           // 2; runs no jobs
Linear Supertypes
MultiRDDCache[Any, Long], RDDCache[Any, Long], AnyRef, Any

Type Members

  1. class HasMultiRDDSize extends AnyRef

  2. implicit class MultiRDDSize extends AnyRef

  3. implicit class SingleRDDSize extends AnyRef

  4. implicit class Tuple2RDDSize extends HasMultiRDDSize

  5. implicit class Tuple3RDDSize extends HasMultiRDDSize

  6. implicit class Tuple4RDDSize extends HasMultiRDDSize


Abstract Value Members

  1. abstract def compute(rdd: RDD[T]): U

    Attributes
    protected
    Definition Classes
    RDDCache

Concrete Value Members

  1. def apply(rdd: RDD[T]): U

    Definition Classes
    RDDCache
  2. def contains(rdd: RDD[T]): Boolean

    Attributes
    protected
    Definition Classes
    RDDCache
  3. def getCache(implicit sc: SparkContext): Map[Int, U]

    Get current cache state; exposed for testing.

    Definition Classes
    RDDCache
  4. implicit def unwrapRDD(rdd: RDD[_]): (SparkContext, Int)

    Attributes
    protected
    Definition Classes
    RDDCache
  5. final def update(rdd: RDD[T], value: U): Unit

    Attributes
    protected
    Definition Classes
    RDDCache
  6. final def update(pair: (RDD[T], U)): Unit

    Attributes
    protected
    Definition Classes
    RDDCache
