Exposes a .size method to RDDs that mimics RDD.count, but caches the result.
It also exposes .sizes and .total on Seq[RDD]s and 2- to 4-tuples of RDDs, which compute the constituent RDDs'
sizes (per above) in one Spark job.
Additionally, all the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes
(again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's
total.
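For example, a minimal sketch of the UnionRDD optimization (assuming a SparkContext sc is in scope; a, b, and
union are illustrative names, not part of the API):
import org.hammerlab.magic.rdd.size._
val a = sc.parallelize(0 until 10)
val b = sc.parallelize(0 until 20)
val union = a.union(b) // a UnionRDD, since neither input has a partitioner
union.size // 30; one job also computes and caches a's and b's sizes
a.size // 10; runs no jobs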
Cached-size info is keyed by an org.apache.spark.SparkContext as well as each RDD's ID, for robustness in apps
that stop their org.apache.spark.SparkContext and then resume with a new one; this is especially useful for
testing.
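A sketch of the restart scenario (sc and conf are an assumed running org.apache.spark.SparkContext and its
SparkConf; the point is that entries cached under the old context can't collide with the new one):
sc.stop()
val sc2 = new org.apache.spark.SparkContext(conf)
val rdd = sc2.parallelize(0 until 4)
rdd.size // 4; runs one job under the new context, even if this RDD's ID repeats an old one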
Usage:
import org.hammerlab.magic.rdd.size._
val rdd1 = sc.parallelize(0 until 4)
val rdd2 = sc.parallelize("a" :: "b" :: Nil)
rdd1.size // 4; runs one job
(rdd1, rdd2).sizes // (4, 2); runs one job
(rdd1, rdd2).total // 6; runs no jobs
rdd2.size // 2; runs no jobs
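Seq[RDD]s should work analogously (a sketch, assuming the Seq forms mirror the tuple forms above; both sizes
are already cached at this point):
Seq(rdd1, rdd2).sizes // Seq(4, 2); runs no jobs
Seq(rdd1, rdd2).total // 6; runs no jobs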