Count an RDD and cache the result, keyed by RDD id.
Compute multiple RDDs' sizes with a single Spark job; cache and return these sizes separately.
Along the way, this method will construct a UnionRDD from all non-cached, non-Union RDDs.
CachedCountRegistry adds a `.size` method to RDDs that mimics `RDD.count` but caches its result. It also exposes `.sizes` and `.total` on `Seq[RDD]`s, `Tuple2[RDD]`s, and `Tuple3[RDD]`s, which compute the constituent RDDs' sizes (per above) in one Spark job. Additionally, all of the above APIs optimize computations on `UnionRDD`s by computing their component RDDs' sizes and caching those as well as the `UnionRDD`'s total.
Cached `size` info is keyed by SparkContext for robustness in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing!

Usage: