CachedCountRegistry adds a .size method to RDDs that mimics RDD.count, but caches its result.
Hang an iff method off of RDDs, as a small bit of syntactic sugar.
A Spark Partitioner that maps each element to the partition indicated by an Int, which is either the key itself or the first element of a tuple.
Some helpers for repartitioning an RDD while retaining the order of its elements.
Helper for run-length encoding an RDD.
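
As a rough illustration of what run-length encoding produces, here is a minimal sketch on a local Seq rather than an RDD. The names RunLengthSketch and encode are hypothetical and are not this library's API; a distributed version would also have to merge runs that span partition boundaries, which this sketch does not attempt.

```scala
object RunLengthSketch {
  // Collapse each run of consecutive equal elements into an
  // (element, run length) pair. Local-collection sketch only;
  // this is not the library's RDD implementation.
  def encode[T](xs: Seq[T]): List[(T, Int)] =
    xs.foldLeft(List.empty[(T, Int)]) {
      // Same element as the current run: extend that run.
      case ((prev, n) :: rest, x) if prev == x => (prev, n + 1) :: rest
      // Different element (or empty accumulator): start a new run.
      case (acc, x) => (x, 1) :: acc
    }.reverse
}
```

For example, RunLengthSketch.encode(Seq("a", "a", "b", "a")) yields List(("a", 2), ("b", 1), ("a", 1)): equal elements are merged only when they are adjacent.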
CachedCountRegistry adds a .size method to RDDs that mimics RDD.count, but caches its result.
It also exposes .sizes and .total on Seq[RDD]s, which compute the constituent RDDs' sizes (per above) in one Spark job.
Additionally, both sets of APIs optimize computations on UnionRDDs by computing their component RDDs' sizes and caching those as well as the UnionRDD's total.
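
The caching idea can be modeled without Spark. The sketch below is a toy under stated assumptions: Ctx, Dataset, Leaf, Union, and CountCache are hypothetical stand-ins (for a SparkContext, an RDD, a UnionRDD, and the registry), not the library's types or API. It shows only the memoization: a union's size is the sum of its components' sizes, every component size is cached along with the total, and each context gets its own cache. It does not model computing several sizes in one Spark job.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a SparkContext; compared by reference,
// so a "restarted" context is a distinct cache key.
final class Ctx(val name: String)

// Hypothetical stand-ins for an RDD and a UnionRDD.
sealed trait Dataset { def id: Int; def ctx: Ctx }
final case class Leaf(id: Int, ctx: Ctx, elems: Seq[Int]) extends Dataset
final case class Union(id: Int, ctx: Ctx, parts: Seq[Dataset]) extends Dataset

object CountCache {
  // One cache per context: sizes recorded under one context are
  // never served to datasets that belong to another.
  private val caches = mutable.Map.empty[Ctx, mutable.Map[Int, Long]]

  // How many leaves were actually counted (for demonstration only).
  var computations = 0

  def size(ds: Dataset): Long = {
    val cache = caches.getOrElseUpdate(ds.ctx, mutable.Map.empty[Int, Long])
    cache.get(ds.id) match {
      case Some(n) => n // cache hit: nothing is recomputed
      case None =>
        val n = ds match {
          case Leaf(_, _, elems) =>
            computations += 1
            elems.size.toLong
          // A union's size is the sum of its parts' sizes; the
          // recursive calls cache each part as a side effect.
          case Union(_, _, parts) => parts.map(size).sum
        }
        cache(ds.id) = n
        n
    }
  }
}

object CountCacheDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new Ctx("app")
    val a = Leaf(1, ctx, Seq(1, 2, 3))
    val b = Leaf(2, ctx, Seq(4, 5))
    val u = Union(3, ctx, Seq(a, b))
    println(CountCache.size(u)) // counts a and b once each
    println(CountCache.size(a)) // answered from the cache
  }
}
```

In this toy model, asking for the union's size first means later requests for either component are answered from the cache, which mirrors the UnionRDD behavior described above.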
Cached size info is keyed by a SparkContext for robustness in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing!
Usage: