Hang an iff method off of RDDs, as a small bit of syntactic sugar.
Some helpers for repartitioning an RDD while retaining the order of its elements.
Helper for run-length encoding an RDD.
Exposes a .size method on RDDs that mimics RDD.count, but caches the result.
It also exposes .sizes and .total on Seq[RDD]s and on 2- to 4-tuples of RDDs; these compute the constituent RDDs' sizes (per above) in one Spark job.
Additionally, all of the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes (again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's total.
Cached-size info is keyed by an org.apache.spark.SparkContext as well as by each RDD's ID. This keeps the cache robust in apps that stop their SparkContext and then resume with a new one, where RDD IDs may repeat; this is especially useful for testing.
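The keying scheme described above can be illustrated with a minimal sketch that does not depend on Spark. FakeContext, FakeRDD, and SizeCache are hypothetical stand-ins invented for this illustration, not the library's actual classes, and the real implementation may differ. The point shown is only that a cache keyed by both the context and the RDD ID does not return a stale size when a new context reuses an ID.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for SparkContext and RDD; not Spark's real classes.
// FakeContext is a plain class, so two instances are distinct map keys.
final class FakeContext
final case class FakeRDD(ctx: FakeContext, id: Int, elems: Seq[Int])

object SizeCache {
  // Counts how often a size was actually computed (a stand-in for Spark jobs run).
  var computations = 0

  // Outer key: the context instance; inner key: the RDD's ID within that context.
  private val cache =
    mutable.Map.empty[FakeContext, mutable.Map[Int, Long]]

  def size(rdd: FakeRDD): Long = {
    val perContext =
      cache.getOrElseUpdate(rdd.ctx, mutable.Map.empty[Int, Long])
    // Compute the size only on a cache miss for this (context, ID) pair.
    perContext.getOrElseUpdate(rdd.id, {
      computations += 1
      rdd.elems.size.toLong
    })
  }
}
```

In this sketch, calling SizeCache.size twice on the same FakeRDD computes the size once, while a FakeRDD with the same ID under a fresh FakeContext triggers a new computation rather than reusing the old entry.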
Usage: