rdd

class CachedCountRegistry extends AnyRef

CachedCountRegistry adds a .size method to RDDs that mimics RDD.count, but caches its result.
It also exposes .sizes and .total on Seq[RDD]s, which compute the constituent RDDs' sizes (per above) in one Spark job.
Additionally, both sets of APIs optimize computations on UnionRDDs by computing their component RDDs' sizes and caching those as well as the UnionRDD's total.
Cached size info is keyed by a SparkContext for robustness in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing!
Usage:
```
import org.hammerlab.magic.rdd.CachedCountRegistry._
val rdd1 = sc.parallelize(0 until 4)
val rdd2 = sc.parallelize("a" :: "b" :: Nil)
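rdd1.size()                    // counts rdd1's 4 elements; the result is cached for later calls
(rdd1 :: rdd2 :: Nil).sizes()  // sizes of both RDDs (4 and 2), computed in one Spark job
(rdd1 :: rdd2 :: Nil).total()  // sum of those sizes (6)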
```
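The caching idea described above can be illustrated with a minimal sketch. This is not the library's implementation: the object name SizeCacheSketch and its layout are hypothetical, and it omits the Seq[RDD] and UnionRDD optimizations. It assumes that RDD ids are only unique within one SparkContext, which is a likely reason for keying the cache by context first.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical sketch (not the library's code): a per-SparkContext cache of RDD counts.
object SizeCacheSketch {
  // Outer key: the SparkContext that owns the RDD.
  // Inner key: the RDD's id, which is assumed unique only within that context.
  private val cache = mutable.Map.empty[SparkContext, mutable.Map[Int, Long]]

  // Returns rdd.count(), running the Spark job only on the first call for a given RDD.
  def size(rdd: RDD[_]): Long = synchronized {
    val perContext = cache.getOrElseUpdate(rdd.sparkContext, mutable.Map.empty[Int, Long])
    perContext.getOrElseUpdate(rdd.id, rdd.count())
  }
}
```

With this layout, an RDD created under a fresh SparkContext cannot collide with a stale entry that happens to carry the same id under a stopped context.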
class IfRDD[T] extends Serializable

Hang an iff method off of RDDs, as a small bit of syntactic sugar.
case class KeyPartitioner(numPartitions: Int) extends Partitioner with Product with Serializable

Spark Partitioner that maps elements to a partition indicated by an Int that either is the key, or is the first element of a tuple.
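As a rough illustration of the behavior described above, a partitioner of this kind might look like the following sketch. The name KeyPartitionerSketch is hypothetical, and both the restriction to 2-tuples and the error handling for unsupported keys are assumptions rather than the library's documented behavior.

```scala
import org.apache.spark.Partitioner

// Hypothetical stand-in for KeyPartitioner; name and error handling are assumptions.
case class KeyPartitionerSketch(numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    // The key itself names the target partition.
    case i: Int => i
    // The first element of a pair names the target partition.
    case (i: Int, _) => i
    // Anything else is rejected in this sketch.
    case other => throw new IllegalArgumentException(s"Unsupported key: $other")
  }
}
```

A pair RDD whose keys are already valid partition indices could then be shuffled with something like `pairs.partitionBy(KeyPartitionerSketch(4))`. The sketch does not check that the index falls within 0 until numPartitions, so callers would need to guarantee that themselves.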
class OrderedRepartitionRDD[T] extends Serializable

Some helpers for repartitioning an RDD while retaining the order of its elements.
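One way to repartition while retaining element order, sketched here with standard Spark calls only, is to tag each element with its global position and range-partition on that position. This is an assumption about a workable approach, not a description of how OrderedRepartitionRDD is implemented; the object and method names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch: order-preserving repartitioning via positional keys.
object OrderedRepartitionSketch {
  def orderedRepartition[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
    rdd
      .zipWithIndex()                            // pair each element with its global position
      .map(_.swap)                               // key by position: (position, element)
      .sortByKey(numPartitions = numPartitions)  // range-partition and sort by position
      .values                                    // drop the positional keys again
}
```

Because sortByKey uses sampled range boundaries, the resulting partitions are only roughly equal in size, and the approach costs an extra job for zipWithIndex plus a full shuffle.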
class RunLengthRDD[T] extends AnyRef

Helper for run-length encoding an RDD.
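For readers unfamiliar with the term, run-length encoding collapses each run of consecutive equal elements into an (element, count) pair. The sketch below shows that logic on a plain Scala Iterator; the names are hypothetical, and it deliberately ignores the distributed part of the problem (runs that straddle RDD partition boundaries), which the source does not describe.

```scala
// Hypothetical sketch of run-length encoding over a local iterator.
object RunLengthSketch {
  // Collapses each run of consecutive equal elements into (element, run length).
  def runLengthEncode[T](elems: Iterator[T]): Iterator[(T, Int)] =
    new Iterator[(T, Int)] {
      // A buffered iterator lets us peek at the next element without consuming it.
      private val it = elems.buffered

      def hasNext: Boolean = it.hasNext

      def next(): (T, Int) = {
        val elem = it.next()
        var count = 1
        // Consume every following element that equals the current one.
        while (it.hasNext && it.head == elem) {
          it.next()
          count += 1
        }
        (elem, count)
      }
    }
}
```

For example, `RunLengthSketch.runLengthEncode("aaabbc".iterator).toList` yields `List(('a', 3), ('b', 2), ('c', 1))`. Applying such a function per partition would still leave a run split wherever it crosses a partition boundary, so an RDD-level helper has to merge adjacent partial runs in some way.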