Package

org.hammerlab.magic

rdd

package rdd

Type Members

class IfRDD[T] extends Serializable

Hang an iff method off of RDDs, as a small bit of syntactic sugar.
class OrderedRepartitionRDD[T] extends Serializable

Some helpers for repartitioning an RDD while retaining the order of its elements.
class RunLengthRDD[T] extends AnyRef

Helper for run-length encoding an RDD.
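
As a rough illustration of what run-length encoding produces, here is a minimal sketch on a local Seq in plain Scala. It is not RunLengthRDD's implementation, and `RunLengthSketch` and `runLengthEncode` are hypothetical names; a helper working on RDDs would additionally have to merge runs that span partition boundaries, which this sketch does not attempt.

```scala
// Hypothetical helper for illustration; not part of org.hammerlab.magic.rdd.
object RunLengthSketch {
  // Collapse each run of consecutive equal elements into an (element, run length) pair.
  def runLengthEncode[T](elems: Seq[T]): List[(T, Int)] =
    elems.foldLeft(List.empty[(T, Int)]) {
      // Same element as the current run: extend that run by one.
      case ((prev, n) :: rest, elem) if prev == elem => (prev, n + 1) :: rest
      // Otherwise start a new run of length 1.
      case (acc, elem) => (elem, 1) :: acc
    }.reverse // runs were accumulated newest-first
}
```

For example, `RunLengthSketch.runLengthEncode(Seq("a", "a", "b", "a"))` yields one pair per run, in order, so the trailing "a" is reported separately from the leading run of two.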

Value Members

object IfRDD extends Serializable
object OrderedRepartitionRDD extends Serializable
object RunLengthRDD
package cache
package cmp
package collect
package grid
package keyed
package partitions
package rev
package scan
package serde
package size

Exposes a .size method to RDDs that mimics RDD.count, but caches the result.
It also exposes .sizes and .total on Seq[RDD]s and on 2- to 4-tuples of RDDs; these compute the constituent RDDs' sizes (per above) in one Spark job.
Additionally, all the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes (again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's total.
Cached-size info is keyed by an org.apache.spark.SparkContext, as well as by each RDD's ID, for robustness in apps that stop their org.apache.spark.SparkContext and then resume with a new one; this is especially useful for testing.
Usage:
```
import org.hammerlab.magic.rdd.size._
// `sc` is assumed to be an active SparkContext
val rdd1 = sc.parallelize(0 until 4)
val rdd2 = sc.parallelize("a" :: "b" :: Nil)
rdd1.size           // 4; runs one job
(rdd1, rdd2).sizes  // (4, 2); runs one job
(rdd1, rdd2).total  // 6; runs no jobs
rdd2.size           // 2; runs no jobs
```
package sliding
package sort
package zip
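
The caching behavior described for package size above (sizes keyed by context and RDD ID, with several RDDs counted in one job) can be sketched without Spark. This is a minimal model under stated assumptions, not the package's actual implementation: `FakeContext`, `FakeRDD`, `SizeCacheSketch` and `jobsRun` are hypothetical stand-ins, a "job" is simulated by a counter, and the UnionRDD optimization is not modeled.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; not Spark or magic-rdds types.
final case class FakeContext(id: Int)
final case class FakeRDD(context: FakeContext, id: Int, elems: Seq[Int])

object SizeCacheSketch {
  // Cached sizes, keyed by (context ID, RDD ID) so that a new context
  // reusing an RDD ID does not pick up a stale entry.
  private val cache = mutable.Map.empty[(Int, Int), Long]

  // Number of simulated "jobs" run so far.
  var jobsRun = 0

  private def key(rdd: FakeRDD): (Int, Int) = (rdd.context.id, rdd.id)

  // Sizes of all given RDDs; uncached ones are counted together in one simulated job.
  def sizes(rdds: Seq[FakeRDD]): Seq[Long] = {
    val missing = rdds.filterNot(r => cache.contains(key(r))).distinct
    if (missing.nonEmpty) {
      jobsRun += 1
      missing.foreach(r => cache.update(key(r), r.elems.size.toLong))
    }
    rdds.map(r => cache(key(r)))
  }

  def size(rdd: FakeRDD): Long = sizes(Seq(rdd)).head

  def total(rdds: Seq[FakeRDD]): Long = sizes(rdds).sum
}
```

Mirroring the usage example above: after `size(rdd1)` has run one simulated job, `sizes(Seq(rdd1, rdd2))` runs one more job (only for `rdd2`), and subsequent `total` or `size` calls on those RDDs run none. An RDD with the same ID under a different `FakeContext` is counted afresh.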
