Package

org.hammerlab.magic

rdd

package rdd

Type Members

  1. class IfRDD[T] extends Serializable

    Hang an iff method off of RDDs, as a small bit of syntactic sugar.

  2. class OrderedRepartitionRDD[T] extends Serializable

    Some helpers for repartitioning an RDD while retaining the order of its elements.

  3. class RunLengthRDD[T] extends AnyRef

    Helper for run-length encoding an RDD.
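For readers unfamiliar with the term, run-length encoding collapses each run of equal, adjacent elements into an (element, count) pair. The following is a minimal sketch on a local Seq, not RunLengthRDD's actual implementation; the names RunLengthSketch and runLengthEncode are hypothetical. A distributed version would additionally have to merge runs that span partition boundaries, which this sketch does not attempt.

```scala
// Hypothetical local illustration of run-length encoding; not part of
// org.hammerlab.magic.rdd.
object RunLengthSketch {

  // Collapse each run of equal, adjacent elements into (element, count).
  def runLengthEncode[T](elems: Seq[T]): List[(T, Long)] =
    elems
      .foldLeft(List.empty[(T, Long)]) {
        // Same element as the most recent run: extend that run.
        case ((last, n) :: rest, elem) if last == elem => (last, n + 1) :: rest
        // Otherwise start a new run of length 1.
        case (acc, elem) => (elem, 1L) :: acc
      }
      .reverse // runs were accumulated most-recent-first

  def main(args: Array[String]): Unit = {
    val encoded = runLengthEncode(Seq("a", "a", "b", "a", "a", "a"))
    println(encoded) // List((a,2), (b,1), (a,3))
  }
}
```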

Value Members

  1. object IfRDD extends Serializable

  2. object OrderedRepartitionRDD extends Serializable

  3. object RunLengthRDD

  4. package cache

  5. package cmp

  6. package collect

  7. package grid

  8. package keyed

  9. package partitions

  10. package rev

  11. package scan

  12. package serde

  13. package size

    Exposes a .size method to RDDs that mimics RDD.count, but caches the result.

    It also exposes .sizes and .total on Seq[RDD]s and on 2- to 4-tuples of RDDs; these compute the constituent RDDs' sizes (as above) in one Spark job.

    Additionally, all the above APIs optimize computations on UnionRDDs by computing their component RDDs' sizes (again in just one job, along with any other RDDs being operated on) and caching those as well as the UnionRDD's total.

    Cached sizes are keyed by org.apache.spark.SparkContext as well as by each RDD's ID, so they stay correct in apps that stop their SparkContext and then resume with a new one; this is especially useful for testing.

    Usage:

    import org.hammerlab.magic.rdd.size._
    val rdd1 = sc.parallelize(0 until 4)
    val rdd2 = sc.parallelize("a" :: "b" :: Nil)
    rdd1.size           // 4; runs one job
    (rdd1, rdd2).sizes  // (4, 2); runs one job
    (rdd1, rdd2).total  // 6; runs no jobs
    rdd2.size           // 2; runs no jobs
  14. package sliding

  15. package sort

  16. package zip

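The size package's description says cached sizes are keyed by SparkContext as well as by RDD ID. The sketch below illustrates why that two-level key matters, on the assumption that RDD IDs are assigned per context, so a fresh context can reuse an ID. It is not the library's implementation: FakeContext, FakeRDD and SizeCacheSketch are hypothetical stand-ins written in plain Scala so the example runs without Spark.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a SparkContext; compared by identity.
final class FakeContext(val name: String)

// Hypothetical stand-in for an RDD; count() records how many "jobs" ran.
final class FakeRDD(val context: FakeContext, val id: Int, elems: Seq[Int]) {
  var jobsRun = 0
  def count(): Long = { jobsRun += 1; elems.size.toLong }
}

object SizeCacheSketch {
  // One inner map per context, keyed by RDD ID.
  private val cache =
    mutable.Map.empty[FakeContext, mutable.Map[Int, Long]]

  // Return the cached size, computing it with count() only on a miss.
  def size(rdd: FakeRDD): Long =
    cache
      .getOrElseUpdate(rdd.context, mutable.Map.empty[Int, Long])
      .getOrElseUpdate(rdd.id, rdd.count())

  def main(args: Array[String]): Unit = {
    val ctx1 = new FakeContext("first")
    val a = new FakeRDD(ctx1, 0, 0 until 4)
    println(size(a))   // 4; runs one "job"
    println(size(a))   // 4; served from the cache
    println(a.jobsRun) // 1

    // A new context reuses ID 0; keying by context avoids a stale hit.
    val ctx2 = new FakeContext("second")
    val b = new FakeRDD(ctx2, 0, Seq(1, 2))
    println(size(b))   // 2
  }
}
```

Keying by ID alone would have returned 4 for the second RDD in this sketch, since both share ID 0.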
