Wrap an RDD and expose a cappedGroupByKey method, which behaves like org.apache.spark.rdd.PairRDDFunctions.groupByKey but with a cap on the number of values that will be accumulated for each key.
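A minimal sketch of how such a cap could be implemented on top of Spark's existing combineByKey primitive; the helper name and call shape here are assumptions for illustration, not the library's actual API:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch: like groupByKey, but no per-key buffer grows
// beyond `cap` elements, bounding memory during the shuffle.
def cappedGroupByKey[K: ClassTag, V: ClassTag](
  rdd: RDD[(K, V)],
  cap: Int
): RDD[(K, Seq[V])] =
  rdd.combineByKey[Vector[V]](
    (v: V) => Vector(v),                        // create a buffer for a new key
    (buf: Vector[V], v: V) =>
      if (buf.length < cap) buf :+ v else buf,  // stop appending once at the cap
    (l: Vector[V], r: Vector[V]) =>
      (l ++ r).take(cap)                        // merge cross-partition buffers, re-cap
  ).mapValues(_.toSeq)
```

Because the map-side combiners also respect the cap, the amount of data shuffled per key is bounded as well, which is the main advantage over calling groupByKey and truncating afterwards.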
Add splitByKey method to any RDD of pairs: returns a Map from each key (K) to an RDD[V] with all the values that had that key in the original RDD (with relative order preserved for each key).
A single shuffle stage over all keys and their values yields an RDD whose partitions are arranged in disjoint, contiguous regions, one region holding all the values for each key; this is much more efficient than the naive approach of separating RDDs by key, which performs an RDD.filter for each key in the RDD.
However, it's worth noting that breaking up an RDD into a collection of RDDs in this way is fairly unidiomatic, and if you find yourself wanting this it's worth pausing and considering restructuring the computation upstream instead.
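A hedged usage sketch, assuming splitByKey is exposed on pair RDDs via an implicit enrichment (the exact call shape is an assumption); the naive filter-per-key alternative it improves on is shown for contrast:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc` and that splitByKey is in scope.
val pairs: RDD[(String, Int)] =
  sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))

// One shuffle; each resulting RDD[Int] views a contiguous region of the
// shuffled data, with the original relative order of values preserved.
val byKey: Map[String, RDD[Int]] = pairs.splitByKey()

// Naive alternative: one full pass over `pairs` per distinct key.
val naive: Map[String, RDD[Int]] =
  pairs.keys.distinct.collect().map { k =>
    k -> pairs.filter(_._1 == k).values
  }.toMap
```

With N distinct keys, the naive version scans the input N times (one filter job per key), while the single-shuffle approach touches each record once.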
Paired RDD to split up by key.
cappedGroupByKey takes the first values seen for each key, discarding the rest; to obtain a random sampling of the elements for each key, see SampleByKeyRDD.
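To make the "first values" vs. "random sample" distinction concrete, a sketch contrasting the capped call (hypothetical call shape) with Spark's built-in per-key sampling; note that sampleByKey keeps an approximate fraction per key, not an exact count:

```scala
import org.apache.spark.rdd.RDD

// Assumes `pairs: RDD[(String, Int)]` and the cappedGroupByKey enrichment
// are in scope; the argument shape here is an assumption.
// Deterministic: keeps at most the first 2 values encountered per key.
val capped: RDD[(String, Seq[Int])] = pairs.cappedGroupByKey(2)

// Probabilistic: Spark's PairRDDFunctions.sampleByKey keeps roughly the
// given fraction of values for each key, chosen at random.
val fractions = Map("a" -> 0.5, "b" -> 0.5)
val sampled: RDD[(String, Int)] =
  pairs.sampleByKey(withReplacement = false, fractions)
```

Use the capped form when you need a hard bound on per-key memory, and sampling when you need values that are representative of each key's full distribution.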