A more general method to do cogroup.
A more general method to do cogroup. Allows you to specify the transformations on the RDDs that will produce the join keys
rdd to cogroup with
transform function to generate key for first rdd
transform function to generate key for second rdd
rdd1 cogroup rdd2
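For illustration, the semantics of a cogroup with per-RDD key transforms can be sketched on plain Scala collections (the name `cogroupBy` and the argument shape are assumptions for this sketch; the real method operates on RDDs, roughly like `rdd1.keyBy(f1).cogroup(rdd2.keyBy(f2))`):

```scala
// Sketch: cogroup two collections after deriving a join key from each side.
def cogroupBy[A, B, K](left: Seq[A], right: Seq[B])(
    keyLeft: A => K, keyRight: B => K): Map[K, (Seq[A], Seq[B])] = {
  val l = left.groupBy(keyLeft)
  val r = right.groupBy(keyRight)
  (l.keySet ++ r.keySet).map { k =>
    k -> (l.getOrElse(k, Seq.empty), r.getOrElse(k, Seq.empty))
  }.toMap
}

val users  = Seq((1, "alice"), (2, "bob"))
val orders = Seq((1, "book"), (1, "pen"), (3, "mug"))
val grouped = cogroupBy(users, orders)(_._1, _._1)
// grouped(1) == (Seq((1, "alice")), Seq((1, "book"), (1, "pen")))
```

Note that keys present on only one side still appear in the result, paired with an empty sequence for the other side.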
An efficient implementation of count by key
An efficient implementation of count by key
function to group by
rdd of keys and their counts
A more efficient implementation of groupBy and then reduceByKey
A more efficient implementation of groupBy and then reduceByKey
function to group by
function to generate the value
combiner function
rdd of keys and the reduced values
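The fused groupBy-plus-reduce can be sketched on a plain collection: derive a key and a value from each element, then fold values per key with the combiner, the way `rdd.map(x => (keyFn(x), valueFn(x))).reduceByKey(combine)` would. The name `reduceBy` is an assumption for this sketch:

```scala
// Sketch: group and reduce in a single pass, never materializing the groups.
def reduceBy[A, K, V](xs: Seq[A])(keyFn: A => K, valueFn: A => V)(
    combine: (V, V) => V): Map[K, V] =
  xs.foldLeft(Map.empty[K, V]) { (acc, x) =>
    val k = keyFn(x)
    val v = valueFn(x)
    acc.updated(k, acc.get(k).map(combine(v, _)).getOrElse(v))
  }

val words = Seq("spark", "scala", "shark")
val counts = reduceBy(words)(_.head, _ => 1)(_ + _)
// counts == Map('s' -> 3)
```

This is more efficient than groupBy followed by reduceByKey because no intermediate per-key collections are built.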
An efficient implementation of sum by key
An efficient implementation of sum by key
function to group by
function to generate the value
rdd of keys and their sums
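The sum-by-key semantics can be sketched on a plain collection (the name `sumBy` is an assumption; the real method works on an RDD and would reduce per key rather than group):

```scala
// Sketch: derive a key and a numeric value from each element, then sum per key.
def sumBy[A, K](xs: Seq[A])(keyFn: A => K, valueFn: A => Long): Map[K, Long] =
  xs.groupBy(keyFn).map { case (k, group) => k -> group.map(valueFn).sum }

val sales = Seq(("books", 10L), ("pens", 2L), ("books", 5L))
val totals = sumBy(sales)(_._1, _._2)
// totals == Map("books" -> 15L, "pens" -> 2L)
```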
A more general method to do joins.
A more general method to do joins. Allows you to specify the transformations on the RDDs that will produce the join keys
rdd to join with
transform function to generate key for first rdd
transform function to generate key for second rdd
rdd1 join rdd2
A more general method to do left outer joins.
A more general method to do left outer joins. Allows you to specify the transformations on the RDDs that will produce the join keys
rdd to join with
transform function to generate key for first rdd
transform function to generate key for second rdd
rdd1 leftOuterJoin rdd2
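The generalized left outer join can be sketched on plain Scala collections (the name `leftOuterJoinBy` and the argument shape are assumptions; on RDDs this is roughly `rdd1.keyBy(f1).leftOuterJoin(rdd2.keyBy(f2))`):

```scala
// Sketch: keep every left element; pair it with each matching right element,
// or with None when the derived key has no match on the right.
def leftOuterJoinBy[A, B, K](left: Seq[A], right: Seq[B])(
    keyLeft: A => K, keyRight: B => K): Seq[(A, Option[B])] = {
  val r = right.groupBy(keyRight)
  left.flatMap { a =>
    r.get(keyLeft(a)) match {
      case Some(bs) => bs.map(b => (a, Some(b)))
      case None     => Seq((a, None))
    }
  }
}

val employees = Seq(("alice", 10), ("bob", 20))
val depts     = Seq((10, "eng"))
val joined = leftOuterJoinBy(employees, depts)(_._2, _._1)
// joined == Seq((("alice", 10), Some((10, "eng"))), (("bob", 20), None))
```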
Map over the RDD with a long accumulator
Map over the RDD with a long accumulator
map function
accumulator name
rdd.map(f)
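The accumulator-backed map can be sketched without Spark by threading a counter through the map function (in the real method a named Spark `LongAccumulator` plays this role across tasks; `mapWithCounter` is a name assumed for this sketch):

```scala
import java.util.concurrent.atomic.AtomicLong

// Sketch: map over a collection while counting processed records, the way a
// named long accumulator would count records across Spark tasks.
def mapWithCounter[A, B](xs: Seq[A], counter: AtomicLong)(f: A => B): Seq[B] =
  xs.map { x => counter.incrementAndGet(); f(x) }

val processed = new AtomicLong(0)
val doubled = mapWithCounter(Seq(1, 2, 3), processed)(_ * 2)
// doubled == Seq(2, 4, 6); processed.get == 3
```

Naming the accumulator is what makes the count visible in the Spark UI; note that Spark accumulator updates from re-executed tasks can inflate the count when used inside transformations.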
A more general method to do full outer joins.
A more general method to do full outer joins. Allows you to specify the transformations on the RDDs that will produce the join keys
rdd to join with
transform function to generate key for first rdd
transform function to generate key for second rdd
rdd1 fullOuterJoin rdd2
Repartition to the min of floor(rdd.count / recordsPerPartition) + 1 and the previous number of partitions.
Repartition to the min of floor(rdd.count / recordsPerPartition) + 1 and the previous number of partitions. Attention: this involves executing the rdd.count() operation.
number of records per partition
rdd with same number of partitions or floor(rdd.count / recordsPerPartition) + 1 partitions, whichever is less
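The target-partition arithmetic can be shown as a pure function (the name `targetPartitions` is an assumption for this sketch):

```scala
// Sketch: never grow the partition count; aim for roughly
// recordsPerPartition records in each partition.
def targetPartitions(recordCount: Long, recordsPerPartition: Long,
                     currentPartitions: Int): Int =
  math.min(currentPartitions, (recordCount / recordsPerPartition + 1).toInt)

// 1,000,000 records at 100,000 per partition -> 11 candidate partitions,
// capped at the current 8.
val t = targetPartitions(1000000L, 100000L, 8)
// t == 8
```

Because the count is capped by the current number of partitions, this method only ever coalesces, never splits.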
Repartition to the min of maxPartitionCount and the previous number of partitions.
Repartition to the min of maxPartitionCount and the previous number of partitions. Usually you would want maxPartitionCount to be 200 to get around the 200 bug
max number of partitions (default 200)
shuffled rdd with same number of partitions or maxPartitionCount, whichever is less
Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
path to write the data
optional codec to use
optional hadoop configuration
We should make sure our tasks are idempotent when speculation is enabled, i.e. do not use an output committer that writes data directly. There is an example in https://issues.apache.org/jira/browse/SPARK-10063 showing the bad result of using a direct output committer with speculation enabled.
Splits this RDD into two RDDs according to a predicate.
Splits this RDD into two RDDs according to a predicate.
the predicate to split by.
a pair of RDDs: the RDD that satisfies the predicate p and the RDD that does not.
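The split can be sketched with two complementary filters, which mirrors how an RDD version is typically implemented since RDDs have no built-in `partition` (the name `splitBy` is an assumption for this sketch):

```scala
// Sketch: split a collection into (matching, non-matching) with two filters,
// one pass per side, as an RDD implementation would do with two rdd.filter calls.
def splitBy[A](xs: Seq[A])(p: A => Boolean): (Seq[A], Seq[A]) =
  (xs.filter(p), xs.filterNot(p))

val (evens, odds) = splitBy(Seq(1, 2, 3, 4, 5))(_ % 2 == 0)
// evens == Seq(2, 4); odds == Seq(1, 3, 5)
```

On plain Scala collections this is exactly `xs.partition(p)`; on RDDs, caching the input before the two filters avoids recomputing it for each side.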
An enhanced RDD with more general join methods.