(Since version 1.0.0) use mapPartitionsWithIndex and filter
(Since version 1.0.0) use mapPartitionsWithIndex and flatMap
(Since version 1.0.0) use mapPartitionsWithIndex and foreach
(Since version 1.2.0) use TaskContext.get
(Since version 0.7.0) use mapPartitionsWithIndex
(Since version 1.0.0) use mapPartitionsWithIndex
(Since version 1.0.0) use collect
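The deprecation notes above all point to the same migration path: thread partition information through mapPartitionsWithIndex (or read it from TaskContext.get) instead of the removed helper variants, and use collect in place of the old array-returning method. A minimal sketch of these replacements, assuming the standard RDD API (the data and closures here are illustrative, not from the original docs):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("migration-sketch").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

// mapPartitionsWithIndex + filter replaces the deprecated filter-with-index helper.
val filtered = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.filter(x => x % (index + 1) == 0)
}

// mapPartitionsWithIndex + flatMap follows the same pattern.
val flatMapped = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.flatMap(x => Seq.fill(index + 1)(x))
}

// TaskContext.get (available since 1.2.0) exposes the running task's context directly.
val tagged = rdd.mapPartitions { iter =>
  val pid = TaskContext.get().partitionId()
  iter.map(x => (pid, x))
}

// collect replaces the deprecated array-returning method.
val asArray: Array[Int] = filtered.collect()
```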
This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (dependency) and an optional array of partition start indices (specifiedPartitionStartIndices) as input arguments.

The dependency has the parent RDD of this RDD, which represents the dataset before the shuffle (i.e. the map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition the map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).
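Because each element's key is already a partition id, the dependency's partitioner can simply pass the key through. A minimal sketch of such a partitioner (the class name is hypothetical; Spark's actual internal implementation may differ):

```scala
import org.apache.spark.Partitioner

// Hypothetical pass-through partitioner: each element's key is its
// pre-computed partition id, so getPartition just returns the key.
class PassThroughPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val pid = key.asInstanceOf[Int]
    require(pid >= 0 && pid < numPartitions, s"partition id $pid out of range")
    pid
  }
}
```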
When specifiedPartitionStartIndices is defined, specifiedPartitionStartIndices.length will be the number of post-shuffle partitions. In this case, the i-th post-shuffle partition includes pre-shuffle partitions specifiedPartitionStartIndices[i] to specifiedPartitionStartIndices[i+1] - 1 (inclusive); the last post-shuffle partition extends through the last pre-shuffle partition.
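For example, with 5 pre-shuffle partitions and start indices [0, 2, 4], there are 3 post-shuffle partitions covering pre-shuffle partitions [0, 1], [2, 3], and [4, 4]. A small sketch of this mapping (illustrative only, not Spark's internal code):

```scala
// How start indices define post-shuffle partition ranges over pre-shuffle partitions.
val numPreShufflePartitions = 5
val startIndices = Array(0, 2, 4) // => 3 post-shuffle partitions

val ranges = startIndices.indices.map { i =>
  val start = startIndices(i)
  // The last post-shuffle partition runs through the final pre-shuffle partition.
  val end =
    if (i + 1 < startIndices.length) startIndices(i + 1) - 1
    else numPreShufflePartitions - 1
  (i, start, end)
}

ranges.foreach { case (i, start, end) =>
  println(s"post-shuffle partition $i <- pre-shuffle partitions [$start, $end]")
}
// post-shuffle partition 0 <- pre-shuffle partitions [0, 1]
// post-shuffle partition 1 <- pre-shuffle partitions [2, 3]
// post-shuffle partition 2 <- pre-shuffle partitions [4, 4]
```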
When specifiedPartitionStartIndices is not defined, there will be dependency.partitioner.numPartitions post-shuffle partitions. In this case, a post-shuffle partition is created for every pre-shuffle partition.
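Put differently, the undefined case behaves as if the start indices were the identity sequence. A one-line sketch of that equivalence (variable names are illustrative):

```scala
// Equivalent identity mapping: post-shuffle partition i covers exactly
// pre-shuffle partition i.
val numPreShufflePartitions = 5 // example value
val defaultStartIndices: Array[Int] = (0 until numPreShufflePartitions).toArray
// Array(0, 1, 2, 3, 4)
```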