Class AllToAllVertexInputInfoComputer


  • public class AllToAllVertexInputInfoComputer
    extends Object
    Helper class that computes VertexInputInfo for all-to-all-like inputs.
    • Constructor Detail

      • AllToAllVertexInputInfoComputer

        public AllToAllVertexInputInfoComputer(double skewedFactor,
                                               long defaultSkewedThreshold)
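
        For illustration only, construction might look like the following sketch; the skewed factor and threshold values shown are assumptions, not documented defaults:

          // Hypothetical values: a skewed factor of 4.0 and a 256 MiB skewed
          // threshold are assumptions for this sketch, not documented defaults.
          AllToAllVertexInputInfoComputer computer =
                  new AllToAllVertexInputInfoComputer(4.0, 256 * 1024 * 1024L);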
    • Method Detail

      • compute

        public Map<IntermediateDataSetID, JobVertexInputInfo> compute(JobVertexID jobVertexId,
                                                                      List<BlockingInputInfo> inputInfos,
                                                                      int parallelism,
                                                                      int minParallelism,
                                                                      int maxParallelism,
                                                                      long dataVolumePerTask)
        Decide the parallelism and input infos so that data is evenly distributed to downstream subtasks for ALL_TO_ALL inputs, i.e., different downstream subtasks consume roughly the same amount of data.

        Suppose there are two upstream input infos, each with three partitions and two subpartitions, whose data bytes are: input1: 0->[1,1] 1->[2,2] 2->[3,3], input2: 0->[1,1] 1->[1,1] 2->[1,1]. This method processes the data as follows:
        1. Create subpartition slices for inputs with the same type number. Unlike the pointwise computer, this method creates subpartition slices through the following steps (a simplified sketch of the splitting follows this example):
        First, reorganize the data by subpartition index: input1: {0->[1,2,3], 1->[1,2,3]}, input2: {0->[1,1,1], 1->[1,1,1]}.
        Second, split subpartitions with the same index into n relatively balanced parts (if possible): {0->[1,2][3], 1->[1,2][3]}, {0->[1,1,1], 1->[1,1,1]}.
        Then perform a cartesian product operation to ensure data correctness: input1: {0->[1,2], 0->[3], 1->[1,2], 1->[3]}, input2: {0->[1,1,1], 0->[1,1,1], 1->[1,1,1], 1->[1,1,1]}.
        Finally, create subpartition slices based on the result of the previous step, i.e., each input has four balanced subpartition slices.
        2. Based on the above subpartition slices, calculate the subpartition slice range each task needs to subscribe to, considering data volume and parallelism constraints: [0,0],[1,1],[2,2],[3,3]
        3. Convert the calculated subpartition slice range to the form of partition index range -> subpartition index range:
        task0: input1: {[0,1]->[0]} input2:{[0,2]->[0]}
        task1: input1: {[2,2]->[0]} input2:{[0,2]->[0]}
        task2: input1: {[0,1]->[1]} input2:{[0,2]->[1]}
        task3: input1: {[2,2]->[1]} input2:{[0,2]->[1]}
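
        Below is a minimal, self-contained sketch of the balanced splitting in step 1, applied to the example data of input1. The class name SubpartitionSliceExample and the greedy splitting heuristic are assumptions for illustration only and do not reflect the actual implementation.

          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class SubpartitionSliceExample {

              public static void main(String[] args) {
                  // input1: partition -> subpartition bytes, i.e. 0->[1,1] 1->[2,2] 2->[3,3]
                  long[][] input1 = {{1, 1}, {2, 2}, {3, 3}};

                  // Step 1a: reorganize by subpartition index: {0->[1,2,3], 1->[1,2,3]}
                  int numSubpartitions = input1[0].length;
                  for (int sp = 0; sp < numSubpartitions; sp++) {
                      long[] bySubpartition = new long[input1.length];
                      for (int p = 0; p < input1.length; p++) {
                          bySubpartition[p] = input1[p][sp];
                      }
                      // Step 1b: split the partitions of this subpartition into two
                      // relatively balanced parts, e.g. [1,2,3] -> [1,2] and [3]
                      List<int[]> parts = splitBalanced(bySubpartition, 2);
                      System.out.println("subpartition " + sp + " -> " + format(parts, bySubpartition));
                  }
              }

              // Greedily split partitions [0..n) into at most maxParts consecutive ranges
              // whose byte sums are as even as possible (illustrative heuristic only).
              static List<int[]> splitBalanced(long[] bytes, int maxParts) {
                  long total = Arrays.stream(bytes).sum();
                  long target = (total + maxParts - 1) / maxParts;
                  List<int[]> ranges = new ArrayList<>();
                  int start = 0;
                  long acc = 0;
                  for (int i = 0; i < bytes.length; i++) {
                      acc += bytes[i];
                      if (acc >= target && ranges.size() < maxParts - 1) {
                          ranges.add(new int[] {start, i});
                          start = i + 1;
                          acc = 0;
                      }
                  }
                  if (start < bytes.length) {
                      ranges.add(new int[] {start, bytes.length - 1});
                  }
                  return ranges;
              }

              static String format(List<int[]> ranges, long[] bytes) {
                  StringBuilder sb = new StringBuilder();
                  for (int[] r : ranges) {
                      sb.append(Arrays.toString(Arrays.copyOfRange(bytes, r[0], r[1] + 1))).append(' ');
                  }
                  return sb.toString().trim();
              }
          }

        Running this prints "[1, 2] [3]" for both subpartitions of input1, matching the split shown in step 1 above.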

        Parameters:
        jobVertexId - The job vertex id
        inputInfos - The information of consumed blocking results
        parallelism - The parallelism of the job vertex
        minParallelism - The minimum parallelism
        maxParallelism - The maximum parallelism
        dataVolumePerTask - The proposed data volume per task for this set of input infos
        Returns:
        the computed vertex input infos, keyed by the id of the consumed intermediate data set
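
        For illustration only, a call might look like the following sketch; jobVertexId and inputInfos are assumed to be provided by the scheduler, computer is constructed as in the earlier sketch, and the numeric arguments are placeholder values:

          // Hypothetical call site: jobVertexId and inputInfos are assumed to come
          // from the scheduler; the numeric arguments are placeholder values.
          Map<IntermediateDataSetID, JobVertexInputInfo> vertexInputInfos =
                  computer.compute(
                          jobVertexId,           // JobVertexID of the consumer vertex
                          inputInfos,            // List<BlockingInputInfo> of consumed results
                          4,                     // parallelism of the job vertex
                          1,                     // min parallelism
                          128,                   // max parallelism
                          1024L * 1024 * 1024);  // proposed data volume per task (1 GiB)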