Class PointwiseVertexInputInfoComputer


  • public class PointwiseVertexInputInfoComputer
    extends Object
    Helper class that computes VertexInputInfo for pointwise input.
    • Constructor Detail

      • PointwiseVertexInputInfoComputer

        public PointwiseVertexInputInfoComputer()
    • Method Detail

      • compute

        public Map<IntermediateDataSetID, JobVertexInputInfo> compute(List<BlockingInputInfo> inputInfos,
                                                                      int parallelism,
                                                                      int minParallelism,
                                                                      int maxParallelism,
                                                                      long dataVolumePerTask)
        Decides the parallelism and the input infos so that data is distributed evenly to downstream subtasks for POINTWISE inputs, i.e. different downstream subtasks consume roughly the same amount of data.

        Assume that `inputInfo` has two partitions, each with three subpartitions, whose data sizes in bytes are {0->[1,2,1], 1->[2,1,2]}, and that the expected parallelism is 3. The calculation proceeds as follows:
        1. Create subpartition slices for the input, which consists of several subpartitions. The resulting slice list, given by data bytes, is: [1,2,1,2,1,2]
        2. Split the subpartition slice list into n balanced parts (each described by an `IndexRange`, together named SubpartitionSliceRanges) based on data volume. Here n = 3: [0,1],[2,3],[4,5]
        3. Reorganize the split result into a mapping from partition range to subpartition range: {0->[0,1]}, {0->[2,2], 1->[0,0]}, {1->[1,2]}.
        The final result is the `SubpartitionGroup` that each of the three parallel tasks needs to subscribe to.
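
        The three steps above can be sketched in plain Java. This is a minimal illustration, not the class's actual implementation: the names `PointwiseSplitSketch`, `createSlices`, `splitBalanced` and `toSubpartitionGroups` are hypothetical, slices and ranges are modeled as plain arrays instead of Flink types, and the greedy split is an assumed balancing strategy that ignores `minParallelism`, `maxParallelism` and `dataVolumePerTask`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Flink.
class PointwiseSplitSketch {

    // Step 1: flatten bytes[partition][subpartition] into one slice list.
    // Each slice is {partitionIndex, subpartitionIndex, dataBytes}.
    static List<long[]> createSlices(long[][] bytes) {
        List<long[]> slices = new ArrayList<>();
        for (int p = 0; p < bytes.length; p++) {
            for (int s = 0; s < bytes[p].length; s++) {
                slices.add(new long[] {p, s, bytes[p][s]});
            }
        }
        return slices;
    }

    // Step 2: split the slice list into n contiguous, inclusive index ranges.
    // Assumed greedy rule: close range k once the running total reaches
    // k/n of the overall volume, always leaving one slice per remaining range.
    static List<int[]> splitBalanced(List<long[]> slices, int n) {
        if (n < 1 || n > slices.size()) {
            throw new IllegalArgumentException("n must be in [1, number of slices]");
        }
        long total = 0;
        for (long[] slice : slices) {
            total += slice[2];
        }
        List<int[]> ranges = new ArrayList<>();
        long cumulative = 0;
        int start = 0;
        for (int i = 0; i < slices.size() && ranges.size() < n - 1; i++) {
            cumulative += slices.get(i)[2];
            int rangesAfterThis = n - ranges.size() - 1;
            int slicesLeft = slices.size() - i - 1;
            boolean reachedTarget = cumulative * n >= total * (ranges.size() + 1);
            boolean mustClose = slicesLeft == rangesAfterThis;
            if (reachedTarget || mustClose) {
                ranges.add(new int[] {start, i});
                start = i + 1;
            }
        }
        ranges.add(new int[] {start, slices.size() - 1});
        return ranges;
    }

    // Step 3: for each slice range, map partition index -> inclusive
    // subpartition range {first, last}.
    static List<Map<Integer, int[]>> toSubpartitionGroups(
            List<long[]> slices, List<int[]> ranges) {
        List<Map<Integer, int[]>> groups = new ArrayList<>();
        for (int[] range : ranges) {
            Map<Integer, int[]> group = new LinkedHashMap<>();
            for (int i = range[0]; i <= range[1]; i++) {
                int partition = (int) slices.get(i)[0];
                int sub = (int) slices.get(i)[1];
                int[] subRange = group.computeIfAbsent(partition, k -> new int[] {sub, sub});
                subRange[1] = sub; // slices are visited in ascending subpartition order
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<long[]> slices = createSlices(new long[][] {{1, 2, 1}, {2, 1, 2}});
        List<int[]> ranges = splitBalanced(slices, 3);
        List<Map<Integer, int[]>> groups = toSubpartitionGroups(slices, ranges);
        for (int task = 0; task < groups.size(); task++) {
            StringBuilder line = new StringBuilder("task " + task + ":");
            for (Map.Entry<Integer, int[]> e : groups.get(task).entrySet()) {
                line.append(" ").append(e.getKey()).append("->")
                    .append(Arrays.toString(e.getValue()));
            }
            System.out.println(line);
        }
    }
}
```

        For the example input, this sketch yields the slice ranges [0,1],[2,3],[4,5] and the same partition-to-subpartition mapping as in step 3. The comparison `cumulative * n >= total * k` avoids integer division, so the target of each range is not rounded down.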

        Parameters:
        inputInfos - the information of the consumed blocking results
        parallelism - the parallelism of the job vertex
        minParallelism - the minimum parallelism
        maxParallelism - the maximum parallelism
        dataVolumePerTask - the proposed data volume per task for this set of input infos
        Returns:
        the vertex input infos, keyed by IntermediateDataSetID, computed for the decided parallelism