Class VertexParallelismAndInputInfosDeciderUtils


  • public class VertexParallelismAndInputInfosDeciderUtils
    extends Object
    Utils class for VertexParallelismAndInputInfosDecider.
    • Constructor Detail

      • VertexParallelismAndInputInfosDeciderUtils

        public VertexParallelismAndInputInfosDeciderUtils()
    • Method Detail

      • adjustToClosestLegalParallelism

        public static Optional<List<IndexRange>> adjustToClosestLegalParallelism​(long currentDataVolumeLimit,
                                                                                 int currentParallelism,
                                                                                 int minParallelism,
                                                                                 int maxParallelism,
                                                                                 long minLimit,
                                                                                 long maxLimit,
                                                                                 Function<Long,​Integer> parallelismComputer,
                                                                                 Function<Long,​List<IndexRange>> subpartitionRangesComputer)
        Adjust the parallelism to the closest legal parallelism and return the computed subpartition ranges.
        Parameters:
        currentDataVolumeLimit - current data volume limit
        currentParallelism - current parallelism
        minParallelism - the min parallelism
        maxParallelism - the max parallelism
        minLimit - the minimum data volume limit
        maxLimit - the maximum data volume limit
        parallelismComputer - a function to compute the parallelism according to the data volume limit
        subpartitionRangesComputer - a function to compute the subpartition ranges according to the data volume limit
        Returns:
        the computed subpartition ranges or Optional.empty() if we can't find any legal parallelism
      • cartesianProduct

        public static <T> List<List<T>> cartesianProduct​(List<List<T>> lists)
        Computes the Cartesian product of a list of lists.

        The Cartesian product is a set of all possible combinations formed by picking one element from each list. For example, given input lists [[1, 2], [3, 4]], the result will be [[1, 3], [1, 4], [2, 3], [2, 4]].

        Note: If the input list is empty or contains an empty list, the result will be an empty list.

        Type Parameters:
        T - the type of elements in the lists
        Parameters:
        lists - a list of lists for which the Cartesian product is to be computed
        Returns:
        a list of lists representing the Cartesian product, where each inner list is a combination
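
        The behavior described above can be sketched in plain Java as follows. This is an illustrative re-implementation, not the Flink source; the class name CartesianProductSketch is hypothetical, and the iterative prefix-extension approach is an assumption about one reasonable way to produce the documented result and ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CartesianProductSketch {

    // Illustrative only: builds the product by extending every partial
    // combination with each element of the next list.
    public static <T> List<List<T>> cartesianProduct(List<List<T>> lists) {
        List<List<T>> result = new ArrayList<>();
        if (lists.isEmpty()) {
            return result; // empty input gives an empty result, as the note states
        }
        result.add(new ArrayList<>()); // seed with a single empty combination
        for (List<T> list : lists) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> prefix : result) {
                for (T item : list) {
                    List<T> combination = new ArrayList<>(prefix);
                    combination.add(item);
                    next.add(combination);
                }
            }
            result = next; // stays empty once any input list is empty
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> input =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        System.out.println(cartesianProduct(input)); // [[1, 3], [1, 4], [2, 3], [2, 4]]
    }
}
```

        The result size is the product of the input list sizes, so callers would typically keep both the number of lists and their lengths small.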
      • median

        public static long median​(long[] nums)
        Calculates the median of a given array of long integers. If the calculated median is less than 1, it returns 1 instead.
        Parameters:
        nums - an array of long integers for which to calculate the median.
        Returns:
        the median value, which will be at least 1.
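
        A minimal sketch of a median with a lower bound of 1 is shown below. It is not taken from the Flink source: the class name MedianSketch is hypothetical, and averaging the two middle values with integer division for even-length arrays is an assumption, since the description above does not say how that case is handled. The sketch also assumes a non-empty array.

```java
import java.util.Arrays;

public class MedianSketch {

    // Illustrative only: sorts a copy, picks the middle value (or the
    // integer average of the two middle values), and clamps the result to 1.
    public static long median(long[] nums) {
        long[] sorted = Arrays.copyOf(nums, nums.length);
        Arrays.sort(sorted);
        int len = sorted.length;
        long middle;
        if (len % 2 == 0) {
            // Assumption: even-length input averages the two middle values.
            middle = (sorted[len / 2 - 1] + sorted[len / 2]) / 2;
        } else {
            middle = sorted[len / 2];
        }
        return Math.max(middle, 1L); // never return less than 1
    }

    public static void main(String[] args) {
        System.out.println(median(new long[] {1, 3, 2})); // 2
        System.out.println(median(new long[] {0, 0, 0})); // 1
    }
}
```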
      • computeSkewThreshold

        public static long computeSkewThreshold​(long medianSize,
                                                double skewedFactor,
                                                long defaultSkewedThreshold)
        Computes the skew threshold based on the given median size and skewed factor.

        The skew threshold is calculated as the product of the median size and the skewed factor. To ensure that the computed threshold does not fall below the specified default value, the method uses Math.max to return the larger of the calculated threshold and the default threshold.

        Parameters:
        medianSize - the median size
        skewedFactor - a factor indicating the degree of skewness
        defaultSkewedThreshold - the default threshold to be used if the calculated threshold is less than this value
        Returns:
        the computed skew threshold, which is guaranteed to be at least the default skewed threshold.
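
        The calculation described above amounts to a single comparison. The sketch below is illustrative rather than the Flink source: the class name SkewThresholdSketch is hypothetical, and truncating the floating-point product to a long by casting is an assumption about rounding that the description does not specify.

```java
public class SkewThresholdSketch {

    // Illustrative only: product of median size and skewed factor,
    // floored at the default threshold.
    public static long computeSkewThreshold(
            long medianSize, double skewedFactor, long defaultSkewedThreshold) {
        double calculated = medianSize * skewedFactor;
        // Assumption: the result is truncated toward zero when cast to long.
        return (long) Math.max(calculated, defaultSkewedThreshold);
    }

    public static void main(String[] args) {
        System.out.println(computeSkewThreshold(100L, 4.0, 256L)); // 400
        System.out.println(computeSkewThreshold(10L, 4.0, 256L));  // 256
    }
}
```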
      • computeTargetSize

        public static long computeTargetSize​(long[] subpartitionBytes,
                                             long skewedThreshold,
                                             long dataVolumePerTask)
        Computes the target data size for each task based on the sizes of non-skewed subpartitions.

        The target size is the average size of the non-skewed subpartitions, raised to the specified data volume per task whenever that average is smaller.

        Parameters:
        subpartitionBytes - an array representing the data size of each subpartition
        skewedThreshold - skewed threshold in bytes
        dataVolumePerTask - the amount of data that should be allocated per task
        Returns:
        the computed target size for each task, which is the maximum of the average size of the non-skewed subpartitions and the data volume per task.
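
        As a rough illustration of this calculation, the sketch below averages the subpartitions that are not skewed and then applies the per-task floor. It is not the Flink source: the class name TargetSizeSketch is hypothetical, treating a subpartition as non-skewed when its size is less than or equal to the threshold is an assumption, and so is using an average of 0 when every subpartition is skewed.

```java
public class TargetSizeSketch {

    // Illustrative only: average of the non-skewed sizes, floored at
    // dataVolumePerTask.
    public static long computeTargetSize(
            long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask) {
        long nonSkewedBytes = 0;
        int nonSkewedCount = 0;
        for (long size : subpartitionBytes) {
            // Assumption: sizes at or below the threshold count as non-skewed.
            if (size <= skewedThreshold) {
                nonSkewedBytes += size;
                nonSkewedCount++;
            }
        }
        // Assumption: with no non-skewed subpartitions, fall back to 0.
        long average = nonSkewedCount == 0 ? 0 : nonSkewedBytes / nonSkewedCount;
        return Math.max(average, dataVolumePerTask);
    }

    public static void main(String[] args) {
        long[] sizes = {10, 20, 30, 1000};
        System.out.println(computeTargetSize(sizes, 100L, 15L)); // 20
        System.out.println(computeTargetSize(sizes, 100L, 50L)); // 50
    }
}
```

        In the first call the skewed subpartition of 1000 bytes is excluded, so the average of 10, 20 and 30 decides the result; in the second call the per-task floor is larger and wins.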
      • hasSameNumPartitions

        public static boolean hasSameNumPartitions​(List<BlockingInputInfo> inputInfos)
      • getMaxNumPartitions

        public static int getMaxNumPartitions​(List<BlockingInputInfo> consumedResults)
      • checkAndGetSubpartitionNum

        public static int checkAndGetSubpartitionNum​(List<BlockingInputInfo> consumedResults)
      • isLegalParallelism

        public static boolean isLegalParallelism​(int parallelism,
                                                 int minParallelism,
                                                 int maxParallelism)
      • checkAndGetIntraCorrelation

        public static boolean checkAndGetIntraCorrelation​(List<BlockingInputInfo> inputInfos)
      • tryComputeSubpartitionSliceRange

        public static Optional<List<IndexRange>> tryComputeSubpartitionSliceRange​(int minParallelism,
                                                                                  int maxParallelism,
                                                                                  long maxDataVolumePerTask,
                                                                                  Map<Integer,​List<SubpartitionSlice>> subpartitionSlices)
        Attempts to compute the subpartition slice ranges to ensure even distribution of data across downstream tasks.

        This method first tries to compute the subpartition slice ranges by evenly distributing the data volume. If that fails, it attempts to compute the ranges by evenly distributing the number of subpartition slices.

        Parameters:
        minParallelism - The minimum parallelism.
        maxParallelism - The maximum parallelism.
        maxDataVolumePerTask - The maximum data volume per task.
        subpartitionSlices - A map of lists of subpartition slices grouped by type or index number.
        Returns:
        An Optional containing a list of index ranges representing the subpartition slice ranges. Returns an empty Optional if no suitable ranges can be computed.
      • calculateDataVolumePerTaskForInputsGroup

        public static long calculateDataVolumePerTaskForInputsGroup​(long globalDataVolumePerTask,
                                                                    List<BlockingInputInfo> inputsGroup,
                                                                    List<BlockingInputInfo> allInputs)
      • calculateDataVolumePerTaskForInput

        public static long calculateDataVolumePerTaskForInput​(long globalDataVolumePerTask,
                                                              long inputsGroupBytes,
                                                              long totalDataBytes)
      • logBalancedDataDistributionOptimizationResult

        public static void logBalancedDataDistributionOptimizationResult​(org.slf4j.Logger logger,
                                                                         JobVertexID jobVertexId,
                                                                         BlockingInputInfo inputInfo,
                                                                         JobVertexInputInfo optimizedJobVertexInputInfo)
        Logs the data distribution optimization result when the balanced data distribution algorithm yields an effective improvement over the num-based data distribution algorithm.
        Parameters:
        logger - The logger instance used for logging output.
        jobVertexId - The id for the job vertex.
        inputInfo - The original input info.
        optimizedJobVertexInputInfo - The optimized job vertex input info.