Class VertexParallelismAndInputInfosDeciderUtils
- java.lang.Object
-
- org.apache.flink.runtime.scheduler.adaptivebatch.util.VertexParallelismAndInputInfosDeciderUtils
-
public class VertexParallelismAndInputInfosDeciderUtils extends Object
Utils class for VertexParallelismAndInputInfosDecider.
-
-
Constructor Summary
Constructors Constructor Description VertexParallelismAndInputInfosDeciderUtils()
-
Method Summary
Modifier and Type Method Description
static Optional<List<IndexRange>>
adjustToClosestLegalParallelism(long currentDataVolumeLimit, int currentParallelism, int minParallelism, int maxParallelism, long minLimit, long maxLimit, Function<Long,Integer> parallelismComputer, Function<Long,List<IndexRange>> subpartitionRangesComputer)
Adjust the parallelism to the closest legal parallelism and return the computed subpartition ranges.
static long
calculateDataVolumePerTaskForInput(long globalDataVolumePerTask, long inputsGroupBytes, long totalDataBytes)
static long
calculateDataVolumePerTaskForInputsGroup(long globalDataVolumePerTask, List<BlockingInputInfo> inputsGroup, List<BlockingInputInfo> allInputs)
static <T> List<List<T>>
cartesianProduct(List<List<T>> lists)
Computes the Cartesian product of a list of lists.
static boolean
checkAndGetIntraCorrelation(List<BlockingInputInfo> inputInfos)
static int
checkAndGetParallelism(Collection<JobVertexInputInfo> vertexInputInfos)
static int
checkAndGetSubpartitionNum(List<BlockingInputInfo> consumedResults)
static int
checkAndGetSubpartitionNumForAggregatedInputs(Collection<AggregatedBlockingInputInfo> inputInfos)
static long
computeSkewThreshold(long medianSize, double skewedFactor, long defaultSkewedThreshold)
Computes the skew threshold based on the given median size and skewed factor.
static long
computeTargetSize(long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask)
Computes the target data size for each task based on the sizes of non-skewed subpartitions.
static JobVertexInputInfo
createdJobVertexInputInfoForBroadcast(BlockingInputInfo inputInfo, int parallelism)
static JobVertexInputInfo
createdJobVertexInputInfoForNonBroadcast(BlockingInputInfo inputInfo, List<IndexRange> subpartitionSliceRanges, List<SubpartitionSlice> subpartitionSlices)
static Map<IntermediateDataSetID,JobVertexInputInfo>
createJobVertexInputInfos(List<BlockingInputInfo> inputInfos, Map<Integer,List<SubpartitionSlice>> subpartitionSlices, List<IndexRange> subpartitionSliceRanges, Function<Integer,Integer> subpartitionSliceKeyResolver)
static int
getMaxNumPartitions(List<BlockingInputInfo> consumedResults)
static List<BlockingInputInfo>
getNonBroadcastInputInfos(List<BlockingInputInfo> consumedResults)
static boolean
hasSameNumPartitions(List<BlockingInputInfo> inputInfos)
static boolean
isLegalParallelism(int parallelism, int minParallelism, int maxParallelism)
static void
logBalancedDataDistributionOptimizationResult(org.slf4j.Logger logger, JobVertexID jobVertexId, BlockingInputInfo inputInfo, JobVertexInputInfo optimizedJobVertexInputInfo)
Logs the data distribution optimization info when a balanced data distribution algorithm is effectively optimized compared to the num-based data distribution algorithm.
static long
median(long[] nums)
Calculates the median of a given array of long integers.
static Optional<List<IndexRange>>
tryComputeSubpartitionSliceRange(int minParallelism, int maxParallelism, long maxDataVolumePerTask, Map<Integer,List<SubpartitionSlice>> subpartitionSlices)
Attempts to compute the subpartition slice ranges to ensure even distribution of data across downstream tasks.
-
-
-
Method Detail
-
adjustToClosestLegalParallelism
public static Optional<List<IndexRange>> adjustToClosestLegalParallelism(long currentDataVolumeLimit, int currentParallelism, int minParallelism, int maxParallelism, long minLimit, long maxLimit, Function<Long,Integer> parallelismComputer, Function<Long,List<IndexRange>> subpartitionRangesComputer)
Adjust the parallelism to the closest legal parallelism and return the computed subpartition ranges.
- Parameters:
currentDataVolumeLimit - current data volume limit
currentParallelism - current parallelism
minParallelism - the min parallelism
maxParallelism - the max parallelism
minLimit - the minimum data volume limit
maxLimit - the maximum data volume limit
parallelismComputer - a function to compute the parallelism according to the data volume limit
subpartitionRangesComputer - a function to compute the subpartition ranges according to the data volume limit
- Returns:
- the computed subpartition ranges, or Optional.empty() if no legal parallelism can be found
-
cartesianProduct
public static <T> List<List<T>> cartesianProduct(List<List<T>> lists)
Computes the Cartesian product of a list of lists.
The Cartesian product is the set of all possible combinations formed by picking one element from each list. For example, given input lists [[1, 2], [3, 4]], the result will be [[1, 3], [1, 4], [2, 3], [2, 4]].
Note: If the input list is empty or contains an empty list, the result will be an empty list.
- Type Parameters:
T - the type of elements in the lists
- Parameters:
lists - a list of lists for which the Cartesian product is to be computed
- Returns:
- a list of lists representing the Cartesian product, where each inner list is a combination
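The behavior described above can be illustrated with a minimal sketch. This is not Flink's implementation; the class name CartesianProductSketch is hypothetical, and the iterative prefix-extension approach is only one way to produce the documented result and ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Flink.
final class CartesianProductSketch {

    // Builds all combinations by extending every partial combination
    // with each element of the next list.
    static <T> List<List<T>> cartesianProduct(List<List<T>> lists) {
        List<List<T>> result = new ArrayList<>();
        if (lists.isEmpty()) {
            // Documented edge case: empty input yields an empty result.
            return result;
        }
        // Start from a single empty combination.
        result.add(new ArrayList<>());
        for (List<T> list : lists) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> prefix : result) {
                for (T item : list) {
                    List<T> combination = new ArrayList<>(prefix);
                    combination.add(item);
                    next.add(combination);
                }
            }
            // If 'list' is empty, 'next' stays empty and so does the final result.
            result = next;
        }
        return result;
    }
}
```

Calling the sketch with [[1, 2], [3, 4]] produces [[1, 3], [1, 4], [2, 3], [2, 4]], matching the example in the description.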
-
median
public static long median(long[] nums)
Calculates the median of a given array of long integers. If the calculated median is less than 1, it returns 1 instead.
- Parameters:
nums - an array of long integers for which to calculate the median.
- Returns:
- the median value, which will be at least 1.
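A minimal sketch of a median with a floor of 1 follows. It is an illustration, not Flink's code: the class name MedianSketch is hypothetical, and two details are assumptions that the description does not specify, namely that an even-length array averages its two middle values with integer division, and that an empty array is rejected.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Flink.
final class MedianSketch {

    static long median(long[] nums) {
        if (nums.length == 0) {
            // Assumption: the description does not say how empty input is handled.
            throw new IllegalArgumentException("nums must not be empty");
        }
        // Sort a copy so the caller's array is left untouched.
        long[] sorted = nums.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        // Assumption: even-length input averages the two middle values.
        long median = (sorted.length % 2 == 0)
                ? (sorted[mid - 1] + sorted[mid]) / 2
                : sorted[mid];
        // Documented behavior: never return less than 1.
        return Math.max(median, 1L);
    }
}
```

The floor of 1 keeps later computations that multiply or divide by the median from degenerating when most subpartitions are empty.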
-
computeSkewThreshold
public static long computeSkewThreshold(long medianSize, double skewedFactor, long defaultSkewedThreshold)
Computes the skew threshold based on the given median size and skewed factor.
The skew threshold is calculated as the product of the median size and the skewed factor. To ensure that the computed threshold does not fall below a specified default value, the method uses Math.max to return the larger of the calculated threshold and the default threshold.
- Parameters:
medianSize - the size of the median
skewedFactor - a factor indicating the degree of skewness
defaultSkewedThreshold - the default threshold to be used if the calculated threshold is less than this value
- Returns:
- the computed skew threshold, which is guaranteed to be at least the default skewed threshold.
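The calculation described above amounts to a product followed by a lower bound. The sketch below is an illustration under assumptions, not Flink's source: the class name SkewThresholdSketch is hypothetical, and truncating the floating-point product to a long is an assumed rounding choice.

```java
// Hypothetical illustration class; not part of Flink.
final class SkewThresholdSketch {

    static long computeSkewThreshold(
            long medianSize, double skewedFactor, long defaultSkewedThreshold) {
        // Assumption: the product is truncated toward zero when converted to long.
        long calculated = (long) (medianSize * skewedFactor);
        // The result never falls below the default threshold.
        return Math.max(calculated, defaultSkewedThreshold);
    }
}
```

For example, a median size of 100 with a factor of 4.0 and a default of 256 gives 400, while a median size of 10 with the same factor and default gives 256.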
-
computeTargetSize
public static long computeTargetSize(long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask)
Computes the target data size for each task based on the sizes of non-skewed subpartitions.
The target size is determined as the average size of non-skewed subpartitions and ensures that the target size is at least equal to the specified data volume per task.
- Parameters:
subpartitionBytes - an array representing the data size of each subpartition
skewedThreshold - skewed threshold in bytes
dataVolumePerTask - the amount of data that should be allocated per task
- Returns:
- the computed target size for each task, which is the maximum between the average size of non-skewed subpartitions and data volume per task.
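A minimal sketch of the target-size rule follows. It is not Flink's implementation: the class name TargetSizeSketch is hypothetical, and it assumes that a subpartition counts as non-skewed when its size is at or below the threshold and that the data volume per task is returned when every subpartition is skewed. Neither detail is stated in the description.

```java
// Hypothetical illustration class; not part of Flink.
final class TargetSizeSketch {

    static long computeTargetSize(
            long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask) {
        long nonSkewedTotal = 0;
        int nonSkewedCount = 0;
        for (long size : subpartitionBytes) {
            // Assumption: "non-skewed" means size <= threshold.
            if (size <= skewedThreshold) {
                nonSkewedTotal += size;
                nonSkewedCount++;
            }
        }
        if (nonSkewedCount == 0) {
            // Assumption: with no non-skewed subpartitions, fall back to the per-task volume.
            return dataVolumePerTask;
        }
        // Average of non-skewed sizes, but never below the per-task volume.
        return Math.max(nonSkewedTotal / nonSkewedCount, dataVolumePerTask);
    }
}
```

With sizes {10, 20, 1000} and a threshold of 100, the skewed 1000-byte subpartition is ignored and the average is 15; the result is 15 for a per-task volume of 5 and 50 for a per-task volume of 50.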
-
getNonBroadcastInputInfos
public static List<BlockingInputInfo> getNonBroadcastInputInfos(List<BlockingInputInfo> consumedResults)
-
hasSameNumPartitions
public static boolean hasSameNumPartitions(List<BlockingInputInfo> inputInfos)
-
getMaxNumPartitions
public static int getMaxNumPartitions(List<BlockingInputInfo> consumedResults)
-
checkAndGetSubpartitionNum
public static int checkAndGetSubpartitionNum(List<BlockingInputInfo> consumedResults)
-
checkAndGetSubpartitionNumForAggregatedInputs
public static int checkAndGetSubpartitionNumForAggregatedInputs(Collection<AggregatedBlockingInputInfo> inputInfos)
-
isLegalParallelism
public static boolean isLegalParallelism(int parallelism, int minParallelism, int maxParallelism)
-
checkAndGetIntraCorrelation
public static boolean checkAndGetIntraCorrelation(List<BlockingInputInfo> inputInfos)
-
checkAndGetParallelism
public static int checkAndGetParallelism(Collection<JobVertexInputInfo> vertexInputInfos)
-
tryComputeSubpartitionSliceRange
public static Optional<List<IndexRange>> tryComputeSubpartitionSliceRange(int minParallelism, int maxParallelism, long maxDataVolumePerTask, Map<Integer,List<SubpartitionSlice>> subpartitionSlices)
Attempts to compute the subpartition slice ranges to ensure even distribution of data across downstream tasks.
This method first tries to compute the subpartition slice ranges by evenly distributing the data volume. If that fails, it attempts to compute the ranges by evenly distributing the number of subpartition slices.
- Parameters:
minParallelism - The minimum parallelism.
maxParallelism - The maximum parallelism.
maxDataVolumePerTask - The maximum data volume per task.
subpartitionSlices - A map of lists of subpartition slices grouped by type or index number.
- Returns:
- An Optional containing a list of index ranges representing the subpartition slice ranges. Returns an empty Optional if no suitable ranges can be computed.
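The two-stage strategy described above (try volume-based ranges first, then fall back to count-based ranges) can be sketched in a simplified form. Everything here is a hypothetical illustration rather than Flink's algorithm: the class SliceRangeFallbackSketch, its method names, the use of a plain long[] of slice sizes in place of the SubpartitionSlice map, and the use of int[]{start, end} pairs in place of IndexRange are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration class; not part of Flink.
final class SliceRangeFallbackSketch {

    // Stage 1: greedily pack consecutive slices until the volume limit would be exceeded.
    static Optional<List<int[]>> byVolume(long[] sizes, long limit, int minP, int maxP) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        long acc = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (i > start && acc + sizes[i] > limit) {
                ranges.add(new int[] {start, i - 1});
                start = i;
                acc = 0;
            }
            acc += sizes[i];
        }
        if (sizes.length > 0) {
            ranges.add(new int[] {start, sizes.length - 1});
        }
        // The number of ranges is the resulting parallelism; it must be legal.
        boolean legal = ranges.size() >= minP && ranges.size() <= maxP;
        return legal ? Optional.of(ranges) : Optional.empty();
    }

    // Stage 2: split the slices into equally sized groups by count.
    static Optional<List<int[]>> byCount(int n, int minP, int maxP) {
        int p = Math.min(n, maxP);
        if (p < 1 || p < minP) {
            return Optional.empty();
        }
        List<int[]> ranges = new ArrayList<>();
        for (int i = 0; i < p; i++) {
            ranges.add(new int[] {i * n / p, (i + 1) * n / p - 1});
        }
        return Optional.of(ranges);
    }

    // Volume-based attempt first, count-based attempt as the fallback.
    static Optional<List<int[]>> tryCompute(long[] sizes, long limit, int minP, int maxP) {
        Optional<List<int[]>> result = byVolume(sizes, limit, minP, maxP);
        return result.isPresent() ? result : byCount(sizes.length, minP, maxP);
    }
}
```

In this sketch, four slices of 100 bytes with a limit of 50 and a maximum parallelism of 2 cannot be packed by volume (that would need four tasks), so the count-based fallback returns the two ranges [0, 1] and [2, 3].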
-
createJobVertexInputInfos
public static Map<IntermediateDataSetID,JobVertexInputInfo> createJobVertexInputInfos(List<BlockingInputInfo> inputInfos, Map<Integer,List<SubpartitionSlice>> subpartitionSlices, List<IndexRange> subpartitionSliceRanges, Function<Integer,Integer> subpartitionSliceKeyResolver)
-
createdJobVertexInputInfoForBroadcast
public static JobVertexInputInfo createdJobVertexInputInfoForBroadcast(BlockingInputInfo inputInfo, int parallelism)
-
createdJobVertexInputInfoForNonBroadcast
public static JobVertexInputInfo createdJobVertexInputInfoForNonBroadcast(BlockingInputInfo inputInfo, List<IndexRange> subpartitionSliceRanges, List<SubpartitionSlice> subpartitionSlices)
-
calculateDataVolumePerTaskForInputsGroup
public static long calculateDataVolumePerTaskForInputsGroup(long globalDataVolumePerTask, List<BlockingInputInfo> inputsGroup, List<BlockingInputInfo> allInputs)
-
calculateDataVolumePerTaskForInput
public static long calculateDataVolumePerTaskForInput(long globalDataVolumePerTask, long inputsGroupBytes, long totalDataBytes)
-
logBalancedDataDistributionOptimizationResult
public static void logBalancedDataDistributionOptimizationResult(org.slf4j.Logger logger, JobVertexID jobVertexId, BlockingInputInfo inputInfo, JobVertexInputInfo optimizedJobVertexInputInfo)
Logs the data distribution optimization info when a balanced data distribution algorithm is effectively optimized compared to the num-based data distribution algorithm.
- Parameters:
logger - The logger instance used for logging output.
jobVertexId - The id for the job vertex.
inputInfo - The original input info.
optimizedJobVertexInputInfo - The optimized job vertex input info.
-
-