@PublicEvolving public final class DataSetUtils extends Object
| Modifier and Type | Method and Description | 
|---|---|
| static <T> Utils.ChecksumHashCode | checksumHashCode(DataSet<T> input) Deprecated. Replaced with org.apache.flink.graph.asm.dataset.ChecksumHashCode in Gelly. | 
| static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,Long>> | countElementsPerPartition(DataSet<T> input) Method that goes over all the elements in each partition in order to retrieve the total number of elements. | 
| static int | getBitSize(long value) | 
| static <T> PartitionOperator<T> | partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, int... fields) Range-partitions a DataSet on the specified tuple field positions. | 
| static <T,K extends Comparable<K>> PartitionOperator<T> | partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, org.apache.flink.api.java.functions.KeySelector<T,K> keyExtractor) Range-partitions a DataSet using the specified key selector function. | 
| static <T> PartitionOperator<T> | partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, String... fields) Range-partitions a DataSet on the specified fields. | 
| static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction) Generates a sample of a DataSet, choosing each element with probability fraction. | 
| static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction, long seed) Generates a sample of a DataSet, choosing each element with probability fraction. | 
| static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples) Generates a sample of a DataSet that contains a fixed number of elements. | 
| static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed) Generates a sample of a DataSet that contains a fixed number of elements. | 
| static <R extends org.apache.flink.api.java.tuple.Tuple,T extends org.apache.flink.api.java.tuple.Tuple> R | summarize(DataSet<T> input) Summarizes a DataSet of Tuples by collecting single-pass statistics for all columns. | 
| static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> | zipWithIndex(DataSet<T> input) Method that assigns a unique Long value to all elements in the input data set. | 
| static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> | zipWithUniqueId(DataSet<T> input) Method that assigns a unique Long value to all elements in the input data set as described below. | 
public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Integer,Long>> countElementsPerPartition(DataSet<T> input)
Method that goes over all the elements in each partition in order to retrieve the total number of elements.
input - the DataSet received as input

public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithIndex(DataSet<T> input)
Method that assigns a unique Long value to all elements in the input data set. The generated values are consecutive.
input - the input data set

public static <T> DataSet<org.apache.flink.api.java.tuple.Tuple2<Long,T>> zipWithUniqueId(DataSet<T> input)
Method that assigns a unique Long value to all elements in the input data set as described below.
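As a rough, plain-Java illustration of how ids can be unique without being consecutive, the sketch below assumes each parallel task keeps a local counter, shifts it left by enough bits to hold the largest task index, and adds its own task index. For contrast it also computes consecutive indexes, which need the element count of every preceding partition first. The class and method names are hypothetical, partitions are modelled only by their sizes, and the scheme is an assumption rather than a statement of what Flink does internally.

```java
import java.util.Arrays;

// Plain-Java sketch with no Flink dependency; all names are hypothetical.
final class IdAssignmentSketch {

    /** Bits needed to hold any task index in [0, parallelism - 1]; parallelism must be >= 1. */
    static int bitsForTaskIndex(int parallelism) {
        return 64 - Long.numberOfLeadingZeros((long) parallelism - 1);
    }

    /** Assumed scheme: id = (local counter << shift) + task index. Unique, not consecutive. */
    static long[][] uniqueIds(int[] partitionSizes) {
        int shift = bitsForTaskIndex(partitionSizes.length);
        long[][] ids = new long[partitionSizes.length][];
        for (int task = 0; task < partitionSizes.length; task++) {
            ids[task] = new long[partitionSizes[task]];
            for (int counter = 0; counter < partitionSizes[task]; counter++) {
                ids[task][counter] = ((long) counter << shift) + task;
            }
        }
        return ids;
    }

    /** Consecutive ids: each partition starts at the total size of the partitions before it. */
    static long[][] consecutiveIndexes(int[] partitionSizes) {
        long[][] ids = new long[partitionSizes.length][];
        long offset = 0;
        for (int task = 0; task < partitionSizes.length; task++) {
            ids[task] = new long[partitionSizes[task]];
            for (int counter = 0; counter < partitionSizes[task]; counter++) {
                ids[task][counter] = offset + counter;
            }
            offset += partitionSizes[task];
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 1, 2};
        System.out.println(Arrays.deepToString(uniqueIds(sizes)));          // [[0, 4, 8], [1], [2, 6]]
        System.out.println(Arrays.deepToString(consecutiveIndexes(sizes))); // [[0, 1, 2], [3], [4, 5]]
    }
}
```

The unique-id variant needs no coordination between tasks, whereas the consecutive variant cannot start until all partition sizes are known.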
input - the input data set

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction)
Generates a sample of a DataSet, choosing each element with probability fraction.
withReplacement - whether an element can be selected more than once.
fraction - probability that each element is chosen; must be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times on average.

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction, long seed)
Generates a sample of a DataSet, choosing each element with probability fraction.
withReplacement - whether an element can be selected more than once.
fraction - probability that each element is chosen; must be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times on average.
seed - random number generator seed.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples)
Generates a sample of a DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is less efficient than sampling with a fraction; use sampling with a fraction unless you need an exact sample size.
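To make the difference between the two sampling styles concrete, here is a minimal plain-Java sketch of sampling without replacement. Fraction-based sampling keeps each element independently, so the result size varies. Fixed-size sampling is shown with reservoir sampling (Algorithm R), a standard technique that is swapped in here for illustration; whether Flink uses this exact algorithm is not stated on this page. The class name and the use of java.util.List in place of a DataSet are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Plain-Java sketch with no Flink dependency; all names are hypothetical.
final class SamplingSketch {

    /** Keeps each element independently with probability fraction (no replacement). */
    static <T> List<T> sampleByFraction(List<T> input, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1] without replacement");
        }
        Random random = new Random(seed);
        List<T> sample = new ArrayList<>();
        for (T element : input) {
            if (random.nextDouble() < fraction) {
                sample.add(element);
            }
        }
        return sample;
    }

    /** Returns exactly min(numSamples, input.size()) elements via reservoir sampling. */
    static <T> List<T> sampleWithSize(List<T> input, int numSamples, long seed) {
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(numSamples);
        int seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < numSamples) {
                reservoir.add(element);              // fill the reservoir first
            } else {
                int slot = random.nextInt(seen);     // uniform in [0, seen)
                if (slot < numSamples) {
                    reservoir.set(slot, element);    // replace with probability numSamples / seen
                }
            }
        }
        return reservoir;
    }
}
```

The fraction-based method needs no state beyond the random generator, while the fixed-size method has to maintain a reservoir and a running count; in a partitioned setting the per-partition reservoirs would also have to be merged, which this sketch does not show.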
withReplacement - whether an element can be selected more than once.
numSamples - the expected sample size.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSamples, long seed)
Generates a sample of a DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is less efficient than sampling with a fraction; use sampling with a fraction unless you need an exact sample size.
withReplacement - whether an element can be selected more than once.
numSamples - the expected sample size.
seed - random number generator seed.

public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, int... fields)
Range-partitions a DataSet on the specified tuple field positions.
public static <T> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, String... fields)
Range-partitions a DataSet on the specified fields.

public static <T,K extends Comparable<K>> PartitionOperator<T> partitionByRange(DataSet<T> input, org.apache.flink.api.common.distributions.DataDistribution distribution, org.apache.flink.api.java.functions.KeySelector<T,K> keyExtractor)
Range-partitions a DataSet using the specified key selector function.
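Range partitioning routes each record by comparing its key with a sorted list of boundaries, so every partition holds one contiguous key range. The sketch below illustrates only that routing step in plain Java. It assumes the boundaries come from the supplied distribution; here they are a hard-coded, hypothetical array of inclusive upper bounds over int keys, and the class and method names are invented for the example.

```java
import java.util.Arrays;

// Plain-Java sketch with no Flink dependency; all names are hypothetical.
final class RangePartitionSketch {

    /**
     * upperBounds must be sorted ascending and free of duplicates. upperBounds[i] is the
     * largest key routed to partition i; keys above the last bound go to the final
     * partition, whose index is upperBounds.length.
     */
    static int partitionFor(int key, int[] upperBounds) {
        int pos = Arrays.binarySearch(upperBounds, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] upperBounds = {10, 20, 30}; // four partitions: <=10, 11..20, 21..30, >30
        for (int key : new int[] {5, 10, 11, 30, 31}) {
            System.out.println(key + " -> partition " + partitionFor(key, upperBounds));
        }
    }
}
```

With skewed data, choosing boundaries that match the real key distribution is what keeps the partitions balanced, which is presumably why the distribution is passed in explicitly.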
public static <R extends org.apache.flink.api.java.tuple.Tuple,T extends org.apache.flink.api.java.tuple.Tuple> R summarize(DataSet<T> input) throws Exception
Summarizes a DataSet of Tuples by collecting single-pass statistics for all columns.
Example usage:
 DataSet<Tuple3<Double, String, Boolean>> input = // [...]
 Tuple3<NumericColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = DataSetUtils.summarize(input);
 summary.f0.getStandardDeviation();
 summary.f1.getMaxLength();
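"Single pass" means each statistic is updated as elements stream by, without storing the data. As a hedged illustration, the plain-Java sketch below tracks a running mean and standard deviation with Welford's online algorithm and the maximum string length with a running maximum. The class is hypothetical, it is not Flink's aggregator code, and the use of the sample standard deviation (dividing by n - 1) is an assumption.

```java
// Plain-Java sketch with no Flink dependency; all names are hypothetical.
final class SinglePassStats {
    private long count;      // numeric values seen so far
    private double mean;     // running mean
    private double m2;       // running sum of squared deviations from the mean
    private int maxLength;   // longest string seen so far

    /** Welford update: adjusts mean and m2 using only the new value. */
    void addNumber(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    void addString(String value) {
        maxLength = Math.max(maxLength, value.length());
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation; NaN when fewer than two values were added. */
    double standardDeviation() {
        return count < 2 ? Double.NaN : Math.sqrt(m2 / (count - 1));
    }

    int maxLength() {
        return maxLength;
    }
}
```

Feeding the values 2, 4, 4, 4, 5, 5, 7, 9 through addNumber gives a mean of 5 and a sample standard deviation of sqrt(32 / 7), up to floating-point rounding.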
Throws: Exception

@Deprecated public static <T> Utils.ChecksumHashCode checksumHashCode(DataSet<T> input) throws Exception
Deprecated. Replaced with org.apache.flink.graph.asm.dataset.ChecksumHashCode in Gelly.
Throws: Exception

public static int getBitSize(long value)
Copyright © 2014–2021 The Apache Software Foundation. All rights reserved.