public final class DataSetUtils extends Object
Modifier and Type | Method and Description |
---|---|
static int | getBitSize(long value) |
static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction): Generate a sample of a DataSet in which each element is chosen with probability fraction. |
static <T> MapPartitionOperator<T,T> | sample(DataSet<T> input, boolean withReplacement, double fraction, long seed): Generate a sample of a DataSet in which each element is chosen with probability fraction. |
static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSample): Generate a sample of a DataSet that contains a fixed number of elements. |
static <T> DataSet<T> | sampleWithSize(DataSet<T> input, boolean withReplacement, int numSample, long seed): Generate a sample of a DataSet that contains a fixed number of elements. |
static <T> DataSet<Tuple2<Long,T>> | zipWithIndex(DataSet<T> input): Method that assigns a unique Long value to all elements in the input data set. |
static <T> DataSet<Tuple2<Long,T>> | zipWithUniqueId(DataSet<T> input): Method that assigns a unique Long value to all elements in the input data set in the following way: (1) a map function is applied to the input data set; (2) each map task holds a counter c which is increased for each record; (3) c is shifted by n bits, where n = log2(number of parallel tasks); (4) to create a unique ID among all tasks, the task id is added to the counter; (5) for each record, the resulting counter is collected. |
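The ID scheme described for zipWithUniqueId can be illustrated with a small standalone sketch. This is not Flink code: the class `UniqueIdSketch` and its methods are hypothetical names, and rounding n up to ceil(log2(number of parallel tasks)) when the task count is not a power of two is an assumption made here so that every task id fits in the low bits.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of the counter-shift ID scheme; hypothetical, not Flink code. */
final class UniqueIdSketch {

    /** Bits needed to hold any task id in [0, parallelism), i.e. ceil(log2(parallelism)). */
    static int shiftBits(int parallelism) {
        return 64 - Long.numberOfLeadingZeros(parallelism - 1L);
    }

    /** Shift the per-task counter left and place the task id in the freed low bits. */
    static long uniqueId(long counter, int shiftBits, int taskId) {
        return (counter << shiftBits) + taskId;
    }

    /** IDs one task would emit for its first {@code records} records. */
    static List<Long> idsForTask(int taskId, int parallelism, int records) {
        int shift = shiftBits(parallelism);
        List<Long> ids = new ArrayList<>();
        for (long counter = 0; counter < records; counter++) {
            ids.add(uniqueId(counter, shift, taskId));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Four parallel tasks, three records each.
        for (int task = 0; task < 4; task++) {
            System.out.println("task " + task + ": " + idsForTask(task, 4, 3));
        }
    }
}
```

With four tasks, task 0 emits 0, 4, 8 and task 1 emits 1, 5, 9. The values are unique across tasks without any coordination, but they are not consecutive, which is the difference from zipWithIndex.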
public static <T> DataSet<Tuple2<Long,T>> zipWithIndex(DataSet<T> input)
Method that assigns a unique Long value to all elements in the input data set. The generated values are consecutive.
Parameters:
input - the input data set

public static <T> DataSet<Tuple2<Long,T>> zipWithUniqueId(DataSet<T> input)
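Method that assigns a unique Long value to all elements in the input data set in the following way:
- a map function is applied to the input data set
- each map task holds a counter c which is increased for each record
- c is shifted by n bits, where n = log2(number of parallel tasks)
- to create a unique ID among all tasks, the task id is added to the counter
- for each record, the resulting counter is collected
Parameters:
input - the input data set

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction)
Generate a sample of a DataSet in which each element is chosen with probability fraction.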
Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen; must be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.

public static <T> MapPartitionOperator<T,T> sample(DataSet<T> input, boolean withReplacement, double fraction, long seed)
Generate a sample of a DataSet in which each element is chosen with probability fraction.
Parameters:
withReplacement - Whether an element can be selected more than once.
fraction - Probability that each element is chosen; must be in [0, 1] without replacement and in [0, ∞) with replacement. When fraction is larger than 1, each element is expected to be selected multiple times into the sample on average.
seed - Random number generator seed.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSample)
Generate a sample of a DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is less efficient than sampling with a fraction; use sample with a fraction unless you need an exact sample size.
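To make the idea of a fixed-size sample concrete, here is a minimal single-machine sketch using reservoir sampling (Algorithm R) without replacement. It is a stand-in technique chosen for illustration, not a description of how sampleWithSize is implemented; the class `ReservoirSketch` and its method are hypothetical names. It shows that an exact size requires keeping and updating a buffer of numSample elements while scanning the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical single-machine sketch of fixed-size sampling; not Flink code. */
final class ReservoirSketch {

    /** Returns min(numSample, input size) elements, each input element equally likely. */
    static <T> List<T> sampleWithSize(Iterable<T> input, int numSample, long seed) {
        if (numSample < 0) {
            throw new IllegalArgumentException("numSample must be non-negative");
        }
        Random rng = new Random(seed);
        List<T> reservoir = new ArrayList<>();
        int seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < numSample) {
                // Fill phase: the first numSample elements are always kept.
                reservoir.add(element);
            } else {
                // Replace phase: keep the new element with probability numSample / seen.
                int slot = rng.nextInt(seen);
                if (slot < numSample) {
                    reservoir.set(slot, element);
                }
            }
        }
        return reservoir;
    }
}
```

Calling `ReservoirSketch.sampleWithSize(data, 10, 42L)` on a list of 1000 elements always returns exactly 10 distinct elements, and the same seed yields the same sample.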
Parameters:
withReplacement - Whether an element can be selected more than once.
numSample - The expected sample size.

public static <T> DataSet<T> sampleWithSize(DataSet<T> input, boolean withReplacement, int numSample, long seed)
Generate a sample of a DataSet that contains a fixed number of elements.
NOTE: Sampling with a fixed size is less efficient than sampling with a fraction; use sample with a fraction unless you need an exact sample size.
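For comparison, fraction-based sampling needs only one independent random decision per element, which is why its result size is approximate rather than exact. The sketch below is a hypothetical standalone illustration, not Flink's implementation: without replacement it keeps each element with probability fraction, and with replacement it assumes a Poisson-distributed number of copies with mean fraction (a common choice, drawn here with Knuth's method, which suits small means).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical single-machine sketch of fraction-based sampling; not Flink code. */
final class FractionSampleSketch {

    static <T> List<T> sample(List<T> input, boolean withReplacement, double fraction, long seed) {
        if (fraction < 0 || (!withReplacement && fraction > 1)) {
            throw new IllegalArgumentException("fraction out of range: " + fraction);
        }
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T element : input) {
            // One independent decision per element; no global state is needed.
            int copies = withReplacement
                    ? poisson(fraction, rng)
                    : (rng.nextDouble() < fraction ? 1 : 0);
            for (int i = 0; i < copies; i++) {
                out.add(element);
            }
        }
        return out;
    }

    /** Draws a Poisson-distributed count with the given mean (Knuth's method). */
    static int poisson(double mean, Random rng) {
        double limit = Math.exp(-mean);
        double product = rng.nextDouble();
        int count = 0;
        while (product > limit) {
            count++;
            product *= rng.nextDouble();
        }
        return count;
    }
}
```

With 10,000 input elements and fraction 0.1, the sample holds roughly 1,000 elements, with the exact count varying by seed. Passing the same seed reproduces the same sample.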
Parameters:
withReplacement - Whether an element can be selected more than once.
numSample - The expected sample size.
seed - Random number generator seed.

public static int getBitSize(long value)
Copyright © 2014–2016 The Apache Software Foundation. All rights reserved.