Type Parameters:
T - The type of the sampler.
@Internal public class ReservoirSamplerWithoutReplacement<T> extends DistributedRandomSampler<T>
The algorithm is implemented using the DistributedRandomSampler interface. In the first phase, we generate a random number as the weight of each element and select the top K elements as the output of each partition. In the second phase, we select the top K elements from all the outputs of the first phase.
This implementation refers to the algorithm described in "Optimal Random Sampling from Distributed Streams Revisited".
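The two phases described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Flink source: the class `TwoPhaseReservoirSketch`, the `Weighted` holder (a hypothetical stand-in for `IntermediateSampleData`), and the helpers `newHeap` and `offer` are invented for this example. The method names `sampleInPartition` and `sampleInCoordinator` only echo the names listed on this page; their bodies and extra parameters here are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

final class TwoPhaseReservoirSketch {

    /** Hypothetical holder pairing an element with its random weight. */
    static final class Weighted<T> {
        final double weight;
        final T element;

        Weighted(double weight, T element) {
            this.weight = weight;
            this.element = element;
        }
    }

    /** Min-heap on weight, so the lightest retained item sits at the head. */
    static <T> PriorityQueue<Weighted<T>> newHeap() {
        return new PriorityQueue<>(Comparator.comparingDouble((Weighted<T> w) -> w.weight));
    }

    /** Keeps at most k items, evicting the lightest when a heavier one arrives. */
    static <T> void offer(PriorityQueue<Weighted<T>> heap, Weighted<T> item, int k) {
        if (heap.size() < k) {
            heap.add(item);
        } else if (k > 0 && item.weight > heap.peek().weight) {
            heap.poll();
            heap.add(item);
        }
    }

    /** Phase 1: one pass over a partition, weighting each element randomly. */
    static <T> List<Weighted<T>> sampleInPartition(Iterator<T> input, int k, Random random) {
        PriorityQueue<Weighted<T>> heap = newHeap();
        while (input.hasNext()) {
            Weighted<T> item = new Weighted<>(random.nextDouble(), input.next());
            offer(heap, item, k);
        }
        return new ArrayList<>(heap);
    }

    /** Phase 2: merge all partition outputs and keep the overall top k. */
    static <T> List<T> sampleInCoordinator(List<List<Weighted<T>>> partitions, int k) {
        PriorityQueue<Weighted<T>> heap = newHeap();
        for (List<Weighted<T>> partition : partitions) {
            for (Weighted<T> item : partition) {
                offer(heap, item, k);
            }
        }
        List<T> result = new ArrayList<>();
        for (Weighted<T> item : heap) {
            result.add(item.element);
        }
        return result;
    }

    public static void main(String[] args) {
        Random random = new Random(7L); // fixed seed only to make runs repeatable
        List<List<Weighted<Integer>>> partitions = new ArrayList<>();
        partitions.add(sampleInPartition(Arrays.asList(1, 2, 3, 4, 5).iterator(), 3, random));
        partitions.add(sampleInPartition(Arrays.asList(6, 7, 8, 9, 10).iterator(), 3, random));
        System.out.println(sampleInCoordinator(partitions, 3));
    }
}
```

Because each element's weight is drawn independently of the element itself, the K heaviest elements overall form a uniform sample without replacement, and any element in the global top K is necessarily in the top K of its own partition, which is why the first phase can safely discard everything else.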
Fields inherited from class DistributedRandomSampler: emptyIntermediateIterable, numSamples
Fields inherited from class RandomSampler: emptyIterable, EPSILON
Constructor and Description |
---|
ReservoirSamplerWithoutReplacement(int numSamples): Create a new sampler with the given reservoir size and a default random number generator. |
ReservoirSamplerWithoutReplacement(int numSamples, long seed): Create a new sampler with the given reservoir size and a seed for the random number generator. |
ReservoirSamplerWithoutReplacement(int numSamples, Random random): Create a new sampler with the given reservoir size and a supplied random number generator. |
Modifier and Type | Method and Description |
---|---|
Iterator<IntermediateSampleData<T>> | sampleInPartition(Iterator<T> input): Sample algorithm for the first phase. |
Methods inherited from class DistributedRandomSampler: sample, sampleInCoordinator
public ReservoirSamplerWithoutReplacement(int numSamples, Random random)
numSamples - Maximum number of samples to retain in the reservoir; must be non-negative.
random - Instance of the random number generator used for sampling.
public ReservoirSamplerWithoutReplacement(int numSamples)
numSamples - Maximum number of samples to retain in the reservoir; must be non-negative.
public ReservoirSamplerWithoutReplacement(int numSamples, long seed)
numSamples - Maximum number of samples to retain in the reservoir; must be non-negative.
seed - Random number generator seed.
public Iterator<IntermediateSampleData<T>> sampleInPartition(Iterator<T> input)
Specified by: sampleInPartition in class DistributedRandomSampler<T>
input - The DataSet input of each partition.
Copyright © 2014–2019 The Apache Software Foundation. All rights reserved.