T - The type of the sampler.

@Internal
public class ReservoirSamplerWithoutReplacement<T> extends DistributedRandomSampler<T>

The algorithm is implemented using the DistributedRandomSampler interface. In
 the first phase, we generate a random number as the weight of each element and select the top K
 elements as the output of each partition. In the second phase, we select the top K elements from all
 the outputs of the first phase.
 This implementation refers to the algorithm described in "Optimal Random Sampling from Distributed Streams Revisited".
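The two phases described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Flink's implementation: the class TwoPhaseReservoirSketch, the helper type Weighted, and the methods firstPhase, secondPhase and topK are hypothetical names introduced here, and the size-K min-heap is just one assumed way of keeping the K largest weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical, self-contained sketch; none of these names are Flink classes.
final class TwoPhaseReservoirSketch {

    /** An element paired with its random weight (hypothetical helper type). */
    static final class Weighted<T> {
        final double weight;
        final T element;

        Weighted(double weight, T element) {
            this.weight = weight;
            this.element = element;
        }
    }

    /** Keeps the k items with the largest weights, using a min-heap of size k. */
    private static <T> List<Weighted<T>> topK(Iterator<Weighted<T>> items, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        Comparator<Weighted<T>> byWeight = Comparator.comparingDouble((Weighted<T> w) -> w.weight);
        PriorityQueue<Weighted<T>> heap = new PriorityQueue<>(Math.max(1, k), byWeight);
        while (k > 0 && items.hasNext()) {
            Weighted<T> next = items.next();
            if (heap.size() < k) {
                heap.add(next);
            } else if (next.weight > heap.peek().weight) {
                heap.poll();      // evict the current smallest weight
                heap.add(next);
            }
        }
        return new ArrayList<>(heap);
    }

    /** Phase 1: weight every element of one partition and keep its top k. */
    static <T> List<Weighted<T>> firstPhase(Iterator<T> input, int k, Random random) {
        Iterator<Weighted<T>> weighted = new Iterator<Weighted<T>>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public Weighted<T> next() {
                return new Weighted<>(random.nextDouble(), input.next());
            }
        };
        return topK(weighted, k);
    }

    /** Phase 2: merge all partition outputs and keep the overall top k. */
    static <T> List<T> secondPhase(List<List<Weighted<T>>> partitionOutputs, int k) {
        List<Weighted<T>> all = new ArrayList<>();
        for (List<Weighted<T>> output : partitionOutputs) {
            all.addAll(output);
        }
        List<T> result = new ArrayList<>();
        for (Weighted<T> w : topK(all.iterator(), k)) {
            result.add(w.element);
        }
        return result;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        List<List<Weighted<Integer>>> outputs = new ArrayList<>();
        outputs.add(firstPhase(Arrays.asList(1, 2, 3, 4, 5).iterator(), 3, random));
        outputs.add(firstPhase(Arrays.asList(6, 7, 8, 9, 10).iterator(), 3, random));
        // Prints three distinct values drawn from 1..10.
        System.out.println(secondPhase(outputs, 3));
    }
}
```

Because the second phase only needs each partition's K largest weights, a partition never has to forward more than K elements, however large its input is.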
Fields inherited from class DistributedRandomSampler: emptyIntermediateIterable, numSamples
Fields inherited from class RandomSampler: emptyIterable, EPSILON

| Constructor and Description |
|---|
| ReservoirSamplerWithoutReplacement(int numSamples): Create a new sampler with reservoir size and a default random number generator. |
| ReservoirSamplerWithoutReplacement(int numSamples, long seed): Create a new sampler with reservoir size and the seed for random number generator. |
| ReservoirSamplerWithoutReplacement(int numSamples, Random random): Create a new sampler with reservoir size and a supplied random number generator. |

| Modifier and Type | Method and Description |
|---|---|
| Iterator<IntermediateSampleData<T>> | sampleInPartition(Iterator<T> input): Sample algorithm for the first phase. |
Methods inherited from class DistributedRandomSampler: sample, sampleInCoordinator

public ReservoirSamplerWithoutReplacement(int numSamples, Random random)
numSamples - Maximum number of samples to retain in reservoir, must be non-negative.
random - Instance of random number generator for sampling.

public ReservoirSamplerWithoutReplacement(int numSamples)
numSamples - Maximum number of samples to retain in reservoir, must be non-negative.

public ReservoirSamplerWithoutReplacement(int numSamples, long seed)
numSamples - Maximum number of samples to retain in reservoir, must be non-negative.
seed - Random number generator seed.

public Iterator<IntermediateSampleData<T>> sampleInPartition(Iterator<T> input)
Description copied from class DistributedRandomSampler: Sample algorithm for the first phase.
Specified by: sampleInPartition in class DistributedRandomSampler<T>
input - The DataSet input of each partition.

Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.