Class AbstractHashSampler

  • All Implemented Interfaces:
    Sampler
    Direct Known Subclasses:
    RowColumnSampler, RowSampler

    public abstract class AbstractHashSampler
    extends Object
    implements Sampler
    A base class that can be used to create Samplers based on hashing. This class offers consistent options for configuring the hash function. The subclass decides which parts of the key to hash.

    This class supports two options, which are passed to init(SamplerConfiguration). The first option is hasher, which specifies the hashing algorithm. Valid values for this option are md5, sha1, and murmur3_32. If you are unsure, choose murmur3_32.

    The second option is modulus, whose value can be any positive integer.

    Any data where hash(data) % modulus == 0 will be selected for the sample.
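
    The selection rule above can be sketched with the JDK alone. This is a minimal illustration, not the class's actual implementation: the class name ModulusSelectionSketch, the use of MD5 through MessageDigest, and the folding of the first four digest bytes into an int are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the selection rule; not the Accumulo implementation.
final class ModulusSelectionSketch {

    // Hashes the data with MD5 and folds the first four digest bytes into an int.
    // The folding step is an assumption made for this sketch.
    static int hash(byte[] data) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(data);
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Applies the documented rule: select when hash(data) % modulus == 0.
    static boolean selected(byte[] data, int modulus) {
        if (modulus <= 0) {
            throw new IllegalArgumentException("modulus must be positive");
        }
        return hash(data) % modulus == 0;
    }

    public static void main(String[] args) {
        int modulus = 100;
        int total = 100_000;
        int kept = 0;
        for (int i = 0; i < total; i++) {
            byte[] row = ("row" + i).getBytes(StandardCharsets.UTF_8);
            if (selected(row, modulus)) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of " + total);
    }
}
```

    With a well-distributed hash, roughly one in modulus items is selected, so a larger modulus gives a smaller sample.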

    Since:
    1.8.0
    • Constructor Detail

      • AbstractHashSampler

        public AbstractHashSampler()
    • Method Detail

      • isValidOption

        protected boolean isValidOption(String option)
        Subclasses with options should override this method and return true if the option is valid for the subclass or if super.isValidOption(option) returns true.
      • init

        public void init(SamplerConfiguration config)
        Subclasses with options should override this method and call super.init(config).
        Specified by:
        init in interface Sampler
        Parameters:
        config - Configuration options for a sampler.
      • hash

        protected abstract void hash(DataOutput hasher,
                                     Key k)
                              throws IOException
        Subclasses must override this method and hash some portion of the key.
        Parameters:
        hasher - Data written to this will be used to compute the hash for the key.
        k - The key, some portion of which is written to hasher.
        Throws:
        IOException
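
        The idea of hashing only a portion of the key by writing it to a DataOutput can be sketched as follows. This is a self-contained illustration under stated assumptions: SimpleKey is a hypothetical stand-in for Key, and the DataOutput is backed by an MD5 DigestOutputStream purely for the example; the real hasher is supplied by the base class.

```java
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class HashPortionSketch {

    // Hypothetical stand-in for Key, holding only a row and a column.
    static final class SimpleKey {
        final String row;
        final String column;

        SimpleKey(String row, String column) {
            this.row = row;
            this.column = column;
        }
    }

    // Mirrors the shape of hash(DataOutput, Key): only the row is written,
    // so the column has no influence on the resulting hash.
    static void hash(DataOutput hasher, SimpleKey k) throws IOException {
        hasher.write(k.row.getBytes(StandardCharsets.UTF_8));
    }

    // Feeds everything written to the DataOutput into an MD5 digest.
    static byte[] digest(SimpleKey k) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        OutputStream sink = new OutputStream() {
            @Override
            public void write(int b) {
                // Discard the bytes; only the digest is of interest.
            }
        };
        try (DataOutputStream out = new DataOutputStream(new DigestOutputStream(sink, md))) {
            hash(out, k);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] a = digest(new SimpleKey("user123", "name"));
        byte[] b = digest(new SimpleKey("user123", "email"));
        // Same row, different column: the digests match because only the row was hashed.
        System.out.println(Arrays.equals(a, b));
    }
}
```

        Because the sketch writes only the row, every key in a given row receives the same hash and therefore the same sampling decision. Writing more of the key would make the decision depend on those parts as well.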
      • accept

        public boolean accept(Key k)
        Specified by:
        accept in interface Sampler
        Parameters:
        k - A key that was written to an rfile.
        Returns:
        true if the key (and its associated value) should be stored in the rfile's sample; false if it should not be included.
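
        The contract described for isValidOption and init, where a subclass accepts its own options and defers to the superclass for the rest, can be sketched with stand-in types. Everything named here is hypothetical: BaseSampler stands in for this class, a Map of strings stands in for SamplerConfiguration, and prefixLength is an invented subclass option. The sketch also assumes that the base init rejects option names for which isValidOption returns false, which this page does not state.

```java
import java.util.Map;
import java.util.Set;

final class OptionChainSketch {

    // Hypothetical stand-in for the base class; accepts the two documented options.
    static class BaseSampler {
        private static final Set<String> BASE_OPTIONS = Set.of("hasher", "modulus");
        protected int modulus;

        protected boolean isValidOption(String option) {
            return BASE_OPTIONS.contains(option);
        }

        public void init(Map<String, String> options) {
            // Assumption: unknown option names are rejected up front.
            for (String name : options.keySet()) {
                if (!isValidOption(name)) {
                    throw new IllegalArgumentException("Unknown option: " + name);
                }
            }
            modulus = Integer.parseInt(options.get("modulus"));
        }
    }

    // Hypothetical subclass that adds one option of its own.
    static class PrefixSampler extends BaseSampler {
        protected int prefixLength;

        @Override
        protected boolean isValidOption(String option) {
            // Valid if the subclass knows the option or the superclass does.
            return "prefixLength".equals(option) || super.isValidOption(option);
        }

        @Override
        public void init(Map<String, String> options) {
            super.init(options); // let the base class read hasher and modulus
            prefixLength = Integer.parseInt(options.getOrDefault("prefixLength", "4"));
        }
    }

    public static void main(String[] args) {
        PrefixSampler s = new PrefixSampler();
        s.init(Map.of("hasher", "murmur3_32", "modulus", "1009", "prefixLength", "8"));
        System.out.println(s.modulus + " " + s.prefixLength);
    }
}
```

        Calling super.isValidOption and super.init keeps the base options working in the subclass without the subclass having to know their names.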