com.twitter.algebird

MinHasher

abstract class MinHasher[H] extends Monoid[Array[Byte]]

Instances of MinHasher can create, combine, and compare fixed-sized signatures of arbitrarily sized sets.

A signature is represented by a byte array of approximately maxBytes in size. You can initialize a signature with a single element, usually a Long or String. You can combine the signatures of any two sets to produce the signature of their union. You can compare the signatures of any two sets to estimate their Jaccard similarity. You can use a set's signature to estimate the number of distinct values in the set. You can also combine these operations to estimate the size of the intersection of two sets from their signatures. The more bytes in the signature, the more accurate all of these estimates will be.
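The operations described above can be illustrated with a toy model. This is a minimal sketch of the general min-hash idea, not the implementation behind this class: the object name MinHashSketch, the Array[Int] representation, the numHashes value, and the use of scala.util.hashing.MurmurHash3 with the position index as seed are all assumptions made for illustration.

```scala
import scala.util.hashing.MurmurHash3

// Toy min-hash model (hypothetical; not Algebird's byte-array encoding).
object MinHashSketch {
  // Assumed number of hash values kept per signature.
  val numHashes = 128

  // Signature of a single element: one hash value per seeded hash function.
  def init(value: String): Array[Int] =
    Array.tabulate(numHashes)(seed => MurmurHash3.stringHash(value, seed))

  // Set union: keep the smaller hash value at each position.
  def plus(left: Array[Int], right: Array[Int]): Array[Int] =
    left.zip(right).map { case (l, r) => math.min(l, r) }

  // Jaccard estimate: fraction of positions where the two signatures agree.
  def similarity(left: Array[Int], right: Array[Int]): Double =
    left.zip(right).count { case (l, r) => l == r }.toDouble / numHashes

  // Signature of a non-empty set: union of the single-element signatures.
  def signature(set: Set[String]): Array[Int] =
    set.toSeq.map(init).reduce(plus)
}
```

For Set("a", "b", "c") and Set("b", "c", "d") the true Jaccard similarity is 2/4 = 0.5, and the estimate from similarity should fall near that value, with the error shrinking as numHashes grows. An intersection-size estimate then follows by multiplying the similarity estimate by an estimate of the size of the union.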

You can also use these signatures to quickly find similar sets without doing n^2 comparisons. Each signature is assigned to several buckets; sets whose signatures end up in the same bucket are likely to be similar. The targetThreshold controls the desired level of similarity - the higher the threshold, the more efficiently you can find all the similar sets.
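The bucketing scheme follows the banding technique from the Mining of Massive Datasets chapter cited on this page: a signature of numBands * numRows hashes is split into bands, and two sets become candidates if any band matches exactly. The sketch below writes out the standard formulas from that chapter. The function names mirror members listed on this page, but taking numBands and numRows as explicit parameters, and the assumption that the members compute exactly these expressions, are illustrative only.

```scala
// Banding formulas (a sketch under the assumptions stated above).
object BandingSketch {
  // Probability that two sets with Jaccard similarity `sim` share at least
  // one bucket: 1 - (1 - sim^numRows)^numBands.
  def probabilityOfInclusion(sim: Double, numBands: Int, numRows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(sim, numRows), numBands)

  // Approximate similarity at which that probability rises most steeply:
  // (1 / numBands)^(1 / numRows).
  def estimatedThreshold(numBands: Int, numRows: Int): Double =
    math.pow(1.0 / numBands, 1.0 / numRows)
}
```

With 20 bands of 5 rows, for example, the threshold is roughly 0.55, and a pair with similarity 0.8 shares a bucket with probability above 0.999.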

This abstract superclass is generic with regard to the size of the hash used. Depending on the number of unique values in the domain of the sets, you may want a MinHasher16, a MinHasher32, or a new custom subclass.
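A typical use of a concrete subclass might look like the following sketch. It assumes Algebird is on the classpath and that MinHasher32 accepts the same (targetThreshold, maxBytes) arguments as the constructor documented on this page. The threshold 0.5, the 1024-byte budget, and the helper name signature are arbitrary choices for illustration.

```scala
import com.twitter.algebird.MinHasher32

object MinHasherUsage {
  // Assumed constructor arguments: targetThreshold = 0.5, maxBytes = 1024.
  val hasher = new MinHasher32(0.5, 1024)

  // Hypothetical helper: signature of a non-empty collection, built by
  // initializing one signature per element and combining them with plus.
  def signature(items: Seq[String]): Array[Byte] =
    items.map(s => hasher.init(s)).reduce((l, r) => hasher.plus(l, r))

  val a: Array[Byte] = signature(Seq("apple", "banana", "cherry"))
  val b: Array[Byte] = signature(Seq("banana", "cherry", "date"))

  // Signature of the union of the two sets.
  val union: Array[Byte] = hasher.plus(a, b)

  // Estimated Jaccard similarity of the two sets.
  val estimate: Double = hasher.similarity(a, b)

  // The two sets are candidate neighbours if they share any bucket key.
  val shareBucket: Boolean =
    hasher.buckets(a).intersect(hasher.buckets(b)).nonEmpty
}
```

In a larger job, the bucket keys would usually serve as grouping keys, so that only signatures landing in the same bucket are compared with similarity.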

This implementation is modeled after Chapter 3 of Ullman and Rajaraman's Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

Linear Supertypes
Monoid[Array[Byte]], Semigroup[Array[Byte]], Serializable, AnyRef, Any

Instance Constructors

  1. new MinHasher(targetThreshold: Double, maxBytes: Int)(implicit n: Numeric[H])

Abstract Value Members

  1. abstract def buildArray(left: Array[Byte], right: Array[Byte])(fn: (H, H) ⇒ H): Array[Byte]

    Decode two signatures into hash values, combine them somehow, and produce a new array

  2. abstract def buildArray(fn: ⇒ H): Array[Byte]

    Initialize a byte array by generating hash values

  3. abstract def hashSize: Int

    the number of bytes used for each hash in the signature

  4. abstract def maxHash: H

    Maximum value the hash can take on (not necessarily 2^(8 * hashSize) - 1, because of signed types)

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def assertNotZero(v: Array[Byte]): Unit

    Definition Classes
    Monoid
  8. def buckets(sig: Array[Byte]): List[Long]

    Bucket keys to use for quickly finding other similar items via locality sensitive hashing

  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws()
  10. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  12. val estimatedThreshold: Double

    useful for understanding the effects of numBands and numRows

  13. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws()
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  16. val hashFunctions: IndexedSeq[MurmurHash128]

    We always use a 128-bit hash function, so the number of hash functions differs from (and is usually smaller than) the number of hashes in the signature.

  17. def init(fn: (MurmurHash128) ⇒ (Long, Long)): Array[Byte]

    Create a signature for an arbitrary value

  18. def init(value: String): Array[Byte]

    Create a signature for a single String value

  19. def init(value: Long): Array[Byte]

    Create a signature for a single Long value

  20. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  21. def isNonZero(v: Array[Byte]): Boolean

    Definition Classes
    Monoid → Semigroup
  22. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  23. def nonZeroOption(v: Array[Byte]): Option[Array[Byte]]

    Definition Classes
    Monoid
  24. final def notify(): Unit

    Definition Classes
    AnyRef
  25. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  26. val numBands: Int

    For an explanation of the "bands" and "rows", see Ullman and Rajaraman

  27. val numBytes: Int

  28. val numHashes: Int

  29. val numRows: Int

  30. def pickBands(threshold: Double, hashes: Int): Int

    numerically solve the inverse of estimatedThreshold, given numBands*numRows

  31. def plus(left: Array[Byte], right: Array[Byte]): Array[Byte]

    Set union

    Definition Classes
    MinHasher → Semigroup
  32. def probabilityOfInclusion(sim: Double): Double

    useful for understanding the effects of numBands and numRows

  33. val seed: Int

    This seed could be anything

  34. def similarity(left: Array[Byte], right: Array[Byte]): Double

    Estimate Jaccard similarity (size of intersection / size of union)

  35. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  36. def toString(): String

    Definition Classes
    AnyRef → Any
  37. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
  38. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
  39. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws()
  40. val zero: Array[Byte]

    Signature for empty set, needed to be a proper Monoid

    Definition Classes
    MinHasher → Monoid

Deprecated Value Members

  1. def sum(vs: TraversableOnce[Array[Byte]]): Array[Byte]

    Definition Classes
    Monoid
    Annotations
    @deprecated
    Deprecated

    Just use Monoid.sum
