Object

org.bdgenomics.utils.minhash

MinHash

object MinHash extends Serializable

This object provides methods for estimating pair-wise Jaccard similarity using MinHash signatures. The algorithm is described in chapter 3 of:

Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.

This chapter may be freely (and legally) downloaded from:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
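As an illustration of the idea behind MinHash signatures, here is a minimal, Spark-free sketch. It is not the implementation used by this object: the name `MinHashSketch` and all of its methods are hypothetical, and seeding `MurmurHash3` once per signature position is an assumed stand-in for a family of independent hash functions.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical helper, for illustration only; not part of org.bdgenomics.utils.minhash.
object MinHashSketch {
  // Exact Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size.toDouble

  // One seed per signature position; each seed stands for one hash function.
  def seeds(signatureLength: Int, randomSeed: Long): Array[Int] = {
    val rng = new Random(randomSeed)
    Array.fill(signatureLength)(rng.nextInt())
  }

  // For every hash function, keep the minimum hash value over the set.
  def signature(items: Set[String], seeds: Array[Int]): Array[Int] = {
    require(items.nonEmpty, "cannot build a signature for an empty set")
    seeds.map(seed => items.iterator.map(i => MurmurHash3.stringHash(i, seed)).min)
  }

  // The fraction of positions where two signatures agree estimates Jaccard similarity.
  def estimate(s1: Array[Int], s2: Array[Int]): Double = {
    require(s1.length == s2.length, "signatures must have equal length")
    s1.zip(s2).count { case (x, y) => x == y }.toDouble / s1.length
  }
}
```

For two sets that share three of five distinct elements, `jaccard` returns 0.6, and `estimate` on their signatures should land near that value, with the spread narrowing as the signature length grows.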

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def approximateMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, bands: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

    Implements an approximate pair-wise MinHash similarity check. Here "approximate" describes the all-pairs comparison, not the similarity computation: every MinHash signature comparison already approximates Jaccard similarity, and this method additionally uses a locality-sensitive hashing (LSH) scheme so that not every pair has to be compared.

    We use the LSH technique described in section 3.4.1 of the Ullman text. This technique divides each signature into _b_ bands. For a MinHash signature of length _l_, we require _b_ * _r_ = _l_, where _r_ is the number of rows in each band. For given _b_ and _r_, pairs whose similarity exceeds roughly (1/_b_)^(1/_r_) are likely to be compared, while pairs well below that threshold are likely to be skipped.

    T

    This function will operate on RDDs containing any type T that extends the MinHashable trait.

    rdd

    The RDD of data points to compute similarity on.

    signatureLength

    The length of MinHash signature to use.

    bands

    The number of bands to use for LSH. Must divide signatureLength evenly.

    randomSeed

    An optional seed for random number generation.

    returns

    Returns an RDD of the element pairs that were compared, with their similarity, as a tuple of (similarity, (elem1, elem2)).

    Exceptions thrown

    IllegalArgumentException if the number of bands does not divide the signature length evenly.

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  9. def exactMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

    Implements an exact pair-wise MinHash similarity check. Here "exact" describes the all-pairs comparison, not the similarity computation: MinHash signature comparison still only approximates Jaccard similarity, but this method compares every pair of inputs rather than relying on a locality-sensitive hashing (LSH) approximation.

    T

    This function will operate on RDDs containing any type T that extends the MinHashable trait.

    rdd

    The RDD of data points to compute similarity on.

    signatureLength

    The length of MinHash signature to use.

    randomSeed

    An optional seed for random number generation.

    returns

    Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

    Note

    This operation may be expensive, as it performs a cartesian product of all elements in the input RDD, so the number of comparisons grows quadratically with the number of elements.

  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. final def notify(): Unit

    Definition Classes
    AnyRef
  16. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  17. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  18. def toString(): String

    Definition Classes
    AnyRef → Any
  19. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
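
The banding scheme described under approximateMinHash can be sketched without Spark as follows. This is an illustrative sketch, not the code behind that method: `LshBandingSketch` and its method names are hypothetical. The divisibility check mirrors the documented IllegalArgumentException, and the candidate probability 1 - (1 - s^r)^b is the standard banding formula from the Ullman text.

```scala
// Hypothetical helper, for illustration only; not part of org.bdgenomics.utils.minhash.
object LshBandingSketch {
  // Rows per band, r, such that b * r = l.
  // `require` throws IllegalArgumentException when the condition fails.
  def rowsPerBand(signatureLength: Int, bands: Int): Int = {
    require(signatureLength > 0 && bands > 0, "lengths must be positive")
    require(signatureLength % bands == 0,
      s"bands ($bands) must divide signature length ($signatureLength) evenly")
    signatureLength / bands
  }

  // Approximate similarity threshold (1/b)^(1/r).
  def threshold(signatureLength: Int, bands: Int): Double = {
    val r = rowsPerBand(signatureLength, bands)
    math.pow(1.0 / bands, 1.0 / r)
  }

  // Probability that two items with similarity s agree on every row of at least one band.
  def candidateProbability(s: Double, signatureLength: Int, bands: Int): Double = {
    val r = rowsPerBand(signatureLength, bands)
    1.0 - math.pow(1.0 - math.pow(s, r), bands)
  }

  // Split a signature into b bands of r rows each, keyed by band index.
  // Two items become a candidate pair when they produce the same key for some band.
  def bandKeys(signature: Array[Int], bands: Int): Seq[(Int, Seq[Int])] = {
    val r = rowsPerBand(signature.length, bands)
    signature.grouped(r).zipWithIndex
      .map { case (rows, band) => (band, rows.toSeq) }
      .toVector
  }
}
```

With a signature length of 64 and 16 bands, each band has 4 rows and `threshold` evaluates to approximately 0.5. Increasing the number of bands for a fixed signature length lowers the threshold, which yields more candidate pairs and more comparisons.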
