# MinHash

### Related Doc: package minhash

#### object MinHash extends Serializable

This object provides methods for estimating pair-wise Jaccard similarity using MinHash signatures. The algorithm is described in chapter 3 of:

Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
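
As a rough illustration of the idea behind MinHash signatures, the sketch below builds signatures for small integer sets and estimates their Jaccard similarity from the fraction of matching signature positions. It is plain Scala with no Spark dependency and is not this package's implementation: the object `MinHashSketch`, the linear hash family `(a * x + b) mod Prime`, and all method names are assumptions made for the example.

```scala
import scala.util.Random

// Hypothetical, self-contained sketch; not the implementation used by this package.
object MinHashSketch {
  // Prime modulus for the illustrative hash family (2^31 - 1).
  val Prime: Long = 2147483647L

  /** Draws `n` random (a, b) coefficient pairs for h(x) = (a * x + b) mod Prime. */
  def hashFamily(n: Int, seed: Long): Seq[(Long, Long)] = {
    val rng = new Random(seed)
    Seq.fill(n)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))
  }

  /** For each hash function, keeps the minimum hash over the (non-empty) set. */
  def signature(items: Set[Int], family: Seq[(Long, Long)]): Array[Long] =
    family.map { case (a, b) =>
      items.map(x => java.lang.Math.floorMod(a * x + b, Prime)).min
    }.toArray

  /** Fraction of signature positions that agree; this estimates Jaccard similarity. */
  def estimate(s1: Array[Long], s2: Array[Long]): Double =
    s1.zip(s2).count { case (x, y) => x == y }.toDouble / s1.length

  /** Exact Jaccard similarity, for comparison with the estimate. */
  def jaccard(a: Set[Int], b: Set[Int]): Double =
    (a intersect b).size.toDouble / (a union b).size
}
```

Longer signatures generally give tighter estimates at the cost of more hashing work, which is the trade-off the `signatureLength` parameter of the methods below controls.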

Linear Supertypes
Serializable, Serializable, AnyRef, Any

### Value Members

1. #### final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
2. #### final def ##(): Int

Definition Classes
AnyRef → Any
3. #### final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
4. #### def approximateMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, bands: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

Implements an approximate pair-wise MinHash similarity check. Here, "approximate" describes the all-pairs comparison, not the similarity score: MinHash signature comparison always approximates Jaccard similarity, and this method additionally uses a locality-sensitive hashing (LSH) approach to reduce the number of comparisons required.

We use the LSH technique described in section 3.4.1 of the Ullman text. This technique splits each signature into _b_ bands, which divide the hashing space. For a MinHash signature of length _l_, we require _b_ * _r_ = _l_, where _r_ is the number of rows in each band. For given _b_ and _r_, we expect to compare most pairs of elements whose similarity is greater than approximately (1/_b_)^(1/_r_).

T

This function will operate on RDDs containing any type T that extends the MinHashable trait.

rdd

The RDD of data points to compute similarity on.

signatureLength

The length of MinHash signature to use.

bands

The number of bands to use for LSH.

randomSeed

An optional seed for random number generation.

returns

Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

Exceptions thrown

`IllegalArgumentException` Thrown if the number of bands does not evenly divide the signature length.
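
The relationship between `signatureLength`, `bands`, and the similarity threshold described for this method can be worked through with a small sketch. The object `LshBanding` and its methods are hypothetical helpers for illustration only. The divisibility check uses Scala's `require`, which throws `IllegalArgumentException`; whether the library performs its check the same way is an assumption.

```scala
// Hypothetical helpers illustrating the banding arithmetic; not part of this package.
object LshBanding {
  /** Rows per band (r = l / b); rejects band counts that do not divide the length. */
  def rowsPerBand(signatureLength: Int, bands: Int): Int = {
    require(bands > 0 && signatureLength % bands == 0,
      s"bands ($bands) must evenly divide signatureLength ($signatureLength)")
    signatureLength / bands
  }

  /** Approximate similarity threshold (1/b)^(1/r). */
  def threshold(bands: Int, rows: Int): Double =
    math.pow(1.0 / bands, 1.0 / rows)

  /** Probability that a pair with similarity s agrees on at least one whole band. */
  def candidateProbability(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)

  /** Splits a signature into (bandIndex, rowsInBand) keys suitable for bucketing. */
  def bandKeys(sig: Seq[Long], bands: Int): Seq[(Int, Seq[Long])] = {
    val r = rowsPerBand(sig.length, bands)
    sig.grouped(r).zipWithIndex.map { case (rows, i) => (i, rows) }.toList
  }
}
```

For example, a signature of length 64 split into 16 bands gives 4 rows per band and a threshold of (1/16)^(1/4) = 0.5. Using more bands with fewer rows lowers the threshold, so more pairs are compared; fewer bands with more rows raises it.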

5. #### final def asInstanceOf[T0]: T0

Definition Classes
Any
6. #### def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
7. #### final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
8. #### def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
9. #### def exactMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

Implements an exact pair-wise MinHash similarity check. Here, "exact" describes the all-pairs comparison, not the similarity score: MinHash signature comparison still approximates Jaccard similarity, but this method compares _every_ pair of inputs rather than relying on a locality-sensitive hashing (LSH) based approximation.

T

This function will operate on RDDs containing any type T that extends the MinHashable trait.

rdd

The RDD of data points to compute similarity on.

signatureLength

The length of MinHash signature to use.

randomSeed

An optional seed for random number generation.

returns

Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

Note

This operation may be expensive, as it performs a cartesian product of all elements in the input RDD.
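
To give a feel for why the cost grows quickly, the sketch below performs an all-pairs signature comparison over a local collection. It is a hypothetical stand-in that assumes signatures have already been computed, and the names `AllPairs`, `similarity`, and `allPairs` are made up for the example. It emits each unordered pair once, which for _n_ elements is _n_(_n_ - 1)/2 comparisons. Whether the library's cartesian product also emits self-pairs or both orderings is not stated here.

```scala
// Hypothetical local-collection sketch of an all-pairs comparison; not this package's code.
object AllPairs {
  /** Fraction of matching positions between two equal-length signatures. */
  def similarity(s1: Seq[Long], s2: Seq[Long]): Double =
    s1.zip(s2).count { case (x, y) => x == y }.toDouble / s1.length

  /** Compares every unordered pair once, yielding (similarity, (elem1, elem2)). */
  def allPairs[T](items: Seq[(T, Seq[Long])]): Seq[(Double, (T, T))] =
    for {
      i <- items.indices
      j <- (i + 1) until items.length
    } yield {
      val (a, sa) = items(i)
      val (b, sb) = items(j)
      (similarity(sa, sb), (a, b))
    }
}
```

The output shape mirrors the `(similarity, (elem1, elem2))` tuples returned by this method, so the quadratic number of result rows is visible directly in the size of the returned sequence.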

10. #### def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
11. #### final def getClass(): Class[_]

Definition Classes
AnyRef → Any
12. #### def hashCode(): Int

Definition Classes
AnyRef → Any
13. #### final def isInstanceOf[T0]: Boolean

Definition Classes
Any
14. #### final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
15. #### final def notify(): Unit

Definition Classes
AnyRef
16. #### final def notifyAll(): Unit

Definition Classes
AnyRef
17. #### final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
18. #### def toString(): String

Definition Classes
AnyRef → Any
19. #### final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
20. #### final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
21. #### final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )