Implements an approximate pair-wise MinHash similarity check. Approximate refers to "all-pairs", not "similarity"; MinHash signature comparison approximates Jaccard similarity. This method uses a locality sensitive hashing (LSH) based approach to reduce the number of comparisons required.
We use the LSH technique described in section 3.4.1 of the Rajaraman and Ullman text. This technique divides each MinHash signature into _b_ bands of _r_ rows each, and treats two elements as a candidate pair if their signatures agree on every row of at least one band. For a MinHash signature of length _l_, we require _b_ * _r_ = _l_. For given _b_ and _r_, we expect to compare most pairs of elements whose similarity is greater than approximately (1/_b_)^(1/_r_).
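The banding scheme above can be illustrated with a minimal sketch in plain Scala. It is not the implementation documented here: the object name `LshBandingSketch` and its methods are hypothetical, and it works on local arrays rather than RDDs. It shows the threshold (1/b)^(1/r), the standard probability 1 - (1 - s^r)^b that two elements with similarity s share at least one band, and how a signature might be split into band keys.

```scala
object LshBandingSketch {
  /** Approximate similarity threshold (1/b)^(1/r) for b bands of r rows each. */
  def threshold(bands: Int, rows: Int): Double =
    math.pow(1.0 / bands, 1.0 / rows)

  /** Probability that two elements with similarity s agree on at least one band. */
  def candidateProbability(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)

  /**
   * Splits a signature into (band index, band rows) keys. Elements whose
   * signatures produce an identical key would be compared with each other.
   * Hypothetical helper; throws IllegalArgumentException (via require) if the
   * number of bands does not divide the signature length evenly.
   */
  def bandKeys(signature: Array[Int], bands: Int): List[(Int, List[Int])] = {
    require(bands > 0 && signature.length % bands == 0,
      "The number of bands must divide the signature length evenly.")
    val rows = signature.length / bands
    signature.toSeq
      .grouped(rows)
      .zipWithIndex
      .map { case (band, idx) => (idx, band.toList) }
      .toList
  }
}
```

With a signature of length 100 split into 20 bands of 5 rows, `LshBandingSketch.threshold(20, 5)` is roughly 0.55. Increasing the number of bands lowers the threshold, so more pairs are compared.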
This function will operate on RDDs containing any type T that extends the MinHashable trait.
The RDD of data points to compute similarity on.
The length of MinHash signature to use.
The number of bands to use for locality sensitive hashing. Must divide evenly into the signature length.
An optional seed for random number generation.
Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).
IllegalArgumentException
Thrown if the number of bands does not divide evenly into the signature length.
Implements an exact pair-wise MinHash similarity check. Exact refers to "all-pairs", not "similarity"; MinHash signature comparison approximates Jaccard similarity, and this method _exactly_ compares all pairs of inputs, as opposed to locality sensitive hashing (LSH) based approximations.
This function will operate on RDDs containing any type T that extends the MinHashable trait.
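To make concrete how comparing MinHash signatures approximates Jaccard similarity, here is a minimal sketch in plain Scala. It does not use the MinHashable trait or RDDs; the object name `MinHashSketch`, its methods, and the choice of seeded MurmurHash3 as the hash family are all assumptions made for illustration.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  /** Exact Jaccard similarity: |A intersect B| / |A union B|. */
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  /**
   * MinHash signature of a non-empty set of strings. One seeded hash function
   * is derived per signature position; each position keeps the minimum hash
   * value seen over all items in the set.
   */
  def signature(items: Set[String], length: Int, seed: Long): Array[Int] = {
    require(items.nonEmpty, "Cannot compute a signature for an empty set.")
    val rng = new Random(seed)
    val hashSeeds = Array.fill(length)(rng.nextInt())
    hashSeeds.map(s => items.iterator.map(i => MurmurHash3.stringHash(i, s)).min)
  }

  /** Estimated similarity: the fraction of positions where two signatures agree. */
  def estimate(a: Array[Int], b: Array[Int]): Double = {
    require(a.length == b.length, "Signatures must have the same length.")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }
}
```

Both signatures must be built with the same seed and length so that each position uses the same hash function. Longer signatures give a tighter estimate at the cost of more hashing work.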
The RDD of data points to compute similarity on.
The length of MinHash signature to use.
An optional seed for random number generation.
Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).
This operation is expensive: it computes the Cartesian product of all elements in the input RDD, so the number of comparisons grows quadratically with the number of elements.
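The cost of the all-pairs comparison can be seen in a small sketch over a local collection. This is an illustration only: `AllPairsSketch` and `allPairs` are hypothetical names, a local `Seq` stands in for the RDD, and the sketch assumes each unordered pair is kept once, which the documentation above does not specify. The result uses the same (similarity, (elem1, elem2)) tuple shape.

```scala
object AllPairsSketch {
  /** Exact Jaccard similarity, used here as a stand-in similarity function. */
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  /**
   * Compares every element against every other element and keeps each
   * unordered pair once, yielding n * (n - 1) / 2 results for n elements.
   */
  def allPairs[A](elems: Seq[A])(sim: (A, A) => Double): Seq[(Double, (A, A))] =
    for {
      (a, i) <- elems.zipWithIndex
      (b, j) <- elems.zipWithIndex
      if i < j
    } yield (sim(a, b), (a, b))
}
```

Four input elements produce 6 comparisons and 1,000 produce 499,500, which is why the LSH-based variant is preferable for large inputs.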
This object presents several methods for determining approximate pair-wise Jaccard similarity through the use of MinHash signatures. A description of this algorithm can be found in chapter 3 of:
Rajaraman, Anand, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.
This chapter may be freely (and legally) downloaded from:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf