minimum similarity two items must have; pairs below this threshold are discarded from the result set
number of random vectors (hyperplanes) used to generate bit vectors of length d
beam factor, i.e. how many neighbours are considered in the sliding window
number of times bitsets are permuted
Compares two bit sets for equality
Creates a sliding window
Generates a random permutation of size n
Returns the Hamming distance between two bit vectors
Approximates the cosine distance of two bit sets using their Hamming distance
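A minimal sketch of the two helpers described above, assuming bit vectors are plain Python lists of 0/1 values of equal length. The function names are hypothetical and are not taken from the original code. The estimate uses the relation from the random-hyperplane scheme, where the fraction of differing bits approximates the angle divided by pi. Whether the original returns a similarity or a distance is not stated here; this sketch returns the similarity estimate.

```python
import math


def hamming_distance(a, b):
    """Count the positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def approximate_cosine(a, b):
    """Estimate the cosine of the angle between the original vectors.

    Assumption: the fraction of differing bits approximates theta / pi,
    so cos(theta) is estimated as cos(pi * hamming / d).
    """
    d = len(a)
    theta = math.pi * hamming_distance(a, b) / d
    return math.cos(theta)
```

Identical bit vectors give an estimate of 1.0, and vectors that differ in half of their bits give an estimate close to 0.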
Returns a local k by d matrix with random Gaussian entries (mean = 0.0, standard deviation = 1.0)
This is a k by d matrix as it is multiplied by the input matrix
Converts a given input matrix to a bit set representation using random hyperplanes
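The conversion to bit sets can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: k is taken to be the number of input features and d the number of hyperplanes (bits), vectors are plain Python lists, a zero projection maps to bit 1, and all function names are hypothetical.

```python
import random


def random_matrix(k, d, rng):
    """Return a k-by-d list of lists with standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def to_bits(values):
    """Replace each value with its sign: 1 if non-negative, else 0."""
    return [1 if v >= 0 else 0 for v in values]


def signature(vector, hyperplanes):
    """Project a length-k vector onto d hyperplanes and keep the signs."""
    k = len(hyperplanes)
    d = len(hyperplanes[0])
    projections = [
        sum(vector[i] * hyperplanes[i][j] for i in range(k))
        for j in range(d)
    ]
    return to_bits(projections)
```

Because only the sign of each projection is kept, multiplying a vector by a positive constant leaves its signature unchanged, which is why the scheme captures the angle between vectors rather than their length.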
Generates all pairs and emits a pair if its cosine similarity is greater than minCosineSimilarity
Orders an RDD of signatures by their bit set representation
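A local, non-distributed sketch of how ordering and the sliding window might combine to produce candidate pairs. It assumes signatures are held in a dictionary from an identifier to a list of bits, that sorting is lexicographic on the bits, and that each item is compared with the next `beam` neighbours in sorted order. The name `candidate_pairs` and these details are assumptions; the original operates on an RDD.

```python
import math


def candidate_pairs(signatures, beam, min_cosine):
    """Sort signatures by their bits and compare each with its neighbours.

    signatures: dict mapping an identifier to a list of 0/1 bits.
    beam: number of following neighbours compared with each item (assumed).
    min_cosine: pairs with an estimated cosine at or below this are dropped.
    """
    ordered = sorted(signatures.items(), key=lambda item: item[1])
    results = []
    for i, (id_a, bits_a) in enumerate(ordered):
        for id_b, bits_b in ordered[i + 1 : i + 1 + beam]:
            differing = sum(x != y for x, y in zip(bits_a, bits_b))
            cosine = math.cos(math.pi * differing / len(bits_a))
            if cosine > min_cosine:
                results.append((id_a, id_b, cosine))
    return results
```

Sorting places signatures that share a long common prefix next to each other, so a small window is enough to find many similar pairs without comparing every pair.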
Permutes a bit set representation of a vector by a given permutation
Permutes a signature by a given permutation
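The permutation steps can be sketched as below, assuming a permutation is a list of indices, a bit set is a list of 0/1 values, and a signature is an (identifier, bits) tuple. The names and the use of a shuffle to draw the permutation are assumptions made for illustration; the original may generate permutations differently.

```python
import random


def random_permutation(n, rng):
    """Return the indices 0..n-1 in a random order."""
    permutation = list(range(n))
    rng.shuffle(permutation)
    return permutation


def permute_bits(bits, permutation):
    """Position i of the result takes the bit found at permutation[i]."""
    return [bits[p] for p in permutation]


def permute_signature(signature, permutation):
    """Apply the permutation to the bits and keep the identifier."""
    identifier, bits = signature
    return (identifier, permute_bits(bits, permutation))
```

Applying the same permutation to every signature before each sort changes which bits form the leading prefix, so repeated permutations give different pairs a chance to become neighbours.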
Draws a random number with mean 0 and standard deviation of 1
Converts a vector to a bit set by replacing all values of x with sign(x)
LSH implementation as described in 'Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering' by Ravichandran et al. See the original publication for a detailed description of the parameters.