lsh

Type Members

class BitSet extends Serializable

A simple, fixed-size bit set implementation. This implementation is fast because it avoids safety/bound checking.
trait Joiner extends AnyRef
class Lsh extends Joiner with Serializable

Lsh implementation as described in 'Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering' by Ravichandran et al. See original publication for a detailed description of the parameters.

See also
http://dl.acm.org/citation.cfm?id=1219917
class NearestNeighbours extends Joiner with Serializable

Brute force O(n^2) method to compute exact nearest neighbours.
As this is a very expensive O(n^2) computation, an additional sample parameter may be passed so that neighbours are computed only for a random fraction of the data.
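The quadratic cost comes from comparing every row with every other row. A minimal sketch in plain Scala, without Spark: `BruteForceSketch`, `cosine`, `allPairs` and the `threshold` parameter are hypothetical names chosen for illustration, not this package's API, and the sampling parameter is not modelled.

```scala
object BruteForceSketch {
  // Cosine similarity of two dense vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Compares each row with every later row: n * (n - 1) / 2 comparisons.
  // Keeps (i, j, similarity) for pairs whose similarity reaches the threshold.
  def allPairs(rows: IndexedSeq[Array[Double]], threshold: Double): Seq[(Int, Int, Double)] =
    for {
      i <- rows.indices
      j <- (i + 1) until rows.length
      sim = cosine(rows(i), rows(j))
      if sim >= threshold
    } yield (i, j, sim)
}
```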
class QueryHamming extends QueryJoiner with Serializable

Implementation based on approximated cosine distances. The cosine distances are approximated using Hamming distances, which are much faster to compute. Either the catalog matrix or the query matrix is broadcast. This implementation is therefore suited for tasks where one of the matrices is very small compared to the other, so that it can be broadcast.
trait QueryJoiner extends AnyRef
class QueryLsh extends QueryJoiner with Serializable

Standard Lsh implementation. The queryMatrix is hashed multiple times and exact hash matches are searched for in the dbMatrix. These candidates are used to compute the cosine distance.
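The candidate search described above can be sketched in plain Scala. This assumes each row has already been hashed once per round and that the hashes are represented as strings. `QueryLshSketch`, `candidates` and that representation are illustrative assumptions, not this package's API. The cosine distance would then be computed only for the returned pairs.

```scala
object QueryLshSketch {
  // queryHashes(q)(r) is the hash of query row q in hashing round r;
  // catalogHashes(c)(r) is the same for catalog row c.
  // Returns every (queryIndex, catalogIndex) pair whose hashes match
  // exactly in at least one round.
  def candidates(
      queryHashes: IndexedSeq[Seq[String]],
      catalogHashes: IndexedSeq[Seq[String]]): Set[(Int, Int)] = {
    val rounds = queryHashes.headOption.map(_.length).getOrElse(0)
    (0 until rounds).flatMap { r =>
      // Bucket the catalog rows by their hash in round r.
      val buckets = catalogHashes.indices.groupBy(c => catalogHashes(c)(r))
      for {
        q <- queryHashes.indices
        c <- buckets.getOrElse(queryHashes(q)(r), Vector.empty[Int])
      } yield (q, c)
    }.toSet
  }
}
```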
class QueryNearestNeighbours extends QueryJoiner with Serializable

Brute force O(size(query) * size(catalog)) method to compute exact nearest neighbours for rows in the query matrix. As this is a very expensive computation, additional sample parameters may be passed such that neighbours are just computed for a random fraction of the data set.
final case class Signature(index: Long, vector: Vector, bitSet: BitSet) extends Ordered[Signature] with Product with Serializable

An id with its hash encoding and original vector.
class SlidingRDD[T] extends RDD[Array[T]]

Represents an RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them. The ordering is first based on the partition index and then the ordering of items within each partition. This is similar to sliding in Scala collections, except that it becomes an empty RDD if the window size is greater than the total number of items. It needs to trigger a Spark job if the parent RDD has more than one partition. To make this operation efficient, the number of items per partition should be larger than the window size and the window size should be small, e.g., 2.
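For comparison, this is how `sliding` behaves on an ordinary Scala collection. The sketch uses only the standard library and does not touch SlidingRDD itself. `SlidingSketch` and its fields are hypothetical names.

```scala
object SlidingSketch {
  val items = List(1, 2, 3, 4, 5)

  // Window size 2, step 1: consecutive overlapping pairs.
  val pairs: List[List[Int]] = items.sliding(2).toList
  // List(List(1, 2), List(2, 3), List(3, 4), List(4, 5))

  // Window larger than the collection: Scala collections still return one
  // partial window, whereas the RDD described above would be empty.
  val oversized: List[List[Int]] = items.sliding(10).toList
  // List(List(1, 2, 3, 4, 5))
}
```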

class SlidingRDDPartition[T] extends Partition with Serializable

NOTE: Both classes are copied from mllib and slightly modified, since they are private to mllib. Modified lines are marked with comments.
final case class SparseSignature(index: Long, bitSet: BitSet) extends Ordered[SparseSignature] with Product with Serializable

An id with its hash encoding.
case class SubBucket(bucketHash: Int, subBucketId: Int = 1) extends Product with Serializable
trait VectorDistance extends Serializable

Interface defining a similarity measure between two vectors.

Value Members

object BitSet extends Serializable
object Cosine extends VectorDistance

Implementation of VectorDistance that computes the cosine similarity between two vectors.
object Main
object SparkImplicits
def bitSetComparator(a: BitSet, b: BitSet): Int

Compares two bit sets according to the first different bit
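A minimal sketch of such a comparison, using `IndexedSeq[Boolean]` as a stand-in for the package's BitSet. The direction of the ordering (an unset bit sorts before a set bit) is an assumption, and `BitCompareSketch` and `compareBits` are hypothetical names.

```scala
object BitCompareSketch {
  // Returns 0 if all bits agree; otherwise the sign is decided by the
  // first position where the two vectors differ.
  // Assumption: at that position, an unset bit sorts before a set bit.
  def compareBits(a: IndexedSeq[Boolean], b: IndexedSeq[Boolean]): Int = {
    require(a.length == b.length, "bit vectors must have the same length")
    a.indices.find(i => a(i) != b(i)) match {
      case None    => 0
      case Some(i) => if (b(i)) -1 else 1
    }
  }
}
```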
def bitSetIsEqual(vec1: BitSet, vec2: BitSet): Boolean

Compares two bit sets for their equality
def bitSetToString(bs: BitSet): String

Returns a string representation of a BitSet
def distinct(matrix: RDD[MatrixEntry]): RDD[MatrixEntry]

Take distinct matrix entry values based on the indices only. The actual values are discarded.
def hamming(vec1: BitSet, vec2: BitSet): Int

Returns the hamming distance between two bit vectors
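The Hamming distance is the number of positions at which two bit vectors differ, which is the population count of their XOR. The sketch below uses `java.util.BitSet` as a stand-in. It is not the package's own BitSet class, and `HammingSketch` and `fromBits` are hypothetical helpers.

```scala
import java.util.{BitSet => JBitSet}

object HammingSketch {
  // XOR leaves a bit set exactly where the inputs differ;
  // cardinality counts those bits. The inputs are not modified.
  def hamming(a: JBitSet, b: JBitSet): Int = {
    val diff = a.clone().asInstanceOf[JBitSet]
    diff.xor(b)
    diff.cardinality()
  }

  // Builds a bit set with the given bit positions set.
  def fromBits(bits: Int*): JBitSet = {
    val bs = new JBitSet()
    bits.foreach(i => bs.set(i))
    bs
  }
}
```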
def hammingToCosine(hammingDistance: Int, d: Double): Double

Approximates the cosine distance of two bit sets using their hamming distance
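For signatures built from random hyperplanes, the fraction of differing bits estimates the angle between the original vectors divided by pi, so the cosine can be estimated as cos(pi * h / d). The sketch below assumes that `d` is the signature length in bits and that the result is the cosine of the estimated angle. Whether the package's function uses exactly this formula is not confirmed by the documentation above.

```scala
object HammingToCosineSketch {
  // Assumption: d is the number of bits in each signature.
  // h / d estimates angle / pi, so the angle is about pi * h / d.
  def hammingToCosine(hammingDistance: Int, d: Double): Double =
    math.cos(math.Pi * hammingDistance / d)
}
```

Under this formula, identical signatures map to 1.0, signatures that differ in half of their bits map to roughly 0, and fully complementary signatures map to -1.0.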
def localRandomMatrix(d: Int, numFeatures: Int): Matrix

Returns a local k by d matrix with random Gaussian entries (mean 0.0, standard deviation 1.0). The matrix is k by d because it is multiplied by the input matrix.
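A minimal sketch of generating such a matrix with the standard library, using nested arrays instead of a Spark `Matrix`. `RandomMatrixSketch`, `gaussianMatrix` and the `seed` parameter are hypothetical, and which dimension corresponds to which parameter of the function above is not stated in the documentation.

```scala
import scala.util.Random

object RandomMatrixSketch {
  // Returns k rows of d entries, each drawn from a standard normal
  // distribution (mean 0.0, standard deviation 1.0).
  def gaussianMatrix(k: Int, d: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(k, d)(rng.nextGaussian())
  }
}
```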
def matrixToBitSet(inputMatrix: IndexedRowMatrix, localRandomMatrix: Matrix): RDD[Signature]

Converts a given input matrix to a bit set representation using random hyperplanes
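The random hyperplane technique takes the dot product of a row with each hyperplane and records one bit per hyperplane, depending on the sign. A minimal sketch for a single row: `HyperplaneSketch`, `Sig` and `signature` are hypothetical stand-ins for illustration, and treating a dot product of exactly zero as an unset bit is an assumption.

```scala
object HyperplaneSketch {
  // Hypothetical stand-in for a row id together with its bit signature.
  final case class Sig(index: Long, bits: Vector[Boolean])

  // One bit per hyperplane: set if the row lies on its positive side.
  def signature(index: Long, row: Array[Double], planes: Array[Array[Double]]): Sig = {
    val bits = planes.map { p =>
      p.zip(row).map { case (a, b) => a * b }.sum > 0.0
    }.toVector
    Sig(index, bits)
  }
}
```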
def matrixToBitSetSparse(inputMatrix: IndexedRowMatrix, localRandomMatrix: Matrix): RDD[SparseSignature]

Converts a given input matrix to a bit set representation using random hyperplanes
def vectorToBitSet(vector: Vector): BitSet

Converts a vector to a bit set by replacing all values of x with sign(x)
