MinHash

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def approximateMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, bands: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

Implements an approximate pair-wise MinHash similarity check. "Approximate" refers to "all-pairs", not "similarity": MinHash signature comparison always approximates Jaccard similarity, and this method additionally uses a locality sensitive hashing (LSH) based approach to reduce the number of comparisons required.
We use the LSH technique described in section 3.4.1 of the Ullman text. This technique creates _b_ bands which divide the hashing space. For a MinHash signature of length _l_, we require b * r = l, where _r_ is the number of rows in each band. For given _b_ and _r_, we expect to compare the pairs of elements whose similarity is greater than approximately (1/b)^(1/r).
T
This function will operate on RDDs containing any type T that extends the MinHashable trait.
rdd
The RDD of data points to compute similarity on.
signatureLength
The length of MinHash signature to use.
bands
The number of bands to use for LSH. Must divide signatureLength evenly.
randomSeed
An optional seed for random number generation.
returns
Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

Exceptions thrown
IllegalArgumentException if the number of bands does not divide evenly into the signature length.
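The banding arithmetic described for approximateMinHash can be sketched in plain Scala. This is an illustration only, not this library's implementation: the object BandingSketch and its method names are hypothetical, and candidateProbability uses the standard banding formula 1 - (1 - s^r)^b, which this page does not state.

```scala
// Hypothetical helper, not part of the MinHash object documented here.
object BandingSketch {

  /** Rows per band r, enforcing the documented requirement b * r = l. */
  def rowsPerBand(signatureLength: Int, bands: Int): Int = {
    // require throws IllegalArgumentException when the condition is false.
    require(bands > 0 && signatureLength % bands == 0,
      s"bands ($bands) must divide signatureLength ($signatureLength) evenly")
    signatureLength / bands
  }

  /** Approximate similarity threshold (1/b)^(1/r). */
  def threshold(bands: Int, rows: Int): Double =
    math.pow(1.0 / bands, 1.0 / rows)

  /** Assumed standard formula: probability that a pair with similarity s
    * agrees on every row of at least one band, 1 - (1 - s^r)^b. */
  def candidateProbability(s: Double, bands: Int, rows: Int): Double =
    1.0 - math.pow(1.0 - math.pow(s, rows), bands)
}
```

For example, a signature length of 100 with 20 bands gives 5 rows per band and a threshold of roughly 0.55. More bands lower the threshold (more pairs compared), while fewer bands raise it.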
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def exactMinHash[T <: MinHashable](rdd: RDD[T], signatureLength: Int, randomSeed: Option[Long] = None): RDD[(Double, (T, T))]

Implements an exact pair-wise MinHash similarity check. "Exact" refers to "all-pairs", not "similarity": MinHash signature comparison approximates Jaccard similarity, but this method _exactly_ compares all pairs of inputs, as opposed to locality sensitive hashing (LSH) based approximations, which compare only a subset of pairs.
T
This function will operate on RDDs containing any type T that extends the MinHashable trait.
rdd
The RDD of data points to compute similarity on.
signatureLength
The length of MinHash signature to use.
randomSeed
An optional seed for random number generation.
returns
Returns an RDD containing all pairs of elements, with their similarity, as a tuple of (similarity, (elem1, elem2)).

Note
This operation may be expensive, as it performs a cartesian product of all elements in the input RDD.
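To make the all-pairs cost concrete, here is a minimal local sketch of exact pair-wise MinHash comparison over in-memory sets, without Spark. It is not this library's implementation: the object ExactPairsSketch, the use of Set[Int] in place of a MinHashable type, the linear hash family (a * x + b) mod p, and the choice to emit each unordered pair once are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical local sketch; names and hash family are assumptions.
object ExactPairsSketch {
  private val Prime = 2147483647L // 2^31 - 1

  /** Draws n coefficient pairs (a, b) for hashes of the form (a * x + b) mod Prime. */
  def hashCoefficients(n: Int, seed: Long): Vector[(Long, Long)] = {
    val rng = new Random(seed)
    Vector.fill(n)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))
  }

  /** MinHash signature: the minimum hash value of the set under each hash function. */
  def signature(items: Set[Int], coeffs: Vector[(Long, Long)]): Vector[Long] = {
    require(items.nonEmpty, "cannot compute a signature for an empty set")
    coeffs.map { case (a, b) => items.map(x => Math.floorMod(a * x + b, Prime)).min }
  }

  /** Estimated Jaccard similarity: fraction of signature positions that agree. */
  def estimate(s1: Vector[Long], s2: Vector[Long]): Double =
    s1.zip(s2).count { case (x, y) => x == y }.toDouble / s1.length

  /** Compares every unordered pair, yielding (similarity, (elem1, elem2)) tuples. */
  def allPairs(data: Vector[Set[Int]], signatureLength: Int,
               seed: Long): Vector[(Double, (Set[Int], Set[Int]))] = {
    val coeffs = hashCoefficients(signatureLength, seed)
    val sigs = data.map(signature(_, coeffs))
    for {
      i <- data.indices.toVector
      j <- (i + 1) until data.length
    } yield (estimate(sigs(i), sigs(j)), (data(i), data(j)))
  }
}
```

With n inputs this sketch performs n * (n - 1) / 2 signature comparisons, which is why the quadratic growth noted above dominates for large inputs.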
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Doc: package minhash

object MinHash extends Serializable

Inherited from Serializable

Inherited from Serializable

Inherited from AnyRef

Inherited from Any
