Interface | Description |
---|---|
CharacterSubstitutionInterface | Used to indicate the cost of character substitution. |

Class | Description |
---|---|
Cosine | The similarity between two strings is the cosine of the angle between their vector representations. |
Damerau | Implementation of Damerau-Levenshtein distance with transposition (also sometimes called unrestricted Damerau-Levenshtein distance). |
Jaccard | Each input string is converted into a set of n-grams; the Jaccard index is then computed as \|V1 ∩ V2\| / \|V1 ∪ V2\|. |
JaroWinkler | The Jaro–Winkler distance metric is designed and best suited for short strings such as person names, and to detect typos; it is (roughly) a variation of Damerau-Levenshtein, where the substitution of 2 close characters is considered less important than the substitution of 2 characters that are far from each other. |
KShingling | k-shingling is the operation of transforming a string (or text document) into a set of n-grams, which can be used to measure the similarity between two strings or documents. |
Levenshtein | The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. |
LongestCommonSubsequence | The longest common subsequence (LCS) problem consists of finding the longest subsequence common to two (or more) sequences. |
MetricLCS | Distance metric based on Longest Common Subsequence, from the notes "An LCS-based string metric" by Daniel Bakkelund. |
NGram | N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance", String Processing and Information Retrieval, Lecture Notes in Computer Science Volume 3772, 2005, pp. 115-126. |
NormalizedLevenshtein | This distance is computed as the Levenshtein distance divided by the length of the longer of the two strings. |
QGram | Q-gram distance, as defined by Ukkonen in "Approximate string-matching with q-grams and maximal matches". |
SorensenDice | Similar to the Jaccard index, but this time the similarity is computed as 2 * \|V1 ∩ V2\| / (\|V1\| + \|V2\|). |
StringProfile | Profile of a string (number of occurrences of each shingle/n-gram), computed using shingling. |
StringSet | Set representation of a string (list of occurring shingles/n-grams), without cardinality. |
WeightedLevenshtein | Implementation of Levenshtein that allows defining different weights for different character substitutions. |

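
The formulas summarized above for Levenshtein, NormalizedLevenshtein, KShingling, Jaccard and SorensenDice can be illustrated with a minimal standalone sketch. This is not the code of the classes listed in this package: the class name `SimilaritySketch`, the method names, and the conventions for empty inputs (distance 0.0 and similarity 1.0) are assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; it does not belong to the documented package.
final class SimilaritySketch {

    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn a into b (two rolling rows of the DP table).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Levenshtein distance divided by the length of the longer string.
    // Assumption: two empty strings have distance 0.0.
    public static double normalizedLevenshtein(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) levenshtein(a, b) / longest;
    }

    // Set of all substrings of length k (shingles / n-grams), without cardinality.
    public static Set<String> shingles(String s, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            result.add(s.substring(i, i + k));
        }
        return result;
    }

    private static int intersectionSize(Set<String> v1, Set<String> v2) {
        Set<String> inter = new HashSet<>(v1);
        inter.retainAll(v2);
        return inter.size();
    }

    // |V1 inter V2| / |V1 union V2|. Assumption: two empty sets give 1.0.
    public static double jaccard(Set<String> v1, Set<String> v2) {
        Set<String> union = new HashSet<>(v1);
        union.addAll(v2);
        return union.isEmpty() ? 1.0 : (double) intersectionSize(v1, v2) / union.size();
    }

    // 2 * |V1 inter V2| / (|V1| + |V2|). Assumption: two empty sets give 1.0.
    public static double sorensenDice(Set<String> v1, Set<String> v2) {
        int total = v1.size() + v2.size();
        return total == 0 ? 1.0 : 2.0 * intersectionSize(v1, v2) / total;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
        System.out.println(normalizedLevenshtein("kitten", "sitting"));
        Set<String> v1 = shingles("ABCDE", 2);
        Set<String> v2 = shingles("ABCDF", 2);
        System.out.println(jaccard(v1, v2));
        System.out.println(sorensenDice(v1, v2));
    }
}
```

In this sketch, "ABCDE" and "ABCDF" each yield four 2-shingles, three of which are shared, so the Jaccard index is 3/5 and the Sorensen-Dice coefficient is 6/8. The two measures rank pairs the same way but Sorensen-Dice weights the shared shingles more heavily.
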
Copyright © 2016. All rights reserved.