package distance
Type Members
- trait EditDistance[E] extends AnyRef
Value Members
- def affix(string: String)(arity: Int): String
- def bigramsWithAffixing(string: String): Seq[String]
- def ngrams(string: String)(arity: Int): Seq[String]
- def ngramsWithAffixing(string: String)(arity: Int): Seq[String]
- def tokenizeWords(s: String): Array[String]
- def trigramsWithAffixing(string: String): Seq[String]
- object DiceSorensenDistance
- object JaroWinklerDistance
The Jaro-Winkler distance measures the similarity between two strings. This metric is best suited to short strings such as personal names, since it compares characters only within a limited window (whereas edit distance methods compare all characters).
See https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for the definition. See http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html for a detailed explanation of the algorithm.
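The windowed comparison described above can be illustrated with a minimal sketch of the textbook Jaro-Winkler similarity. The object name `JaroWinklerSketch`, the method names, the default scaling factor of 0.1 and the prefix cap of 4 are assumptions taken from the standard definition, not from this package's implementation, which may differ (for example in whether it returns a similarity or a distance).

```scala
object JaroWinklerSketch {
  // Jaro similarity in [0, 1]; 1.0 means identical strings.
  def jaro(s1: String, s2: String): Double =
    if (s1.isEmpty && s2.isEmpty) 1.0
    else if (s1.isEmpty || s2.isEmpty) 0.0
    else {
      // Two equal characters only count as a match within this window.
      val window = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
      val matched1 = new Array[Boolean](s1.length)
      val matched2 = new Array[Boolean](s2.length)
      var matches = 0
      for (i <- s1.indices) {
        val hi = math.min(s2.length - 1, i + window)
        var j = math.max(0, i - window)
        var found = false
        while (!found && j <= hi) {
          if (!matched2(j) && s1(i) == s2(j)) {
            matched1(i) = true
            matched2(j) = true
            matches += 1
            found = true
          }
          j += 1
        }
      }
      if (matches == 0) 0.0
      else {
        // Matched characters that line up in a different order are transpositions.
        val m1 = s1.indices.filter(i => matched1(i)).map(i => s1(i))
        val m2 = s2.indices.filter(i => matched2(i)).map(i => s2(i))
        val transpositions = m1.zip(m2).count { case (a, b) => a != b } / 2
        val m = matches.toDouble
        (m / s1.length + m / s2.length + (m - transpositions) / m) / 3.0
      }
    }

  // Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
  def jaroWinkler(s1: String, s2: String, scaling: Double = 0.1): Double = {
    val j = jaro(s1, s2)
    val prefix = math.min(4, s1.zip(s2).takeWhile { case (a, b) => a == b }.length)
    j + prefix * scaling * (1.0 - j)
  }
}
```

For the classic pair "MARTHA" and "MARHTA", this sketch finds six matches and one transposition, giving a Jaro similarity of about 0.944 and a Jaro-Winkler similarity of about 0.961.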
- object LevenshteinDistance extends EditDistance[Char]
Levenshtein distance is the classical string difference metric. It is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into another. It is typically implemented with a dynamic programming approach.
See https://en.wikipedia.org/wiki/Levenshtein_distance
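The dynamic programming approach mentioned above can be sketched as follows. This is an illustrative two-row formulation; the object name `LevenshteinSketch` and its `distance` method are hypothetical and do not reflect the `EditDistance[Char]` interface of this package.

```scala
object LevenshteinSketch {
  def distance(a: String, b: String): Int = {
    // previous(j) = edits needed to turn the first i - 1 chars of a into the first j chars of b.
    var previous = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i // deleting all i characters of a's prefix
      for (j <- 1 to b.length) {
        val substitution = previous(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        val deletion = previous(j) + 1
        val insertion = current(j - 1) + 1
        current(j) = math.min(substitution, math.min(deletion, insertion))
      }
      previous = current
    }
    previous(b.length)
  }
}
```

Keeping only two rows uses memory proportional to the length of `b` rather than to the product of both lengths, which is usually sufficient when only the distance (not the edit script) is needed.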
- object NgramDistance extends EditDistance[String]
N-gram edit distance is an edit distance metric which considers multiple characters at a time. It takes the idea of Levenshtein distance and treats each n-gram as a character. The effect is that insertions and deletions which don't involve double letters are penalized more heavily with n-grams than with unigrams. In essence, it introduces a notion of context and favors strings with continuous stretches of equal characters (since it multiplies the number of comparisons). It is generally used with bigrams, which offer the best efficiency/performance ratio. We also refine this approach with some level of partial credit for n-grams that share common characters. In addition, we use string affixing, which allows the first character to participate in the same number of n-grams as an intermediate character. Also, words that don't begin with the same n-1 characters receive a penalty for not matching the prefix.
See http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf (N-Gram Similarity and Distance, Grzegorz Kondrak, 2005). This approach is described in "Taming Text", chapter 4 "Fuzzy string matching", https://www.manning.com/books/taming-text
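To make the idea of treating each n-gram as a character concrete, here is a minimal sketch under several assumptions: the padding symbol `#`, the prefix-only affixing, and all names in `NgramSketch` are hypothetical, and the partial-credit refinement described above is deliberately left out. It is not the implementation behind `NgramDistance`.

```scala
object NgramSketch {
  // Hypothetical padding symbol; the library's actual affix character may differ.
  val Pad: Char = '#'

  // Prepend arity - 1 padding symbols so the first real character appears in `arity` n-grams.
  def affix(s: String, arity: Int): String = Pad.toString * (arity - 1) + s

  // Sliding window of `arity` characters over the string.
  def ngrams(s: String, arity: Int): Vector[String] = s.sliding(arity).toVector

  // Plain Levenshtein over arbitrary symbols, so each n-gram counts as one "character".
  def editDistance[A](a: IndexedSeq[A], b: IndexedSeq[A]): Int = {
    var previous = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        current(j) = math.min(previous(j - 1) + cost, math.min(previous(j), current(j - 1)) + 1)
      }
      previous = current
    }
    previous(b.length)
  }

  // Unnormalized bigram edit distance with prefix affixing and no partial credit.
  def bigramDistance(x: String, y: String): Int =
    editDistance(ngrams(affix(x, 2), 2), ngrams(affix(y, 2), 2))
}
```

With this sketch, "word" and "wrd" differ by a single deletion at the character level but by two edits at the bigram level (`#w wo or rd` versus `#w wr rd`), which illustrates the heavier penalty for insertions and deletions.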