Package | Description |
---|---|
info.debatty.java.stringsimilarity | |
info.debatty.java.stringsimilarity.interfaces | |
Modifier and Type | Class and Description |
---|---|
class |
Cosine
The similarity between the two strings is the cosine of the angle between
the vector representations of the two strings.
|
class |
Damerau
Implementation of Damerau-Levenshtein distance, computed as the
minimum number of operations needed to transform one string into the other,
where an operation is defined as an insertion, deletion, or substitution of a
single character, or a transposition of two adjacent characters.
|
class |
Jaccard
Each input string is converted into a set of n-grams; the Jaccard index is
then computed as |V1 ∩ V2| / |V1 ∪ V2|.
|
class |
JaroWinkler
The Jaro–Winkler distance metric is designed for, and best suited to, short
strings such as person names, and to detecting typos; it is (roughly) a
variation of Damerau-Levenshtein, where the substitution of 2 close
characters is considered less important than the substitution of 2 characters
that are far from each other.
|
class |
Levenshtein
The Levenshtein distance between two words is the minimum number of
single-character edits (insertions, deletions or substitutions) required to
change one word into the other.
|
class |
LongestCommonSubsequence
The longest common subsequence (LCS) problem consists of finding the
longest subsequence common to two (or more) sequences.
|
class |
MetricLCS
Distance metric based on Longest Common Subsequence, from the notes "An
LCS-based string metric" by Daniel Bakkelund.
|
class |
NGram
N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance",
String Processing and Information Retrieval, Lecture Notes in Computer
Science, Volume 3772, 2005, pp. 115-126.
|
class |
NormalizedLevenshtein
This distance is computed as the Levenshtein distance divided by the length of
the longer string.
|
class |
QGram
Q-gram distance, as defined by Ukkonen in "Approximate string-matching with
q-grams and maximal matches".
|
class |
SorensenDice
Similar to the Jaccard index, but the similarity is computed as
2 * |V1 ∩ V2| / (|V1| + |V2|).
|
class |
WeightedLevenshtein
Implementation of Levenshtein that allows defining different weights for
different character substitutions.
|
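Several of the formulas in the table above can be sketched in a few lines of plain Java. The block below is only an illustration of the definitions as stated (Levenshtein, its normalized form, and the Jaccard and Sorensen-Dice indices over n-gram sets). It does not use or reproduce the classes of this package: the class name `SimilaritySketch`, the method names, the choice of character n-grams without padding, and the value returned for two empty strings are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of the documented package.
final class SimilaritySketch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions or substitutions, computed with two rolling DP rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev; // swap rows so prev always holds the last finished row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Levenshtein distance divided by the length of the longer string.
    // Assumption: two empty strings are at distance 0.0.
    static double normalizedLevenshtein(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) levenshtein(a, b) / longest;
    }

    // Set of character n-grams of s (assumption: no padding, no normalization).
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard index: |V1 ∩ V2| / |V1 ∪ V2|.
    // Assumption: two strings with no n-grams at all are treated as identical (1.0).
    static double jaccard(String a, String b, int n) {
        Set<String> v1 = ngrams(a, n);
        Set<String> v2 = ngrams(b, n);
        Set<String> inter = new HashSet<>(v1);
        inter.retainAll(v2);
        Set<String> union = new HashSet<>(v1);
        union.addAll(v2);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Sorensen-Dice coefficient: 2 * |V1 ∩ V2| / (|V1| + |V2|).
    static double sorensenDice(String a, String b, int n) {
        Set<String> v1 = ngrams(a, n);
        Set<String> v2 = ngrams(b, n);
        Set<String> inter = new HashSet<>(v1);
        inter.retainAll(v2);
        int total = v1.size() + v2.size();
        return total == 0 ? 1.0 : 2.0 * inter.size() / total;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));   // prints 3
        System.out.println(normalizedLevenshtein("kitten", "sitting"));
        System.out.println(jaccard("night", "nacht", 2));
        System.out.println(sorensenDice("night", "nacht", 2));  // prints 0.25
    }
}
```

With bigrams, "night" and "nacht" share only "ht" out of seven distinct bigrams, so the Jaccard index is 1/7 while the Sorensen-Dice coefficient is 2 * 1 / (4 + 4) = 0.25. This shows how the two indices weigh the same overlap differently.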
Modifier and Type | Interface and Description |
---|---|
interface |
MetricStringDistance
String distances that implement this interface are metrics, which means:
d(x, y) ≥ 0 (non-negativity, or separation axiom)
d(x, y) = 0 if and only if x = y (identity, or coincidence axiom)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
|
interface |
NormalizedStringDistance
Normalized string distances return a value between 0.0 and 1.0.
|
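The four metric axioms listed for MetricStringDistance can be spot-checked on a finite set of sample strings. The sketch below is an assumption-laden illustration, not part of this package: `MetricAxiomCheck` and `holdsOnSamples` are hypothetical names, the distance is passed as a standard `ToDoubleBiFunction`, and a small Levenshtein implementation is included only so the example is self-contained. A passing check does not prove that a distance is a metric; it can only expose a violation.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical checker for illustration; not part of the documented package.
final class MetricAxiomCheck {

    // Returns false as soon as one axiom fails for some combination of samples.
    static boolean holdsOnSamples(ToDoubleBiFunction<String, String> d, String[] samples) {
        final double eps = 1e-9; // tolerance for floating-point comparisons
        for (String x : samples) {
            for (String y : samples) {
                double dxy = d.applyAsDouble(x, y);
                if (dxy < 0) {
                    return false; // non-negativity violated
                }
                if ((dxy == 0) != x.equals(y)) {
                    return false; // identity violated: d(x, y) = 0 iff x = y
                }
                if (Math.abs(dxy - d.applyAsDouble(y, x)) > eps) {
                    return false; // symmetry violated
                }
                for (String z : samples) {
                    if (d.applyAsDouble(x, z) > dxy + d.applyAsDouble(y, z) + eps) {
                        return false; // triangle inequality violated
                    }
                }
            }
        }
        return true;
    }

    // Plain Levenshtein distance, included only as a sample distance to check.
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            dp[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            dp[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + cost,
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "ab", "ba", "abc"};
        // Levenshtein passes the check on these samples.
        System.out.println(holdsOnSamples(MetricAxiomCheck::levenshtein, samples));
        // A length difference is not a metric: "ab" and "ba" differ but are at distance 0.
        System.out.println(holdsOnSamples((x, y) -> Math.abs(x.length() - y.length()), samples));
    }
}
```

The second call illustrates why the identity axiom matters: a function can be non-negative, symmetric and satisfy the triangle inequality while still assigning distance 0 to two different strings.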
Copyright © 2015. All rights reserved.