- All Implemented Interfaces:
- Serializable, Cloneable, Iterable<String>, Collection<String>, Set<String>
public class KShingling
extends HashSet<String>
implements Serializable
A k-shingling is a set of unique k-grams, used to measure the similarity of
two documents.
Generally speaking, a k-gram is any sequence of k tokens. Here we use the
definition from Leskovec, Rajaraman & Ullman (2014),
"Mining of Massive Datasets", Cambridge University Press:
runs of consecutive spaces are replaced by a single space, and a k-gram is a
sequence of k characters.
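The definition above can be illustrated with a minimal sketch. It is not the implementation of this class: the class name ShingleSketch and the methods shingles and jaccard are hypothetical names introduced only for illustration. Using the Jaccard index to compare two shingle sets is also an assumption, since the description above does not say which similarity measure is intended.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of the documented KShingling API.
public class ShingleSketch {

    // Collapse each run of spaces into one space, then collect every
    // substring of exactly k characters into a set (duplicates vanish).
    static Set<String> shingles(String text, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String normalized = text.replaceAll(" +", " ");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + k));
        }
        return result;
    }

    // Assumed similarity measure: |A ∩ B| / |A ∪ B| (Jaccard index).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here for two empty sets
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // "abcdab" yields the 2-grams ab, bc, cd, da ("ab" occurs twice, kept once).
        Set<String> first = shingles("abcdab", 2);
        // "ab   cd" is normalized to "ab cd" before shingling.
        Set<String> second = shingles("ab   cd", 2);
        System.out.println(first);
        System.out.println(second);
        System.out.println(jaccard(first, second));
    }
}
```

Because the result is a set, a k-gram that occurs several times in the text is stored once, which matches the "set of unique k-grams" wording above. A text shorter than k characters produces an empty set in this sketch.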
- Author:
- Thibault Debatty http://www.debatty.info
- See Also:
- Serialized Form