All Classes and Interfaces
Class
Description
A dictionary interface for abbreviations.
The anchor text is the visible, clickable text in a hyperlink.
Bigrams or digrams are groups of two words, and are very commonly used
as the basis for simple statistical analysis of text.
Collocations are expressions of multiple words which commonly co-occur.
The BM25 weighting scheme, often called Okapi weighting, after the system in
which it was first implemented, was developed as a way of building a
probabilistic model sensitive to term frequency and document length while
not introducing too many additional parameters into the model.
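As a rough illustration of how term frequency and document length enter the score, here is a minimal sketch of one common BM25 variant. The class name `Bm25Sketch`, the parameter values `K1 = 1.2` and `B = 0.75`, and the smoothed IDF form are assumptions for illustration; they are not taken from this library's implementation.

```java
final class Bm25Sketch {
    private Bm25Sketch() {}

    // Assumed, commonly used defaults: K1 controls term-frequency
    // saturation, B controls the strength of length normalization.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Smoothed inverse document frequency (one common, non-negative variant). */
    static double idf(long numDocs, long docsWithTerm) {
        return Math.log(1.0 + (numDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    /**
     * Score of a single term for a single document.
     *
     * @param tf           occurrences of the term in the document
     * @param docLen       length of the document in words
     * @param avgDocLen    average document length in the corpus
     * @param numDocs      number of documents in the corpus
     * @param docsWithTerm number of documents containing the term
     */
    static double score(int tf, int docLen, double avgDocLen,
                        long numDocs, long docsWithTerm) {
        if (tf <= 0) {
            return 0.0;
        }
        // Longer-than-average documents get a larger penalty term.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(numDocs, docsWithTerm) * tf * (K1 + 1.0) / (tf + norm);
    }
}
```

In this sketch the score grows with term frequency but saturates below `idf * (K1 + 1)`, and the same term frequency scores lower in a longer document.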
A sentence splitter based on the java.text.BreakIterator, which supports
multiple natural languages (selected by locale setting).
A word tokenizer based on the java.text.BreakIterator, which supports
multiple natural languages (selected by locale setting).
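The two BreakIterator-based entries above can be illustrated directly with the JDK class they build on. This is a minimal sketch, not this library's code: the class `BreakIteratorSketch` and its method names are hypothetical, and the rule of dropping pieces that do not start with a letter or digit is a simplifying assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {
    private BreakIteratorSketch() {}

    /** Splits text into sentences using the rules of the given locale. */
    static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text, false);
    }

    /** Splits text into words, dropping whitespace and punctuation pieces. */
    static List<String> words(String text, Locale locale) {
        return split(BreakIterator.getWordInstance(locale), text, true);
    }

    private static List<String> split(BreakIterator it, String text, boolean wordsOnly) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one piece.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) {
                continue;
            }
            if (wordsOnly && !Character.isLetterOrDigit(piece.codePointAt(0))) {
                continue; // assumed filter: skip punctuation-only pieces
            }
            out.add(piece);
        }
        return out;
    }
}
```

Passing a different `Locale` selects that language's boundary rules, which is what "selected by locale setting" refers to.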
Keyword extraction from a single document using word co-occurrence statistics.
A corpus is a collection of documents.
A dictionary is a set of words in some natural language.
A concise dictionary of common terms in English.
An English lexicon with part-of-speech tags.
Punctuation marks in English.
Several sets of English stop words.
Global Vectors for Word Representation.
Part-of-speech tagging with a hidden Markov model.
The Paice/Husk Lancaster stemming algorithm.
An n-gram is a contiguous sequence of n words from a given sequence of text.
An n-gram is a contiguous sequence of n words from a given sequence of text.
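To make the definition concrete, the following sketch slides a window of n words over a token list. The class `NGramSketch` and the choice to join words with a single space are assumptions for illustration; with `n = 2` it yields the bigrams described earlier in this index.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramSketch {
    private NGramSketch() {}

    /**
     * Returns every contiguous run of n tokens, joined by a single space.
     * Returns an empty list when the input has fewer than n tokens.
     */
    static List<String> ngrams(List<String> tokens, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> out = new ArrayList<>();
        // The last window starts at index tokens.size() - n.
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }
}
```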
Normalization transforms text into a canonical form by removing unwanted
variations.
A paragraph splitter segments text into paragraphs.
The Penn Treebank tag set.
A word tokenizer that tokenizes English sentences using the conventions
used by the Penn Treebank.
Porter's stemming algorithm.
Part-of-speech tagging (POS tagging) is the process of marking up the words
in a sentence as corresponding to a particular part of speech.
Punctuation marks are symbols that indicate the structure and organization
of written language, as well as intonation and pauses to be observed when
reading aloud.
In the context of information retrieval, relevance denotes how well a
retrieved set of documents meets the information need of the user.
An interface for relevance ranking algorithms.
A sentence splitter segments text into sentences (a string of words
satisfying the grammatical rules of a language).
An in-memory text corpus.
A simple implementation of the dictionary interface.
A baseline normalizer for processing Unicode text.
This is a simple paragraph splitter.
This is a simple sentence splitter for English.
A list-of-words representation of documents.
A word tokenizer that tokenizes English sentences with some differences from
TreebankWordTokenizer, notably in its handling of not-contractions.
A Stemmer transforms a word into its root form.
A set of stop words in some language.
A minimal interface of text in the corpus.
The terms in a text.
The tf-idf weight (term frequency-inverse document frequency) is a weight
often used in information retrieval and text mining.
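As a small worked example of the weight, the sketch below uses the plain textbook form, raw term frequency times the natural log of N / df. Many variants exist (normalized term frequency, smoothed IDF), so the formula, the class `TfIdfSketch`, and its parameter names are assumptions for illustration and may differ from what this library computes.

```java
final class TfIdfSketch {
    private TfIdfSketch() {}

    /**
     * Textbook tf-idf: tf * ln(numDocs / docsWithTerm).
     *
     * @param tf           occurrences of the term in the document
     * @param numDocs      number of documents in the corpus
     * @param docsWithTerm number of documents containing the term (must be > 0)
     */
    static double tfidf(int tf, long numDocs, long docsWithTerm) {
        if (docsWithTerm <= 0 || docsWithTerm > numDocs) {
            throw new IllegalArgumentException("need 0 < docsWithTerm <= numDocs");
        }
        // A term found in every document gets ln(1) = 0, so it carries no weight.
        return tf * Math.log((double) numDocs / docsWithTerm);
    }
}
```

The weight rises with how often a term appears in a document and falls with how many documents contain it, so corpus-wide common words score near zero.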
A token is a string of characters, categorized according to the rules as a
symbol.
A trie, also called a digital tree or prefix tree, is an ordered tree data
structure used to store a dynamic set or associative array whose keys are
usually strings.
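The structure can be sketched in a few lines: each node maps a character to a child, and a key's value lives at the node reached by following its characters from the root. The class `TrieSketch` and its methods are hypothetical and chosen for brevity; they are not this library's trie API.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieSketch<V> {
    /** One node per distinct prefix; value is null unless a key ends here. */
    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value;
    }

    private final Node<V> root = new Node<>();

    /** Associates value with key, creating nodes along the path as needed. */
    void put(String key, V value) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node<>());
        }
        node.value = value;
    }

    /** Returns the value stored for key, or null if key was never inserted. */
    V get(String key) {
        Node<V> node = find(key);
        return node == null ? null : node.value;
    }

    /** Returns true if at least one inserted key starts with prefix. */
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    private Node<V> find(String s) {
        Node<V> node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

Keys that share a prefix share nodes, so a lookup costs time proportional to the key length rather than to the number of stored keys.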
Word2vec is a group of related models that are used to produce word
embeddings.