All Classes and Interfaces

Class
Description
A dictionary interface for abbreviations.
The anchor text is the visible, clickable text in a hyperlink.
Bigrams or digrams are groups of two words, and are very commonly used as the basis for simple statistical analysis of text.
Collocations are expressions of multiple words which commonly co-occur.
The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model.
A sentence splitter based on the java.text.BreakIterator, which supports multiple natural languages (selected by locale setting).
A word tokenizer based on the java.text.BreakIterator, which supports multiple natural languages (selected by locale setting).
Keyword extraction from a single document using word co-occurrence statistical information.
A corpus is a collection of documents.
A dictionary is a set of words in some natural language.
A concise dictionary of common terms in English.
An English lexicon with part-of-speech tags.
Punctuation marks in English.
Several sets of English stop words.
Global Vectors for Word Representation.
Part-of-speech tagging with a hidden Markov model.
The Paice/Husk Lancaster stemming algorithm.
An n-gram is a contiguous sequence of n words from a given sequence of text.
An n-gram is a contiguous sequence of n words from a given sequence of text.
Normalization transforms text into a canonical form by removing unwanted variations.
A paragraph splitter segments text into paragraphs.
The Penn Treebank tag set.
A word tokenizer that tokenizes English sentences using the conventions used by the Penn Treebank.
Porter's stemming algorithm.
Part-of-speech tagging (POS tagging) is the process of marking up the words in a sentence as corresponding to a particular part of speech.
Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud.
In the context of information retrieval, relevance denotes how well a retrieved set of documents meets the information need of the user.
An interface for relevance ranking algorithms.
A sentence splitter segments text into sentences (a string of words satisfying the grammatical rules of a language).
An in-memory text corpus.
A simple implementation of the dictionary interface.
A baseline normalizer for processing Unicode text.
This is a simple paragraph splitter.
This is a simple sentence splitter for English.
A list-of-words representation of documents.
A word tokenizer that tokenizes English sentences with some differences from TreebankWordTokenizer, notably in how it handles not-contractions.
A Stemmer transforms a word into its root form.
A set of stop words in some language.
A minimal interface for a text in a corpus.
The terms in a text.
The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining.
A token is a string of characters, categorized according to the rules as a symbol.
A trie, also called digital tree or prefix tree, is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
Word2vec is a group of related models that are used to produce word embeddings.
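
The trie entry in this index describes a prefix tree keyed by strings. The following is a minimal sketch of that data structure using only the Java standard library. The class name `TrieSketch` and its methods `put`, `get`, and `hasPrefix` are hypothetical names chosen for illustration; they are not taken from the library's API, whose actual signatures may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a character-keyed trie; not the library's class. */
public class TrieSketch<V> {
    /** One node per character on a path from the root. */
    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value; // stays null unless a key ends at this node
    }

    private final Node<V> root = new Node<>();

    /** Stores a value under the key, creating missing nodes along the path. */
    public void put(String key, V value) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node<>());
        }
        node.value = value;
    }

    /** Returns the value stored under the exact key, or null if there is none. */
    public V get(String key) {
        Node<V> node = find(key);
        return node == null ? null : node.value;
    }

    /** Returns true if some stored key starts with the given prefix. */
    public boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    /** Walks the path spelled by s; returns null if the path does not exist. */
    private Node<V> find(String s) {
        Node<V> node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }

    public static void main(String[] args) {
        TrieSketch<Integer> trie = new TrieSketch<>();
        trie.put("token", 1);
        trie.put("tokenizer", 2);
        System.out.println(trie.get("token"));     // prints 1
        System.out.println(trie.get("tok"));       // prints null
        System.out.println(trie.hasPrefix("tok")); // prints true
    }
}
```

Keys that share a prefix share nodes, so a lookup costs time proportional to the key length rather than to the number of stored keys.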