All Classes and Interfaces

Class

Description

A dictionary interface for abbreviations.

The anchor text is the visible, clickable text in a hyperlink.

Bigrams or digrams are groups of two words, and are very commonly used as the basis for simple statistical analysis of text.

Bigram

Collocations are expressions of multiple words which commonly co-occur.

BM25

The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to term frequency and document length while not introducing too many additional parameters into the model.
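As a rough illustration of how term frequency and document length enter the score, the sketch below implements one common textbook form of BM25 for a single term. The class name `Bm25Sketch`, the method signature, and the particular smoothed IDF variant are assumptions made for this example and are not taken from the library's BM25 class.

```java
// Hypothetical helper, not the library's BM25 class.
class Bm25Sketch {
    /**
     * Scores one term against one document.
     *
     * @param tf        frequency of the term in the document
     * @param docLen    length of the document in words
     * @param avgDocLen average document length in the corpus
     * @param N         number of documents in the corpus
     * @param n         number of documents containing the term
     * @param k1        term-frequency saturation parameter (often around 1.2)
     * @param b         length-normalization parameter in [0, 1] (often 0.75)
     */
    static double score(double tf, double docLen, double avgDocLen,
                        long N, long n, double k1, double b) {
        // Smoothed inverse document frequency (one of several variants in use).
        double idf = Math.log(1.0 + (N - n + 0.5) / (n + 0.5));
        // Length normalization: longer-than-average documents are penalized.
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        // Saturating term-frequency component scaled by the IDF.
        return idf * tf * (k1 + 1.0) / (tf + norm);
    }
}
```

Only two free parameters, `k1` and `b`, appear here, which reflects the goal of keeping the number of additional parameters small.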

BreakIteratorSentenceSplitter

A sentence splitter based on the java.text.BreakIterator, which supports multiple natural languages (selected by locale setting).
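For readers unfamiliar with java.text.BreakIterator, the following minimal sketch shows how locale-aware sentence boundaries can be obtained from the JDK directly. The class `SentenceBreakSketch` and its `split` method are hypothetical names for this example; the library's own splitter may differ in how it trims or post-processes the segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper built only on the JDK, not the library's class.
class SentenceBreakSketch {
    static List<String> split(String text, Locale locale) {
        // The locale selects the language-specific boundary rules.
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Passing `BreakIterator.getWordInstance(locale)` instead would yield word-level boundaries, which is presumably the basis of the tokenizer variant listed next.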

BreakIteratorTokenizer

A word tokenizer based on the java.text.BreakIterator, which supports multiple natural languages (selected by locale setting).

CooccurrenceKeywords

Keyword extraction from a single document using word co-occurrence statistical information.

Corpus

A corpus is a collection of documents.

Dictionary

A dictionary is a set of words in some natural language.

EnglishDictionary

A concise dictionary of common terms in English.

EnglishPOSLexicon

An English lexicon with part-of-speech tags.

EnglishPunctuations

Punctuation marks in English.

EnglishStopWords

Several sets of English stop words.

GloVe

Global Vectors for Word Representation.

HMMPOSTagger

Part-of-speech tagging with a hidden Markov model.

LancasterStemmer

The Paice/Husk Lancaster stemming algorithm.

NGram

An n-gram is a contiguous sequence of n words from a given sequence of text.
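The definition can be made concrete with a short sketch that slides a window of n words over an already tokenized text. The class `NGramSketch` and the choice to join words with a single space are assumptions for illustration and do not describe the NGram class itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the library's NGram class.
class NGramSketch {
    static List<String> ngrams(List<String> words, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<>();
        // Each window [i, i + n) is one contiguous n-gram.
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }
}
```

With n = 2 this produces bigrams; a text shorter than n words yields an empty list.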

NGram

An n-gram is a contiguous sequence of n words from a given sequence of text.

Normalizer

Normalization transforms text into a canonical form by removing unwanted variations.
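One way to picture "removing unwanted variations" is Unicode compatibility normalization followed by whitespace cleanup, as sketched below with the JDK's java.text.Normalizer. The class `NormalizeSketch` and the specific steps chosen (NFKC, collapsing whitespace, trimming) are assumptions for this example; the library's normalizers may apply a different set of rules.

```java
import java.text.Normalizer;

// Hypothetical helper built only on the JDK, not the library's Normalizer.
class NormalizeSketch {
    static String normalize(String text) {
        // NFKC folds compatibility characters (ligatures, full-width forms)
        // into their canonical equivalents.
        String canonical = Normalizer.normalize(text, Normalizer.Form.NFKC);
        // Collapse runs of whitespace into a single space and trim the ends.
        return canonical.replaceAll("\\s+", " ").trim();
    }
}
```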

ParagraphSplitter

A paragraph splitter segments text into paragraphs.

PennTreebankPOS

The Penn Treebank tag set.

PennTreebankTokenizer

A word tokenizer that tokenizes English sentences using the conventions used by the Penn Treebank.

PorterStemmer

Porter's stemming algorithm.

POSTagger

Part-of-speech tagging (POS tagging) is the process of marking up the words in a sentence as corresponding to a particular part of speech.

Punctuations

Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud.

Relevance

In the context of information retrieval, relevance denotes how well a retrieved set of documents meets the information need of the user.

RelevanceRanker

An interface for relevance ranking algorithms.

SentenceSplitter

A sentence splitter segments text into sentences (a string of words satisfying the grammatical rules of a language).

SimpleCorpus

An in-memory text corpus.

SimpleDictionary

A simple implementation of the dictionary interface.

SimpleNormalizer

A baseline normalizer for processing Unicode text.

SimpleParagraphSplitter

This is a simple paragraph splitter.

SimpleSentenceSplitter

This is a simple sentence splitter for English.

SimpleText

A list-of-words representation of documents.

SimpleTokenizer

A word tokenizer that tokenizes English sentences with some differences from TreebankWordTokenizer, notably in its handling of "not" contractions.

Stemmer

A Stemmer transforms a word into its root form.

StopWords

A set of stop words in some language.

Text

A minimal interface of text in the corpus.

TextTerms

The terms in a text.

TFIDF

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining.
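To show how the two factors combine, here is a minimal sketch of the plain tf-idf product using raw term frequency and an unsmoothed logarithmic inverse document frequency. The class `TfIdfSketch` and this particular variant are assumptions for illustration; many weighting variants exist, and the TFIDF class may use a different one.

```java
// Hypothetical helper, not the library's TFIDF class.
class TfIdfSketch {
    /**
     * @param tf frequency of the term in the document
     * @param N  number of documents in the corpus
     * @param df number of documents containing the term (must be positive)
     */
    static double tfidf(int tf, long N, long df) {
        if (df <= 0 || df > N) {
            throw new IllegalArgumentException("df must be in [1, N]");
        }
        // Rare terms (small df) get a large idf; terms found in every
        // document get idf = log(1) = 0 and therefore no weight.
        double idf = Math.log((double) N / df);
        return tf * idf;
    }
}
```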

Tokenizer

A token is a string of characters, categorized according to the rules as a symbol.

Trie<K,V>

A trie, also called digital tree or prefix tree, is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
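The structure described above can be sketched for string keys in a few lines: each node holds a map from the next character to a child node, and a value is stored at the node where a key ends. The class `TrieSketch` and its methods are hypothetical and are not the API of Trie<K,V>, which is generic over the key element type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical string-keyed trie, not the library's Trie<K,V>.
class TrieSketch<V> {
    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value; // null unless a key ends at this node
    }

    private final Node<V> root = new Node<>();

    /** Associates the value with the key, creating nodes along the path. */
    void put(String key, V value) {
        Node<V> node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node<>());
        }
        node.value = value;
    }

    /** Returns the value stored for the key, or null if the key is absent. */
    V get(String key) {
        Node<V> node = find(key);
        return node == null ? null : node.value;
    }

    /** Returns true if any stored key starts with the given prefix. */
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the characters of s from the root; null if the path breaks.
    private Node<V> find(String s) {
        Node<V> node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

Keys that share a prefix, such as "tea" and "ten", share the nodes for that prefix, which is what makes prefix queries cheap.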

Word2Vec

Word2vec is a group of related models that are used to produce word embeddings.