Splits a sequence of tokens into sentences
Tokenizer using the OpenDomainLexer.g grammar
English open domain tokenizer
Portuguese open domain tokenizer
Tokenizer using the OpenDomainLexer.g grammar
Spanish open domain tokenizer
Tokenizer using the OpenDomainLexer.g grammar
Splits a sequence of Portuguese tokens into sentences
Stores a token as produced by a tokenizer
Splits a sequence of Spanish tokens into sentences
Generic tokenizer. Author: mihais. Date: 3/15/17
Thin wrapper over the ANTLR lexer. Author: mihais. Date: 3/21/17
Implements one step of a tokenization algorithm, which takes in a sequence of tokens and produces another. For example, contractions such as "don't" are handled here, as are domain-specific operations.
Implements one step of a tokenization algorithm, which takes in a sequence of tokens and produces another. For example, contractions such as "don't" are handled here, as are domain-specific operations. Note: one constraint that must be obeyed by any TokenizerStep is that RawToken.raw and the corresponding character positions must preserve the original text.
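To make the constraint concrete, here is an illustrative Python sketch (assumed, not the library's actual Scala API) of one such step: expanding English "n't" contractions while keeping the raw fields and character positions consistent with the original text.

```python
# A hypothetical TokenizerStep: tokens are (raw, begin, end, word) tuples.
# The raw pieces of the output concatenate back to the input raw text, so
# character offsets still point into the original document.
def split_nt(tokens):
    out = []
    for raw, begin, end, word in tokens:
        if raw.lower().endswith("n't") and len(raw) > 3:
            mid = begin + len(raw) - 3
            out.append((raw[:-3], begin, mid, raw[:-3]))  # e.g. "do"
            out.append((raw[-3:], mid, end, "not"))       # "n't", normalized to "not"
        else:
            out.append((raw, begin, end, word))
    return out
```

For example, `split_nt([("don't", 0, 5, "don't")])` yields two tokens whose raw strings, "do" and "n't", concatenate back to the original "don't".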
Normalize text while keeping crucial accented characters, e.g. 'á'.
Normalize text while keeping crucial accented characters, e.g. 'á'.
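One way such normalization can work is sketched below in Python: strip diacritics via Unicode NFD decomposition, but preserve a small keep-set of accented characters that carry meaning in English, Portuguese, and Spanish. The keep-set here is an assumption for illustration, not the library's actual list.

```python
import unicodedata

# Hypothetical keep-set of "crucial" accented characters.
KEEP = {"á", "é", "í", "ó", "ú", "ü", "ñ", "ç", "ã", "õ", "à", "â", "ê", "ô"}

def normalize_keep_accents(text: str) -> str:
    out = []
    for ch in text:
        if ch in KEEP:
            out.append(ch)  # preserve crucial accented characters as-is
        else:
            # canonical decomposition, then drop the combining marks
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

For example, `normalize_keep_accents("naïve café")` strips the diaeresis from 'ï' (not in the keep-set) but leaves 'é' intact, producing "naive café".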
Resolves English contractions. Author: mihais. Date: 3/21/17
Resolves Portuguese contractions. Authors: dane, mihais. Date: 7/10/2018
Resolves Spanish contractions. Authors: dane, mihais. Date: 7/23/2018
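Spanish and Portuguese contraction resolution follows the same pattern as the English step. A minimal Python sketch for Spanish, assuming a small example mapping (not the library's full table): "del" expands to "de" + "el", and "al" to "a" + "el", with raw text and offsets kept intact.

```python
# Hypothetical mapping from a Spanish contraction to its normalized words.
SPANISH_CONTRACTIONS = {
    "del": ("de", "el"),
    "al": ("a", "el"),
}

def resolve_spanish(tokens):
    """tokens: list of (raw, begin, end, word) tuples."""
    out = []
    for raw, begin, end, word in tokens:
        parts = SPANISH_CONTRACTIONS.get(raw.lower())
        if parts:
            first, second = parts
            mid = begin + len(first)
            # split raw so the pieces concatenate back to the original text;
            # only the `word` field carries the normalized form
            out.append((raw[:len(first)], begin, mid, first))
            out.append((raw[len(first):], mid, end, second))
        else:
            out.append((raw, begin, end, word))
    return out
```

Note that for "del" the second token's raw is just "l" while its normalized word is "el": the raw fields must reconstruct the original text, and only the word field is normalized.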
Stores a token as produced by a tokenizer
The EXACT text tokenized
beginning character offset of raw
end character offset of raw
Normalized form of raw, e.g., "'m" becomes "am". Note: these are NOT lemmas.
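The four fields above can be summarized with a small Python sketch (field names follow the descriptions; this is not the library's actual class definition):

```python
from dataclasses import dataclass

@dataclass
class RawToken:
    raw: str            # the EXACT text tokenized, e.g. "'m"
    beginPosition: int  # beginning character offset of raw in the document
    endPosition: int    # end character offset of raw
    word: str           # normalized form of raw, e.g. "'m" -> "am" (NOT a lemma)

# Example: the token for "'m" in the text "I'm here" spans offsets 1..3,
# so slicing the original text with the offsets recovers raw exactly.
text = "I'm here"
tok = RawToken("'m", 1, 3, "am")
```

The invariant to remember: `text[tok.beginPosition:tok.endPosition] == tok.raw`, while `tok.word` is free to differ.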