A Sentence Segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over the sentences.
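
To illustrate the idea of a BreakIterator-backed sentence segmenter, here is a minimal sketch that uses only `java.text.BreakIterator`. The object `SentenceSketch`, the method `sentences`, and the default locale are hypothetical names and choices made for this example; they are not the library's API.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper for illustration only; not part of the library.
object SentenceSketch {
  def sentences(text: String, locale: Locale = Locale.US): Iterator[String] = {
    val breaker = BreakIterator.getSentenceInstance(locale)
    breaker.setText(text)
    // Walk the boundaries lazily: first() gives the start, next() advances
    // until it returns BreakIterator.DONE.
    val boundaries = Iterator(breaker.first()) ++
      Iterator.continually(breaker.next()).takeWhile(_ != BreakIterator.DONE)
    // Each pair of consecutive boundaries delimits one sentence.
    boundaries.sliding(2).collect {
      case Seq(start, end) => text.substring(start, end)
    }
  }
}
```

Note that the spans produced this way include any whitespace that follows a sentence, so callers may want to `trim` each result.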
A Word Segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over the words. Whitespace is not returned, but punctuation is.
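
The behavior described above (punctuation kept, whitespace dropped) can be sketched with `java.text.BreakIterator` directly. The object `WordSketch` and the method `words` are hypothetical names for this example and are not taken from the library.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper for illustration only; not part of the library.
object WordSketch {
  def words(text: String, locale: Locale = Locale.US): Iterator[String] = {
    val breaker = BreakIterator.getWordInstance(locale)
    breaker.setText(text)
    // Enumerate word boundaries until BreakIterator.DONE is reached.
    val boundaries = Iterator(breaker.first()) ++
      Iterator.continually(breaker.next()).takeWhile(_ != BreakIterator.DONE)
    boundaries.sliding(2)
      .collect { case Seq(start, end) => text.substring(start, end) }
      // The word iterator also reports runs of whitespace as segments; drop them.
      .filterNot(_.forall(_.isWhitespace))
  }
}
```

For example, `WordSketch.words("Hello, world!")` yields the comma and the exclamation mark as their own tokens, while the space between the words is filtered out.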
Finds all occurrences of the given pattern in the document.
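
As a rough sketch of search-based tokenization, the standard library's `scala.util.matching.Regex` can report every match together with its character offsets. The object `SearchSketch` and the method `findAll` are hypothetical names for this example; the library's own class may expose its results differently.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration only; not part of the library.
object SearchSketch {
  // Returns (begin, end, matchedText) for every occurrence of the pattern.
  def findAll(pattern: Regex, document: String): Iterator[(Int, Int, String)] =
    pattern.findAllMatchIn(document).map(m => (m.start, m.end, m.matched))
}
```

Keeping the offsets alongside the matched text makes it possible to map each token back to its position in the original document.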
Splits the input document according to the given pattern. The text matched by the pattern (the delimiters) is not returned.
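
Split-based tokenization can be sketched with `Regex.split` from the Scala standard library, which returns only the pieces between matches. The object `SplitSketch` and the method `splitOn` are hypothetical names for this example and are not the library's API.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration only; not part of the library.
object SplitSketch {
  // Returns the text between matches; the matched delimiters are discarded.
  def splitOn(pattern: Regex, document: String): Seq[String] =
    pattern.split(document).toSeq
}
```

This is the complement of the search approach: the pattern describes what separates tokens rather than what the tokens look like.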
TODO
Abstract trait for tokenizers, which annotate sentence-segmented text with tokens. Tokenizers work with both raw strings and epic.slab.StringSlabs.
Tokenizes by splitting on the regular expression \s+.
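
Splitting on `\s+` can be sketched with plain `String.split`. The object `WhitespaceSketch` and the method `tokens` are hypothetical names for this example. Dropping empty strings is an assumption added here, because Java's `split` produces an empty first element when the input starts with whitespace; the library's tokenizer may treat that case differently.

```scala
// Hypothetical helper for illustration only; not part of the library.
object WhitespaceSketch {
  def tokens(text: String): Seq[String] =
    // Split on runs of whitespace, then drop the empty leading element that
    // String.split produces when the text begins with whitespace.
    text.split("\\s+").toSeq.filter(_.nonEmpty)
}
```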
A simple regex sentence segmenter.
Just a simple thing for me to learn Tika
TODO
TODO