Class SimpleTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class SimpleTokenizer
    extends Object
    implements Tokenizer

    A tokenizer which splits on whitespace, normalizes and transforms tokens using the given implementations, and stems them using the kstem algorithm.

    This class is not thread-safe.
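
One common way to use a non-thread-safe object from several threads is to give each thread its own instance. The following is a minimal sketch of that pattern, not code from this library: StatefulTokenizerSketch and PerThreadTokenizers are hypothetical names, and the stand-in class only splits on whitespace so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer that keeps mutable state between calls.
// It is NOT the library's SimpleTokenizer; it only splits on whitespace.
final class StatefulTokenizerSketch {
    // Mutable state reused across calls: this is what makes sharing unsafe.
    private final StringBuilder buffer = new StringBuilder();

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        buffer.setLength(0);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(tokens);
            } else {
                buffer.append(c);
            }
        }
        flush(tokens);
        return tokens;
    }

    // Emits the buffered characters as one token, if there are any.
    private void flush(List<String> tokens) {
        if (buffer.length() > 0) {
            tokens.add(buffer.toString());
            buffer.setLength(0);
        }
    }
}

// Hypothetical holder that hands each thread its own tokenizer instance.
final class PerThreadTokenizers {
    private static final ThreadLocal<StatefulTokenizerSketch> LOCAL =
            ThreadLocal.withInitial(StatefulTokenizerSketch::new);

    static StatefulTokenizerSketch get() {
        return LOCAL.get();
    }
}
```

Calling PerThreadTokenizers.get() repeatedly on one thread returns the same instance, while another thread receives a separate one, so no mutable state is shared. Creating a new tokenizer per request is a simpler alternative when construction is cheap.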

    Author:
    Mathias Mølster Lidal, bratseth
    • Method Detail

      • tokenize

        public Iterable<Token> tokenize(String input,
                                        Language language,
                                        StemMode stemMode,
                                        boolean removeAccents)
        Description copied from interface: Tokenizer
        Returns the tokens produced from an input string under the rules of the given Language and the additional options.
        Specified by:
        tokenize in interface Tokenizer
        Parameters:
        input - the string to tokenize. May be arbitrarily large.
        language - the language of the input string.
        stemMode - the stem mode applied to the returned tokens
        removeAccents - if true, accents and similar marks are removed from the returned tokens
        Returns:
        the tokens of the input String.
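
To make the effect of whitespace splitting and the removeAccents option concrete, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not this class's implementation: WhitespaceTokenizerSketch is a hypothetical name, the lowercasing step and the use of java.text.Normalizer are assumptions, and the language and stemMode parameters (including kstem stemming) are left out entirely.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; NOT the library's SimpleTokenizer.
final class WhitespaceTokenizerSketch {

    /** Splits on runs of whitespace, lowercases, and optionally strips combining marks. */
    static List<String> tokenize(String input, boolean removeAccents) {
        List<String> tokens = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // only happens for empty or all-whitespace input
            }
            // Lowercasing is an assumption of this sketch.
            String token = raw.toLowerCase(Locale.ROOT);
            if (removeAccents) {
                // Decompose into base characters plus combining marks, then drop the marks.
                token = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}+", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source ASCII-only; the text is "Creme  brulee recipe" with accents.
        String input = "Cr\u00e8me  br\u00fbl\u00e9e recipe";
        System.out.println(tokenize(input, true)); // prints [creme, brulee, recipe]
        System.out.println(tokenize(input, false));
    }
}
```

Decomposition-based stripping only affects characters that have a canonical decomposition into a base letter plus combining marks; a letter such as "ø" passes through unchanged in this sketch.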