Interface Tokenizer

  • All Known Implementing Classes:
    OpenNlpTokenizer, SimpleTokenizer

    public interface Tokenizer
    Language-sensitive tokenization of a text string.
    Author:
    Mathias Mølster Lidal
    • Method Detail

      • tokenize

        Iterable<Token> tokenize​(String input,
                                 Language language,
                                 StemMode stemMode,
                                 boolean removeAccents)
        Returns the tokens produced from an input string under the rules of the given Language and the additional options.
        Parameters:
        input - the string to tokenize. May be arbitrarily large.
        language - the language of the input string.
        stemMode - the stem mode applied to the returned tokens
        removeAccents - if true, accents and similar marks are removed from the returned tokens
        Returns:
        the tokens of the input String.
        Throws:
        ProcessingException - if the underlying library throws an exception.
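
        The contract above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the nested Token, Language, StemMode and Tokenizer types are hypothetical stand-ins that only mirror the signature shown here (they are not the library's classes), and WhitespaceTokenizer is a toy implementation that ignores language and stemMode. It is not how OpenNlpTokenizer or SimpleTokenizer work.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerSketch {

    // Hypothetical stand-ins for the library types, so the sketch compiles on its own.
    enum Language { ENGLISH, NORWEGIAN_BOKMAL }
    enum StemMode { NONE, ALL }

    static final class Token {
        private final String text;
        Token(String text) { this.text = text; }
        String text() { return text; }
    }

    // Mirrors the documented method signature; an assumption, not the real interface.
    interface Tokenizer {
        Iterable<Token> tokenize(String input, Language language,
                                 StemMode stemMode, boolean removeAccents);
    }

    // Toy implementation: splits on anything that is not a letter or digit and
    // lowercases each piece. The language and stemMode arguments are ignored.
    static final class WhitespaceTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language,
                                        StemMode stemMode, boolean removeAccents) {
            List<Token> tokens = new ArrayList<>();
            for (String part : input.split("[^\\p{L}\\p{N}]+")) {
                if (part.isEmpty()) continue; // leading separator yields an empty piece
                String text = part.toLowerCase(Locale.ROOT);
                if (removeAccents) text = stripAccents(text);
                tokens.add(new Token(text));
            }
            return tokens;
        }

        // Decomposes characters (NFD) and drops the combining marks.
        private static String stripAccents(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
    }

    // Helper that collects the token strings into a list.
    static List<String> texts(Iterable<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.text());
        return out;
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        // Input is "Crème brûlée, please!" written with Unicode escapes.
        Iterable<Token> tokens = tokenizer.tokenize(
                "Cr\u00e8me br\u00fbl\u00e9e, please!",
                Language.ENGLISH, StemMode.NONE, true);
        for (String text : texts(tokens)) {
            System.out.println(text);
        }
    }
}
```

        Because tokenize returns an Iterable rather than a collection, a caller can consume tokens one at a time, which fits the note that the input may be arbitrarily large. The toy implementation above builds a full list for brevity; an implementation meant for large inputs would more likely produce tokens lazily.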
      • getReplacementTerm

        @Deprecated
        default String getReplacementTerm​(String tokenString)
        Deprecated.
        Replacements are already applied in the tokens returned by tokenize.
        Not used.