Interface Tokenizer


  • public interface Tokenizer
    Language-sensitive tokenization of a text string.
    Author:
    Mathias Mølster Lidal
    • Method Detail

      • tokenize

        java.lang.Iterable<Token> tokenize(java.lang.String input,
                                           Language language,
                                           StemMode stemMode,
                                           boolean removeAccents)
        Returns the tokens produced from an input string under the rules of the given Language and the additional options.
        Parameters:
        input - the string to tokenize. May be arbitrarily large.
        language - the language of the input string.
        stemMode - the stem mode applied to the returned tokens
        removeAccents - if true, accents and similar marks are removed from the returned tokens
        Returns:
        the tokens of the input String.
        Throws:
        ProcessingException - If the underlying library throws an Exception.
      • getReplacementTerm

        @Deprecated
        default java.lang.String getReplacementTerm(java.lang.String tokenString)
        Deprecated.
        Replacements are already applied in the tokens returned by tokenize.
        Not used.
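
Example (illustrative sketch only): the tokenize contract described above can be exercised with a toy implementation. Everything in this block is an assumption made for illustration. The nested Language, StemMode, Token and Tokenizer types are simplified, hypothetical stand-ins that only mirror the documented signature, not the library's real classes. SimpleTokenizer is a naive splitter that ignores language and stemMode, so it shows only the effect of removeAccents.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins: these nested types are NOT the real library classes.
class TokenizerSketch {

    enum Language { ENGLISH, FRENCH }   // assumed values, for illustration only
    enum StemMode { NONE, ALL }         // assumed values, for illustration only

    /** Simplified token that carries only its processed string. */
    static final class Token {
        private final String tokenString;
        Token(String tokenString) { this.tokenString = tokenString; }
        String getTokenString() { return tokenString; }
    }

    /** Mirrors the documented tokenize signature. */
    interface Tokenizer {
        Iterable<Token> tokenize(String input, Language language,
                                 StemMode stemMode, boolean removeAccents);
    }

    /** Toy implementation: splits on non-letter/non-digit runs and lowercases. */
    static final class SimpleTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language,
                                        StemMode stemMode, boolean removeAccents) {
            // language and stemMode are ignored in this sketch.
            List<Token> tokens = new ArrayList<>();
            for (String piece : input.split("[^\\p{L}\\p{N}]+")) {
                if (piece.isEmpty()) continue;  // leading delimiter yields ""
                String text = piece.toLowerCase(Locale.ROOT);
                if (removeAccents) {
                    // Decompose characters, then drop the combining marks.
                    text = Normalizer.normalize(text, Normalizer.Form.NFD)
                                     .replaceAll("\\p{M}", "");
                }
                tokens.add(new Token(text));
            }
            return tokens;
        }
    }

    /** Collects the token strings produced for an input. */
    static List<String> tokenStrings(Tokenizer tokenizer, String input,
                                     boolean removeAccents) {
        List<String> out = new ArrayList<>();
        for (Token token : tokenizer.tokenize(input, Language.FRENCH,
                                              StemMode.NONE, removeAccents)) {
            out.add(token.getTokenString());
        }
        return out;
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new SimpleTokenizer();
        // Input is "Crème brûlée, s'il vous plaît!" written with Unicode escapes.
        String input = "Cr\u00e8me br\u00fbl\u00e9e, s'il vous pla\u00eet!";
        // prints [creme, brulee, s, il, vous, plait]
        System.out.println(tokenStrings(tokenizer, input, true));
    }
}
```

A real implementation would typically delegate to a linguistics library and, per the Throws clause above, wrap that library's failures in ProcessingException. The sketch omits that step.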