Interface Tokenizer


  • public interface Tokenizer
    Language-sensitive tokenization of a text string.
    Author:
    Mathias Mølster Lidal
    • Method Summary

      Modifier and Type Method Description
      default java.lang.String getReplacementTerm(java.lang.String tokenString)
      Returns a replacement for an input token string.
      java.lang.Iterable<Token> tokenize(java.lang.String input, Language language, StemMode stemMode, boolean removeAccents)
      Returns the tokens produced from the input string under the rules of the given Language and the additional options.
    • Method Detail

      • tokenize

        java.lang.Iterable<Token> tokenize(java.lang.String input,
                                           Language language,
                                           StemMode stemMode,
                                           boolean removeAccents)
        Returns the tokens produced from the input string under the rules of the given Language and the additional options.
        Parameters:
        input - the string to tokenize. May be arbitrarily large.
        language - the language of the input string.
        stemMode - the stem mode applied to the returned tokens.
        removeAccents - if true, accents and similar diacritics are removed from the returned tokens.
        Returns:
        the tokens of the input string.
        Throws:
        ProcessingException - If the underlying library throws an Exception.
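        A minimal sketch of how an implementation might honor the tokenize contract, in particular the removeAccents flag. The Language, StemMode, Token and Tokenizer types below are simplified stand-ins declared only to keep the example self-contained; they are not the library's real definitions. The splitting rule, the lowercasing, and the class name SimpleTokenizer are all assumptions made for illustration, and the sketch ignores language and stemMode entirely.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizeSketch {

    // Simplified stand-ins for the library types (assumptions, not the real API).
    enum Language { ENGLISH }
    enum StemMode { NONE }
    interface Token { String getTokenString(); }

    interface Tokenizer {
        Iterable<Token> tokenize(String input, Language language,
                                 StemMode stemMode, boolean removeAccents);
    }

    // Hypothetical implementation: splits on runs of characters that are
    // neither letters nor digits, lowercases each part, and optionally
    // strips accents. language and stemMode are accepted but not used.
    static class SimpleTokenizer implements Tokenizer {
        @Override
        public Iterable<Token> tokenize(String input, Language language,
                                        StemMode stemMode, boolean removeAccents) {
            List<Token> tokens = new ArrayList<>();
            for (String part : input.split("[^\\p{L}\\p{Nd}]+")) {
                if (part.isEmpty()) continue; // leading separator yields ""
                String text = part.toLowerCase(Locale.ROOT);
                if (removeAccents) text = stripAccents(text);
                String tokenString = text; // effectively final copy for the lambda
                tokens.add(() -> tokenString);
            }
            return tokens;
        }

        // Decompose to NFD, then drop the combining marks.
        private static String stripAccents(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFD)
                             .replaceAll("\\p{M}+", "");
        }
    }

    // Helper that collects the token strings for easy inspection.
    static List<String> tokenStrings(String input, boolean removeAccents) {
        List<String> result = new ArrayList<>();
        Tokenizer tokenizer = new SimpleTokenizer();
        for (Token token : tokenizer.tokenize(input, Language.ENGLISH,
                                              StemMode.NONE, removeAccents)) {
            result.add(token.getTokenString());
        }
        return result;
    }

    public static void main(String[] args) {
        // "Crème brûlée, please!" written with escapes to avoid encoding issues.
        System.out.println(tokenStrings("Cr\u00e8me br\u00fbl\u00e9e, please!", true));
        // prints [creme, brulee, please]
    }
}
```

        Because the input may be arbitrarily large, a production implementation would probably produce tokens lazily instead of collecting them in a list as this sketch does.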
      • getReplacementTerm

        default java.lang.String getReplacementTerm(java.lang.String tokenString)
        Returns a replacement for an input token string. This method accepts strings returned by Token.getTokenString and returns the replacement that will be used as the index token. If there is no replacement, the input token string is returned.

        This default implementation always returns the input token string.

        Parameters:
        tokenString - the token string of the term to look up a replacement for
        Returns:
        the replacement, if any; otherwise the argument token string
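
        The contrast between the default behavior and an overriding implementation can be sketched as follows. The Tokenizer interface here is a stand-in that declares only getReplacementTerm, and the map-backed MappedTokenizer class and its replacement table are hypothetical; the source does not say how real implementations choose replacements.

```java
import java.util.Map;

class ReplacementTermSketch {

    // Stand-in declaring only the method discussed here (not the real interface).
    interface Tokenizer {
        // Mirrors the documented default: return the input unchanged.
        default String getReplacementTerm(String tokenString) {
            return tokenString;
        }
    }

    // Relies entirely on the default implementation.
    static class PlainTokenizer implements Tokenizer { }

    // Hypothetical override: looks the token string up in a fixed table and
    // falls back to the input when no replacement is registered.
    static class MappedTokenizer implements Tokenizer {
        private final Map<String, String> replacements;

        MappedTokenizer(Map<String, String> replacements) {
            this.replacements = Map.copyOf(replacements);
        }

        @Override
        public String getReplacementTerm(String tokenString) {
            return replacements.getOrDefault(tokenString, tokenString);
        }
    }

    public static void main(String[] args) {
        Tokenizer plain = new PlainTokenizer();
        Tokenizer mapped = new MappedTokenizer(Map.of("colour", "color"));

        System.out.println(plain.getReplacementTerm("colour"));  // prints colour
        System.out.println(mapped.getReplacementTerm("colour")); // prints color
        System.out.println(mapped.getReplacementTerm("token"));  // prints token
    }
}
```

        Falling back to the argument keeps the override consistent with the documented contract, so callers never need to handle a missing replacement.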