Class SimpleTokenizer

java.lang.Object
com.yahoo.language.simple.SimpleTokenizer
All Implemented Interfaces:
Tokenizer

public class SimpleTokenizer extends Object implements Tokenizer

A tokenizer which splits on whitespace, normalizes and transforms tokens using the given implementations, and stems using the kstem algorithm.

This class is not thread-safe.
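
Because an instance must not be shared between threads, one common pattern is to keep one instance per thread. The sketch below illustrates that pattern with ThreadLocal. StatefulSplitter is a hypothetical stand-in written for this example, not the Vespa class, and its mutable buffer is an assumption made only to show why sharing would be unsafe.

```java
import java.util.ArrayList;
import java.util.List;

public class PerThreadSketch {

    /** Hypothetical stand-in for a tokenizer that keeps mutable state; not the Vespa class. */
    static class StatefulSplitter {
        // Mutable working state: unsafe if two threads call split() on the same instance
        private final List<String> buffer = new ArrayList<>();

        List<String> split(String input) {
            buffer.clear();
            for (String part : input.trim().split("\\s+")) {
                if (!part.isEmpty()) buffer.add(part);
            }
            return new ArrayList<>(buffer); // hand out a copy, keep the buffer private
        }
    }

    // Each thread lazily gets its own instance, so no instance is ever shared
    private static final ThreadLocal<StatefulSplitter> SPLITTER =
            ThreadLocal.withInitial(StatefulSplitter::new);

    static List<String> splitOnCurrentThread(String input) {
        return SPLITTER.get().split(input);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> System.out.println(splitOnCurrentThread("one two")));
        Thread b = new Thread(() -> System.out.println(splitOnCurrentThread("three four five")));
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The same idea should carry over to a real tokenizer instance: create it inside the ThreadLocal initializer, or create a fresh instance per request, rather than storing one in a shared field.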

Author:
Mathias Mølster Lidal, bratseth
  • Constructor Details

  • Method Details

    • tokenize

      public Iterable<Token> tokenize(String input, Language language, StemMode stemMode, boolean removeAccents)
      Description copied from interface: Tokenizer
      Returns the tokens produced from an input string under the rules of the given Language and additional options.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      input - the string to tokenize. May be arbitrarily large.
      language - the language of the input string.
      stemMode - the stem mode applied to the returned tokens.
      removeAccents - if true, accents and similar marks are removed from the returned tokens.
      Returns:
      the tokens of the input String.
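
To make the effect of whitespace splitting and the removeAccents flag concrete, here is a minimal standalone sketch using only the JDK. It is not the SimpleTokenizer implementation: the class TokenizeSketch and its tokenize method are hypothetical, stemming and the language parameter are left out, and accent removal is approximated with Unicode decomposition.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {

    /** Splits on whitespace and optionally strips combining accent marks (illustration only). */
    static List<String> tokenize(String input, boolean removeAccents) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (part.isEmpty()) continue; // blank input yields no tokens
            String token = part;
            if (removeAccents) {
                // Decompose accented characters (NFD), then drop the combining marks
                token = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}+", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Cr\u00e8me br\u00fbl\u00e9e" is "Crème brûlée" written with Unicode escapes
        String input = "Cr\u00e8me  br\u00fbl\u00e9e";
        System.out.println(tokenize(input, true));
        System.out.println(tokenize(input, false));
    }
}
```

With removeAccents set to true the accented letters are reduced to their base letters; with false the tokens are returned as split. A real tokenizer would additionally apply the stemming selected by stemMode.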