Class Tokenizer


  • public final class Tokenizer
    extends java.lang.Object
    Query tokenizer. Single-threaded.
    Author:
    bratseth
    • Constructor Summary

      Constructors 
      Constructor Description
      Tokenizer​(com.yahoo.language.Linguistics linguistics)
      Creates a tokenizer which is initialized from the given Linguistics.
    • Method Summary

      Modifier and Type Method Description
      void setSpecialTokens​(SpecialTokens specialTokens)
      Sets a list of tokens (Strings) which should be returned as WORD tokens regardless of their content.
      void setSubstringSpecialTokens​(boolean substringSpecialTokens)
      Sets whether special tokens are also recognized when they occur as substrings of other tokens, which is needed for CJK text.
      java.util.List<Token> tokenize​(java.lang.String string)
      Resets this tokenizer and creates tokens from the given string, using "default" as the default index and no index information.
      java.util.List<Token> tokenize​(java.lang.String string, IndexFacts.Session indexFacts)
      Resets this tokenizer and creates tokens from the given string, using "default" as the default index.
      java.util.List<Token> tokenize​(java.lang.String string, java.lang.String defaultIndexName, IndexFacts.Session indexFacts)
      Resets this tokenizer and creates tokens from the given string.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Tokenizer

        public Tokenizer​(com.yahoo.language.Linguistics linguistics)
        Creates a tokenizer which is initialized from the given Linguistics.
    • Method Detail

      • setSpecialTokens

        public void setSpecialTokens​(SpecialTokens specialTokens)
        Sets a list of tokens (Strings) which should be returned as WORD tokens regardless of their content. The list is used directly by the tokenizer, so it must not be modified after this call. The tokenizer itself will not modify it. Special tokens are case-sensitive.
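
        The effect of a special-token list can be illustrated with a minimal, self-contained sketch. It does not use or reproduce the Vespa classes: SpecialTokenSketch, Tok, Kind and specialAt are hypothetical names, and the rule that a match must end on a word boundary is an assumption made for the illustration. It only shows the documented behavior that a listed string comes back as a single WORD token regardless of its content, and that matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative stand-in only: NOT the Vespa Tokenizer or SpecialTokens implementation.
final class SpecialTokenSketch {

    enum Kind { WORD, SPACE, OTHER }

    static final class Tok {
        final Kind kind;
        final String image;
        Tok(Kind kind, String image) { this.kind = kind; this.image = image; }
        @Override public String toString() { return kind + "(" + image + ")"; }
    }

    /** Returns the longest special token starting at pos that ends on a word boundary, or null. */
    static String specialAt(String input, int pos, Set<String> specials) {
        String best = null;
        for (String s : specials) {
            // String.startsWith compares exactly, so matching is case-sensitive
            if (s.isEmpty() || !input.startsWith(s, pos)) continue;
            int end = pos + s.length();
            boolean atBoundary = end == input.length()
                    || !Character.isLetterOrDigit(input.charAt(end));
            if (atBoundary && (best == null || s.length() > best.length())) best = s;
        }
        return best;
    }

    /** Splits input into tokens; a special token found at a token start becomes one WORD token. */
    static List<Tok> tokenize(String input, Set<String> specials) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            String special = specialAt(input, i, specials);
            char c = input.charAt(i);
            if (special != null) {
                tokens.add(new Tok(Kind.WORD, special));
                i += special.length();
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Tok(Kind.WORD, input.substring(start, i)));
            } else {
                Kind kind = Character.isWhitespace(c) ? Kind.SPACE : Kind.OTHER;
                tokens.add(new Tok(kind, String.valueOf(c)));
                i++;
            }
        }
        return Collections.unmodifiableList(tokens); // read-only result
    }

    public static void main(String[] args) {
        System.out.println(tokenize("c++ and C++", Collections.singleton("c++")));
    }
}
```

        In this sketch the lowercase "c++" is returned as one WORD token, while "C++" is split into a word and two punctuation tokens because the special token list only contains the lowercase form.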
      • setSubstringSpecialTokens

        public void setSubstringSpecialTokens​(boolean substringSpecialTokens)
        Sets whether special tokens are also recognized when they occur as substrings of other tokens, which is needed for CJK text. Defaults to false.
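
        Why substring matching matters for CJK can be shown with a small sketch. CJK text is usually written without spaces, so a special token such as 東京 ("Tokyo") is often directly followed by more letters, as in 東京都庁. The code below is a hypothetical illustration, not the Vespa implementation: SubstringSpecialTokenSketch and matchAt are invented names, and the word-boundary rule used when the flag is off is an assumption. The CJK strings are written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.Collections;
import java.util.List;

// Illustrative stand-in only: NOT the Vespa implementation.
final class SubstringSpecialTokenSketch {

    /**
     * Returns the first listed special token that starts at pos, or null.
     * When substring is false, the match must also start and end on a word boundary.
     */
    static String matchAt(String input, int pos, List<String> specials, boolean substring) {
        for (String s : specials) {
            if (s.isEmpty() || !input.startsWith(s, pos)) continue;
            int end = pos + s.length();
            boolean startsOnBoundary = pos == 0
                    || !Character.isLetterOrDigit(input.charAt(pos - 1));
            boolean endsOnBoundary = end == input.length()
                    || !Character.isLetterOrDigit(input.charAt(end));
            if (substring || (startsOnBoundary && endsOnBoundary)) return s;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> specials = Collections.singletonList("\u6771\u4EAC"); // "Tokyo"
        String text = "\u6771\u4EAC\u90FD\u5E81"; // unsegmented text starting with "Tokyo"
        System.out.println(matchAt(text, 0, specials, false)); // flag off
        System.out.println(matchAt(text, 0, specials, true));  // flag on
    }
}
```

        With the flag off, the sketch finds no match because the character after the candidate is itself a letter. With the flag on, the special token is recognized inside the longer run.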
      • tokenize

        public java.util.List<Token> tokenize​(java.lang.String string)
        Resets this tokenizer and creates tokens from the given string, using "default" as the default index and no index information.
        Returns:
        a read-only list of tokens. The list may only be used by the calling thread.
      • tokenize

        public java.util.List<Token> tokenize​(java.lang.String string,
                                              IndexFacts.Session indexFacts)
        Resets this tokenizer and creates tokens from the given string, using "default" as the default index.
        Returns:
        a read-only list of tokens. The list may only be used by the calling thread.
      • tokenize

        public java.util.List<Token> tokenize​(java.lang.String string,
                                              java.lang.String defaultIndexName,
                                              IndexFacts.Session indexFacts)
        Resets this tokenizer and creates tokens from the given string.
        Parameters:
        string - the string to tokenize
        defaultIndexName - the name of the index to use as default
        indexFacts - information about the indexes we will search
        Returns:
        a read-only list of tokens. The list may only be used by the calling thread.
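
        Since the tokenizer is single-threaded and the returned list is tied to the calling thread, one possible way to use such a class from several threads is to keep one instance per thread and copy the tokens before handing them to other code. The sketch below illustrates that pattern with a hypothetical stand-in: PerThreadTokenizerSketch and StatefulTokenizer are invented names, and the assumption that the returned list is backed by state which the next call resets is not confirmed by the documentation above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in with a similar threading contract; NOT the Vespa Tokenizer.
final class PerThreadTokenizerSketch {

    /** A stateful, single-threaded tokenizer: each call resets and reuses internal state. */
    static final class StatefulTokenizer {
        private final List<String> tokens = new ArrayList<>();
        private final List<String> view = Collections.unmodifiableList(tokens);

        List<String> tokenize(String input) {
            tokens.clear(); // reset: the result of the previous call is overwritten
            for (String part : input.trim().split("\\s+")) {
                if (!part.isEmpty()) tokens.add(part);
            }
            return view; // read-only view backed by this instance's state
        }
    }

    // One instance per thread, so no instance is ever shared between threads.
    private static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StatefulTokenizer::new);

    /** Tokenizes on the calling thread and returns an independent copy of the tokens. */
    static List<String> tokenizeCopy(String input) {
        return new ArrayList<>(TOKENIZER.get().tokenize(input));
    }

    public static void main(String[] args) {
        List<String> first = tokenizeCopy("foo bar");
        List<String> second = tokenizeCopy("baz");
        System.out.println(first + " " + second); // the first copy is unaffected by the second call
    }
}
```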