Interface Tokenizer.Configuration

Enclosing interface:
Tokenizer

public static interface Tokenizer.Configuration
A nested interface representing the configuration options for this tokenizer. Implementors of this interface can set the maximum segment size, the maximum overlap between adjacent segments, the type of tokenization being performed, and the name of the underlying model.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    setMaxOverlap(int maxOverlap)
    Sets the maximum overlap between adjacent segments, where an overlap is defined as the number of characters that two adjacent segments have in common.
    void
    setMaxSegmentSize(int maxSegmentSize)
    Sets the maximum size of the segment to be tokenized and produced by the tokenizer.
    void
    setMaxTokens(int maxTokens)
    Deprecated, for removal: This API element is subject to removal in a future version.
    void
    setModelName(String type)
    Sets the underlying model used by the application.
    void
    setType(String type)
    Sets the type of tokenization being performed by this tokenizer.
  • Method Details

    • setMaxSegmentSize

      void setMaxSegmentSize(int maxSegmentSize)
      Sets the maximum size of the segment to be tokenized and produced by the tokenizer. The size is measured either in tokens, in which case the model name must be provided via setModelName, or in characters.
      Parameters:
      maxSegmentSize - the new maximum size of the segment
    • setMaxTokens

      @Deprecated(since="4.12.0", forRemoval=true) void setMaxTokens(int maxTokens)
      Deprecated, for removal: This API element is subject to removal in a future version.
      Sets the maximum number of tokens to be produced by the tokenizer. Use setMaxSegmentSize instead.
      Parameters:
      maxTokens - the new maximum number of tokens
    • setMaxOverlap

      void setMaxOverlap(int maxOverlap)
      Sets the maximum overlap between adjacent segments, where an overlap is defined as the number of characters that two adjacent segments have in common.
      Parameters:
      maxOverlap - the new maximum overlap
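
The relationship between the maximum segment size and the maximum overlap can be illustrated with a minimal, character-based sketch. The class `OverlapSketch` and its `split` method are hypothetical and are not part of the Tokenizer API. The sketch also assumes that every pair of adjacent segments shares exactly `maxOverlap` characters, whereas a real implementation may treat the value only as an upper bound.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Tokenizer API.
final class OverlapSketch {

    // Splits text into segments of at most maxSegmentSize characters, where each
    // segment begins with the last maxOverlap characters of the previous one.
    static List<String> split(String text, int maxSegmentSize, int maxOverlap) {
        if (maxSegmentSize <= 0 || maxOverlap < 0 || maxOverlap >= maxSegmentSize) {
            throw new IllegalArgumentException("require 0 <= maxOverlap < maxSegmentSize");
        }
        List<String> segments = new ArrayList<>();
        int step = maxSegmentSize - maxOverlap; // how far each segment start advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxSegmentSize, text.length());
            segments.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the current segment already reaches the end of the text
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Segment size 6, overlap 2: adjacent segments share two characters.
        System.out.println(split("abcdefghijkl", 6, 2)); // prints [abcdef, efghij, ijkl]
    }
}
```

Here "ef" is shared by the first two segments and "ij" by the last two, which is the overlap described above.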
    • setType

      void setType(String type)
      Sets the type of tokenization being performed by this tokenizer. The accepted values are typically specific to the implementation.
      Parameters:
      type - the tokenization type
    • setModelName

      void setModelName(String type)
      Sets the underlying model used by the application. This is useful when the cost of processing a given text with that model must be known in advance. Providing a model name effectively switches the computation of segment sizes to tokens.
      Parameters:
      type - the name of the underlying model
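
As a rough illustration of how an implementor might honor these setters, the sketch below uses a local stand-in interface, `TokenizerConfigurationSketch`, that mirrors the non-deprecated methods documented here. It is not the library's own type. The class `SimpleConfiguration`, its `isTokenBased` and `describe` methods, and the values "paragraph" and "example-model" are all hypothetical. The only behavior taken from the descriptions above is that supplying a model name switches the segment size from characters to tokens.

```java
// Local stand-in mirroring the documented methods; NOT the library's interface.
interface TokenizerConfigurationSketch {
    void setMaxSegmentSize(int maxSegmentSize);
    void setMaxOverlap(int maxOverlap);
    void setType(String type);
    void setModelName(String type);
}

// Hypothetical implementation that simply stores the configured values.
final class SimpleConfiguration implements TokenizerConfigurationSketch {
    private int maxSegmentSize;
    private int maxOverlap;
    private String type;
    private String modelName;

    @Override
    public void setMaxSegmentSize(int maxSegmentSize) { this.maxSegmentSize = maxSegmentSize; }

    @Override
    public void setMaxOverlap(int maxOverlap) { this.maxOverlap = maxOverlap; }

    @Override
    public void setType(String type) { this.type = type; }

    @Override
    public void setModelName(String type) { this.modelName = type; }

    // Assumption based on the description: a model name means sizes are counted in tokens.
    boolean isTokenBased() { return modelName != null && !modelName.isEmpty(); }

    String describe() {
        String unit = isTokenBased() ? "tokens" : "characters";
        return "type=" + type + ", maxSegmentSize=" + maxSegmentSize + " " + unit
                + ", maxOverlap=" + maxOverlap;
    }
}

final class ConfigurationDemo {
    public static void main(String[] args) {
        SimpleConfiguration cfg = new SimpleConfiguration();
        cfg.setType("paragraph");           // hypothetical tokenization type
        cfg.setMaxSegmentSize(512);
        cfg.setMaxOverlap(32);
        System.out.println(cfg.describe()); // segment size reported in characters

        cfg.setModelName("example-model");  // hypothetical model name
        System.out.println(cfg.describe()); // segment size now reported in tokens
    }
}
```

The deprecated setMaxTokens is left out of the stand-in on purpose, since setMaxSegmentSize is its documented replacement.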