Class BertWordPieceTokenizerFactory

    • Constructor Detail

      • BertWordPieceTokenizerFactory

        public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab,
                                             boolean lowerCaseOnly,
                                             boolean stripAccents)
        Parameters:
        vocab - Vocabulary, as a navigable map
        lowerCaseOnly - If true, tokenization converts all characters to lower case
        stripAccents - If true, accents are stripped from characters. Usually set to the same value as lowerCaseOnly. Should be true when using the official "uncased" BERT TensorFlow models
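To make the effect of the two flags concrete, the following is a minimal sketch of lower-casing and accent stripping using only the Java standard library. It is an illustration of the general technique, not the library's implementation: the class name UncasedNormalizationSketch and the methods normalize and removeAccents are hypothetical, and the assumption that accents are removed by NFD decomposition followed by dropping combining marks is ours.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative only: hypothetical helper, not part of BertWordPieceTokenizerFactory.
class UncasedNormalizationSketch {

    // Decomposes characters (NFD) and drops non-spacing combining marks,
    // so an accented letter is reduced to its base letter.
    static String removeAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    // Applies the two flags independently, mirroring the constructor parameters.
    static String normalize(String text, boolean lowerCaseOnly, boolean stripAccents) {
        String result = text;
        if (lowerCaseOnly) {
            result = result.toLowerCase(Locale.ROOT);
        }
        if (stripAccents) {
            result = removeAccents(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Caf\u00e9 Z\u00fcrich" is "Café Zürich" written with Unicode escapes.
        System.out.println(normalize("Caf\u00e9 Z\u00fcrich", true, true)); // prints "cafe zurich"
    }
}
```

With both flags set to true, text is reduced to the lower-case, unaccented form that "uncased" vocabularies are built from; with both set to false, the text is passed through unchanged.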
      • BertWordPieceTokenizerFactory

        public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab,
                                             TokenPreProcess preTokenizePreProcessor)
        Parameters:
        vocab - Vocabulary, as a navigable map
        preTokenizePreProcessor - The preprocessor that should be used on the raw strings, before splitting
      • BertWordPieceTokenizerFactory

        public BertWordPieceTokenizerFactory(File pathToVocab,
                                             boolean lowerCaseOnly,
                                             boolean stripAccents,
                                             @NonNull Charset charset)
                                      throws IOException
        Creates a BertWordPieceTokenizerFactory and loads the vocabulary from the specified file.
        The expected format is a newline (\n) separated list of tokens, one vocabulary entry per line
        Parameters:
        pathToVocab - Path to vocabulary file
        lowerCaseOnly - If true, tokenization converts all characters to lower case
        stripAccents - If true, accents are stripped from characters. Usually set to the same value as lowerCaseOnly. Should be true when using the official "uncased" BERT TensorFlow models
        charset - Character set for the file
        Throws:
        IOException - If an error occurs reading the vocab file
      • BertWordPieceTokenizerFactory

        public BertWordPieceTokenizerFactory(InputStream vocabInputStream,
                                             boolean lowerCaseOnly,
                                             boolean stripAccents,
                                             @NonNull Charset charset)
                                      throws IOException
        Creates a BertWordPieceTokenizerFactory and loads the vocabulary from the specified input stream.
        The expected format is a newline (\n) separated list of tokens, one vocabulary entry per line
        Parameters:
        vocabInputStream - Input stream to load vocabulary
        lowerCaseOnly - If true, tokenization converts all characters to lower case
        stripAccents - If true, accents are stripped from characters. Usually set to the same value as lowerCaseOnly. Should be true when using the official "uncased" BERT TensorFlow models
        charset - Character set for the vocab stream
        Throws:
        IOException - If an error occurs reading the vocab stream
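The newline-separated vocabulary format described for the file and input stream constructors can be sketched as follows. This is a standard-library illustration of how such a stream might be read into a NavigableMap<String,Integer>, not the library's own loader. The class VocabFormatSketch and its readVocab methods are hypothetical, and two points are assumptions: that each token's index is its zero-based line number (the usual convention for BERT vocab.txt files), and that natural key ordering is acceptable. The ordering the factory expects of a caller-supplied map is not stated in this section, so check the library source before passing a hand-built map to the NavigableMap constructors.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: hypothetical helper, not part of BertWordPieceTokenizerFactory.
class VocabFormatSketch {

    // Reads one token per line and assigns each token its zero-based line number.
    // ASSUMPTION: index == line number; natural String ordering for the map.
    static NavigableMap<String, Integer> readVocab(InputStream in, Charset charset) throws IOException {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            int index = 0;
            while ((line = reader.readLine()) != null) {
                vocab.put(line, index++);
            }
        }
        return vocab;
    }

    // Convenience overload for in-memory vocabulary text (UTF-8).
    static NavigableMap<String, Integer> readVocab(String contents) {
        byte[] bytes = contents.getBytes(StandardCharsets.UTF_8);
        try {
            return readVocab(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab = readVocab("[PAD]\n[UNK]\nhello\n##ing\n");
        System.out.println(vocab.get("hello")); // prints 2
    }
}
```

A file-based vocabulary can be handled the same way by wrapping the file in a FileInputStream; the charset argument plays the same role as the charset parameter documented above.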