Class BertWordPieceTokenizerFactory
- java.lang.Object
- org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory
- All Implemented Interfaces: TokenizerFactory
public class BertWordPieceTokenizerFactory extends Object implements TokenizerFactory
Constructor Summary
Constructors
- BertWordPieceTokenizerFactory(File pathToVocab, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset)
  Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified file.
  The expected format is a \n-separated list of tokens for vocab entries.
- BertWordPieceTokenizerFactory(InputStream vocabInputStream, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset)
  Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified input stream.
  The expected format is a \n-separated list of tokens for vocab entries.
- BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, boolean lowerCaseOnly, boolean stripAccents)
- BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor)
-
Method Summary
All Methods | Static Methods | Instance Methods | Concrete Methods
- Tokenizer create(InputStream toTokenize)
  Create a tokenizer based on an input stream.
- Tokenizer create(String toTokenize)
  The tokenizer to create.
- Map<String,Integer> getVocab()
- static NavigableMap<String,Integer> loadVocab(File vocabFile, Charset charset)
- static NavigableMap<String,Integer> loadVocab(InputStream is, Charset charset)
  The expected format is a \n-separated list of tokens for vocab entries, e.g.
  foo
  bar
  baz
  The tokens should not have any whitespace on either side.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory
getTokenPreProcessor, setTokenPreProcessor
-
Constructor Detail
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, boolean lowerCaseOnly, boolean stripAccents)
- Parameters:
  vocab - Vocabulary, as a navigable map
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
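As an illustration (not part of the Javadoc), a NavigableMap vocabulary can be built with a TreeMap; the helper below is hypothetical and simply maps each token to its position, the same scheme a \n-separated vocab file implies (id = line number):

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VocabMapExample {
    // Illustrative helper: build a NavigableMap vocabulary from an ordered
    // token list, assigning each token its index as the id.
    static NavigableMap<String, Integer> buildVocab(List<String> tokens) {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            vocab.put(tokens.get(i), i);
        }
        return vocab;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab =
                buildVocab(List.of("[UNK]", "hello", "world"));
        System.out.println(vocab.get("hello")); // id assigned from list position
    }
}
```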
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor)
- Parameters:
  vocab - Vocabulary, as a navigable map
  preTokenizePreProcessor - The preprocessor that should be used on the raw strings, before splitting
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(File pathToVocab, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset) throws IOException
Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified file.
The expected format is a \n-separated list of tokens for vocab entries.
- Parameters:
  pathToVocab - Path to vocabulary file
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
  charset - Character set for the file
- Throws:
  IOException - If an error occurs reading the vocab file
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(InputStream vocabInputStream, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset) throws IOException
Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified input stream.
The expected format for the vocabulary is a \n-separated list of tokens for vocab entries.
- Parameters:
  vocabInputStream - Input stream to load vocabulary from
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
  charset - Character set for the vocab stream
- Throws:
  IOException - If an error occurs reading the vocab stream
-
-
Method Detail
-
create
public Tokenizer create(String toTokenize)
Description copied from interface: TokenizerFactory
The tokenizer to create.
- Specified by:
  create in interface TokenizerFactory
- Parameters:
  toTokenize - the string to create the tokenizer with
- Returns:
  the new tokenizer
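A minimal usage sketch, assuming deeplearning4j-nlp is on the classpath; the two-token vocabulary is illustrative only (a real BERT vocab.txt has roughly 30k entries), so no particular token output is claimed:

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class BertTokenizeExample {
    public static void main(String[] args) {
        // Toy vocabulary for illustration; index = position in the vocab file.
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        vocab.put("hello", 0);
        vocab.put("world", 1);

        // lowerCaseOnly = true, stripAccents = true, matching "uncased" BERT models
        TokenizerFactory factory = new BertWordPieceTokenizerFactory(vocab, true, true);
        Tokenizer tokenizer = factory.create("Hello World");
        System.out.println(tokenizer.getTokens());
    }
}
```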
-
create
public Tokenizer create(InputStream toTokenize)
Description copied from interface: TokenizerFactory
Create a tokenizer based on an input stream.
- Specified by:
  create in interface TokenizerFactory
- Returns:
  the new tokenizer
-
loadVocab
public static NavigableMap<String,Integer> loadVocab(InputStream is, Charset charset) throws IOException
The expected format is a \n-separated list of tokens for vocab entries, e.g.
foo
bar
baz
The tokens should not have any whitespace on either side.
- Parameters:
  is - InputStream
- Returns:
  A vocab map with the proper sort order for fast traversal
- Throws:
  IOException
-
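To make the documented format concrete, the sketch below reads a \n-separated token stream into a NavigableMap the way the description implies (one token per line, no surrounding whitespace, id = line number). The readVocab helper is illustrative, not the library's implementation:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VocabFormatExample {
    // Illustrative reader for the documented vocab format: one token per
    // line, no surrounding whitespace; each token's id is its line number.
    static NavigableMap<String, Integer> readVocab(InputStream is, Charset charset) throws IOException {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(is, charset))) {
            String line;
            int index = 0;
            while ((line = r.readLine()) != null) {
                vocab.put(line, index++);
            }
        }
        return vocab;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream("foo\nbar\nbaz".getBytes(StandardCharsets.UTF_8));
        NavigableMap<String, Integer> vocab = readVocab(is, StandardCharsets.UTF_8);
        System.out.println(vocab); // keys sorted by the map, ids from line order
    }
}
```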
loadVocab
public static NavigableMap<String,Integer> loadVocab(File vocabFile, Charset charset) throws IOException
- Throws:
IOException