Class BertWordPieceTokenizerFactory
- java.lang.Object
- org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory
- All Implemented Interfaces: TokenizerFactory
public class BertWordPieceTokenizerFactory extends Object implements TokenizerFactory
Constructor Summary
Constructors
- BertWordPieceTokenizerFactory(File pathToVocab, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset)
  Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified file.
  The expected format is a \n-separated list of tokens for vocab entries.
- BertWordPieceTokenizerFactory(InputStream vocabInputStream, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset)
  Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified input stream.
  The expected format is a \n-separated list of tokens for vocab entries.
- BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, boolean lowerCaseOnly, boolean stripAccents)
- BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor)
-
Method Summary
All Methods | Static Methods | Instance Methods | Concrete Methods
- Tokenizer create(InputStream toTokenize)
  Create a tokenizer based on an input stream.
- Tokenizer create(String toTokenize)
  The tokenizer to create.
- Map<String,Integer> getVocab()
- static NavigableMap<String,Integer> loadVocab(File vocabFile, Charset charset)
- static NavigableMap<String,Integer> loadVocab(InputStream is, Charset charset)
  The expected format is a \n-separated list of tokens for vocab entries, e.g.
  foo
  bar
  baz
  The tokens should not have any whitespace on either side.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory
getTokenPreProcessor, setTokenPreProcessor
-
Constructor Detail
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, boolean lowerCaseOnly, boolean stripAccents)
- Parameters:
  vocab - Vocabulary, as a navigable map
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
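As an illustration (not part of the Javadoc), a NavigableMap vocabulary can be built with a TreeMap; the helper below is hypothetical and simply maps each token to its position, the same scheme a \n-separated vocab file implies (id = line number):

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VocabMapExample {
    // Illustrative helper: build a NavigableMap vocabulary from an ordered
    // token list, assigning each token its index as the id.
    static NavigableMap<String, Integer> buildVocab(List<String> tokens) {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            vocab.put(tokens.get(i), i);
        }
        return vocab;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab =
                buildVocab(List.of("[UNK]", "hello", "world"));
        System.out.println(vocab.get("hello")); // id assigned from list position
    }
}
```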
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor)
- Parameters:
  vocab - Vocabulary, as a navigable map
  preTokenizePreProcessor - The preprocessor that should be used on the raw strings, before splitting
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(File pathToVocab, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset) throws IOException
Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified file.
The expected format is a \n-separated list of tokens for vocab entries.
- Parameters:
  pathToVocab - Path to vocabulary file
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
  charset - Character set for the file
- Throws:
  IOException - If an error occurs reading the vocab file
-
BertWordPieceTokenizerFactory
public BertWordPieceTokenizerFactory(InputStream vocabInputStream, boolean lowerCaseOnly, boolean stripAccents, @NonNull Charset charset) throws IOException
Create a BertWordPieceTokenizerFactory, loading the vocabulary from the specified input stream.
The expected format for the vocabulary is a \n-separated list of tokens for vocab entries.
- Parameters:
  vocabInputStream - Input stream to load vocabulary from
  lowerCaseOnly - If true: tokenization should convert all characters to lower case
  stripAccents - If true: strip accents off characters. Usually same as lower case. Should be true when using "uncased" official BERT TensorFlow models
  charset - Character set for the vocab stream
- Throws:
  IOException - If an error occurs reading the vocab stream
-
-
Method Detail
-
create
public Tokenizer create(String toTokenize)
Description copied from interface: TokenizerFactory
The tokenizer to create.
- Specified by:
  create in interface TokenizerFactory
- Parameters:
  toTokenize - the string to create the tokenizer with
- Returns:
  the new tokenizer
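A minimal usage sketch, assuming deeplearning4j-nlp is on the classpath; the two-token vocabulary is illustrative only (a real BERT vocab.txt has roughly 30k entries), so no particular token output is claimed:

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class BertTokenizeExample {
    public static void main(String[] args) {
        // Toy vocabulary for illustration; index = position in the vocab file.
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        vocab.put("hello", 0);
        vocab.put("world", 1);

        // lowerCaseOnly = true, stripAccents = true, matching "uncased" BERT models
        TokenizerFactory factory = new BertWordPieceTokenizerFactory(vocab, true, true);
        Tokenizer tokenizer = factory.create("Hello World");
        System.out.println(tokenizer.getTokens());
    }
}
```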
-
create
public Tokenizer create(InputStream toTokenize)
Description copied from interface: TokenizerFactory
Create a tokenizer based on an input stream.
- Specified by:
  create in interface TokenizerFactory
- Returns:
  the new tokenizer
-
loadVocab
public static NavigableMap<String,Integer> loadVocab(InputStream is, Charset charset) throws IOException
The expected format is a \n-separated list of tokens for vocab entries, e.g.
foo
bar
baz
The tokens should not have any whitespace on either side.
- Parameters:
  is - InputStream
- Returns:
  A vocab map with the proper sort order for fast traversal
- Throws:
  IOException
-
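To make the documented format concrete, the sketch below reads a \n-separated token stream into a NavigableMap the way the description implies (one token per line, no surrounding whitespace, id = line number). The readVocab helper is illustrative, not the library's implementation:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VocabFormatExample {
    // Illustrative reader for the documented vocab format: one token per
    // line, no surrounding whitespace; each token's id is its line number.
    static NavigableMap<String, Integer> readVocab(InputStream is, Charset charset) throws IOException {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(is, charset))) {
            String line;
            int index = 0;
            while ((line = r.readLine()) != null) {
                vocab.put(line, index++);
            }
        }
        return vocab;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream("foo\nbar\nbaz".getBytes(StandardCharsets.UTF_8));
        NavigableMap<String, Integer> vocab = readVocab(is, StandardCharsets.UTF_8);
        System.out.println(vocab); // keys sorted by the map, ids from line order
    }
}
```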
loadVocab
public static NavigableMap<String,Integer> loadVocab(File vocabFile, Charset charset) throws IOException
- Throws:
IOException