Class BertWordPieceTokenizer
- java.lang.Object
  - org.deeplearning4j.text.tokenization.tokenizer.BertWordPieceTokenizer
- All Implemented Interfaces: Tokenizer
- Direct Known Subclasses: BertWordPieceStreamTokenizer
public class BertWordPieceTokenizer extends Object implements Tokenizer
-
Field Summary
Fields (Modifier and Type, Field)
- static Pattern splitPattern
-
Constructor Summary
Constructors
- BertWordPieceTokenizer(String tokens, NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor, TokenPreProcess tokenPreProcess)
-
Method Summary
All Methods (all are concrete instance methods; Modifier and Type, Method, Description)
- protected void checkIfEmpty(Map<String,Integer> m, String candidate)
- int countTokens(): The number of tokens in the tokenizer.
- protected String findLongestSubstring(NavigableMap<String,Integer> vocab, String candidate)
- List<String> getTokens(): Returns a list of all the tokens.
- boolean hasMoreTokens(): Reports whether any tokens are left to iterate over.
- String nextToken(): The next token (usually a word) in the string.
- void setTokenPreProcessor(TokenPreProcess tokenPreProcessor): Sets the token preprocessor.
-
Field Detail
-
splitPattern
public static final Pattern splitPattern
-
Constructor Detail
-
BertWordPieceTokenizer
public BertWordPieceTokenizer(String tokens, NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor, TokenPreProcess tokenPreProcess)
-
Method Detail
-
hasMoreTokens
public boolean hasMoreTokens()
Description copied from interface: Tokenizer
Reports whether any tokens are left to iterate over.
- Specified by: hasMoreTokens in interface Tokenizer
- Returns: whether there are any more tokens to iterate over
-
countTokens
public int countTokens()
Description copied from interface: Tokenizer
The number of tokens in the tokenizer.
- Specified by: countTokens in interface Tokenizer
- Returns: the number of tokens
-
nextToken
public String nextToken()
Description copied from interface: Tokenizer
The next token (usually a word) in the string.
-
getTokens
public List<String> getTokens()
Description copied from interface: Tokenizer
Returns a list of all the tokens.
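The accessor methods hasMoreTokens, countTokens, nextToken and getTokens together describe a cursor over a token list. The following is a minimal sketch of that calling pattern, not the library's implementation: TokenCursor, ListTokenCursor and TokenLoopSketch are hypothetical stand-ins written for this example, and the assumption that getTokens and countTokens do not advance the cursor is ours, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mirrors the four accessor signatures documented here.
interface TokenCursor {
    boolean hasMoreTokens();
    int countTokens();
    String nextToken();
    List<String> getTokens();
}

// Hypothetical list-backed implementation, used only to make the sketch runnable.
class ListTokenCursor implements TokenCursor {
    private final List<String> tokens;
    private int position = 0;

    ListTokenCursor(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    public boolean hasMoreTokens() { return position < tokens.size(); }

    // Assumption: the count is the total number of tokens, not the number remaining.
    public int countTokens() { return tokens.size(); }

    public String nextToken() { return tokens.get(position++); }

    // Assumption: returns a copy and leaves the cursor position unchanged.
    public List<String> getTokens() { return new ArrayList<>(tokens); }
}

class TokenLoopSketch {
    // Typical consumption loop: check hasMoreTokens() before each nextToken().
    static List<String> drain(TokenCursor cursor) {
        List<String> out = new ArrayList<>();
        while (cursor.hasMoreTokens()) {
            out.add(cursor.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        TokenCursor cursor = new ListTokenCursor(Arrays.asList("un", "##aff", "##able"));
        System.out.println(drain(cursor));
    }
}
```

With the real class, the same loop would be written against an instance obtained from the constructor documented above.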
-
setTokenPreProcessor
public void setTokenPreProcessor(TokenPreProcess tokenPreProcessor)
Description copied from interface: Tokenizer
Sets the token preprocessor.
- Specified by: setTokenPreProcessor in interface Tokenizer
- Parameters: tokenPreProcessor - the token preprocessor to set
-
findLongestSubstring
protected String findLongestSubstring(NavigableMap<String,Integer> vocab, String candidate)
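This page gives no description for findLongestSubstring. The sketch below illustrates the general greedy longest-match idea behind WordPiece tokenization, under stated assumptions: the class WordPieceSketch and its methods longestPrefix and split are hypothetical names for this example, the "##" continuation prefix and the "[UNK]" fallback follow the usual BERT vocabulary convention rather than anything confirmed here, and the lookup uses plain containsKey calls, whereas the library method may exploit the ordering of the NavigableMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class WordPieceSketch {

    // Returns the longest prefix of candidate that is a vocabulary key, or null if none is.
    static String longestPrefix(NavigableMap<String, Integer> vocab, String candidate) {
        for (int end = candidate.length(); end > 0; end--) {
            String prefix = candidate.substring(0, end);
            if (vocab.containsKey(prefix)) {
                return prefix;
            }
        }
        return null;
    }

    // Greedily splits one word into pieces; pieces after the first carry the "##" prefix (assumed convention).
    static List<String> split(NavigableMap<String, Integer> vocab, String word) {
        List<String> pieces = new ArrayList<>();
        String rest = word;
        boolean first = true;
        while (!rest.isEmpty()) {
            String candidate = first ? rest : "##" + rest;
            String piece = longestPrefix(vocab, candidate);
            // No match, or a bare "##" match that would consume no characters: give up on the word.
            if (piece == null || (!first && piece.length() <= 2)) {
                return Collections.singletonList("[UNK]"); // assumed unknown-token marker
            }
            pieces.add(piece);
            rest = candidate.substring(piece.length());
            first = false;
        }
        return pieces;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        vocab.put("un", 0);
        vocab.put("##aff", 1);
        vocab.put("##able", 2);
        vocab.put("[UNK]", 3);
        System.out.println(split(vocab, "unaffable")); // prints [un, ##aff, ##able]
    }
}
```

Trying prefixes from longest to shortest keeps the sketch easy to verify at the cost of one map lookup per candidate length.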