Class BertWordPieceTokenizer
- java.lang.Object
  - org.deeplearning4j.text.tokenization.tokenizer.BertWordPieceTokenizer
- All Implemented Interfaces:
Tokenizer
- Direct Known Subclasses:
BertWordPieceStreamTokenizer
public class BertWordPieceTokenizer extends Object implements Tokenizer
-
-
Field Summary
Fields
- static Pattern splitPattern
-
Constructor Summary
Constructors
- BertWordPieceTokenizer(String tokens, NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor, TokenPreProcess tokenPreProcess)
-
Method Summary
Methods
- protected void checkIfEmpty(Map<String,Integer> m, String candidate)
- int countTokens(): The number of tokens in the tokenizer.
- protected String findLongestSubstring(NavigableMap<String,Integer> vocab, String candidate)
- List<String> getTokens(): Returns a list of all the tokens.
- boolean hasMoreTokens(): Returns whether any tokens are left to iterate over.
- String nextToken(): Returns the next token (usually a word) in the string.
- void setTokenPreProcessor(TokenPreProcess tokenPreProcessor): Sets the token pre-processor.
-
-
-
Field Detail
-
splitPattern
public static final Pattern splitPattern
-
-
Constructor Detail
-
BertWordPieceTokenizer
public BertWordPieceTokenizer(String tokens, NavigableMap<String,Integer> vocab, TokenPreProcess preTokenizePreProcessor, TokenPreProcess tokenPreProcess)
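The constructor expects the vocabulary as a NavigableMap<String,Integer>. As a minimal sketch of how such a map might be built, the example below assigns each token its zero-based line index, which is the convention of BERT vocab.txt files. The class name VocabSketch and the method buildVocab are hypothetical, and the key ordering is an assumption: this page does not say which comparator the tokenizer expects, so check how BertWordPieceTokenizerFactory builds its map before relying on natural ordering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper, not part of DL4J.
class VocabSketch {
    // Maps each token to its zero-based position in the list,
    // mirroring the one-token-per-line layout of a BERT vocab.txt file.
    static NavigableMap<String, Integer> buildVocab(List<String> lines) {
        // Assumption: natural String ordering; the real tokenizer may expect another comparator.
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        int index = 0;
        for (String line : lines) {
            vocab.put(line, index++);
        }
        return vocab;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab =
                buildVocab(Arrays.asList("[UNK]", "un", "##aff", "##able"));
        System.out.println(vocab.get("##aff")); // prints 2
    }
}
```

With DL4J on the classpath, a map of this type would be passed as the second constructor argument, alongside the text to tokenize and the two TokenPreProcess instances.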
-
-
Method Detail
-
hasMoreTokens
public boolean hasMoreTokens()
Description copied from interface: Tokenizer
Returns whether any tokens are left to iterate over.
- Specified by:
hasMoreTokens in interface Tokenizer
- Returns:
whether there are any more tokens to iterate over
-
countTokens
public int countTokens()
Description copied from interface: Tokenizer
The number of tokens in the tokenizer.
- Specified by:
countTokens in interface Tokenizer
- Returns:
the number of tokens
-
nextToken
public String nextToken()
Description copied from interface: Tokenizer
Returns the next token (usually a word) in the string.
-
getTokens
public List<String> getTokens()
Description copied from interface: Tokenizer
Returns a list of all the tokens.
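The four methods above (hasMoreTokens, countTokens, nextToken, getTokens) together form an iterator-style contract. The following is a minimal sketch of that contract using a hypothetical list-backed stand-in, ListBackedTokenizerSketch, which is not a DL4J class. It assumes that countTokens reports the total number of tokens and that getTokens does not advance the cursor; this page does not confirm either detail for BertWordPieceTokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mirrors the method names listed on this page.
class ListBackedTokenizerSketch {
    private final List<String> tokens;
    private int cursor = 0;

    ListBackedTokenizerSketch(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    boolean hasMoreTokens() { return cursor < tokens.size(); }

    // Assumption: total count, not the number of tokens remaining.
    int countTokens() { return tokens.size(); }

    String nextToken() { return tokens.get(cursor++); }

    // Assumption: returns a copy and leaves the cursor untouched.
    List<String> getTokens() { return new ArrayList<>(tokens); }

    // Consumes the remaining tokens with the usual hasMoreTokens/nextToken loop.
    static List<String> drain(ListBackedTokenizerSketch t) {
        List<String> out = new ArrayList<>();
        while (t.hasMoreTokens()) {
            out.add(t.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        ListBackedTokenizerSketch t =
                new ListBackedTokenizerSketch(Arrays.asList("un", "##aff", "##able"));
        System.out.println(t.countTokens());   // prints 3
        System.out.println(drain(t));          // prints [un, ##aff, ##able]
        System.out.println(t.hasMoreTokens()); // prints false
    }
}
```

The same loop shape applies to any Tokenizer implementation: check hasMoreTokens before each nextToken call, or call getTokens when the whole list is needed at once.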
-
setTokenPreProcessor
public void setTokenPreProcessor(TokenPreProcess tokenPreProcessor)
Description copied from interface: Tokenizer
Sets the token pre-processor.
- Specified by:
setTokenPreProcessor in interface Tokenizer
- Parameters:
tokenPreProcessor - the token pre-processor to set
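To illustrate what a token pre-processor does, here is a minimal sketch in which a pre-processor is modelled as a function from one string to another and applied to each token in turn. The interface TokenPreProcessSketch and the class PreProcessSketch are hypothetical stand-ins, not DL4J types, and the sketch assumes that a null pre-processor means "leave tokens unchanged".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PreProcessSketch {
    // Hypothetical stand-in for a token pre-processor: one string in, one string out.
    interface TokenPreProcessSketch {
        String preProcess(String token);
    }

    // Applies the pre-processor to every token; a null pre-processor is treated as a no-op.
    static List<String> apply(List<String> tokens, TokenPreProcessSketch pre) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(pre == null ? token : pre.preProcess(token));
        }
        return out;
    }

    public static void main(String[] args) {
        TokenPreProcessSketch lower = t -> t.toLowerCase(Locale.ROOT);
        System.out.println(apply(Arrays.asList("Hello", "World"), lower)); // prints [hello, world]
    }
}
```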
-
findLongestSubstring
protected String findLongestSubstring(NavigableMap<String,Integer> vocab, String candidate)
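This page gives no description for findLongestSubstring. Judging by its name and signature, it likely performs the greedy longest-match step of WordPiece tokenization. The sketch below shows that general technique, not DL4J's implementation: the class LongestMatchSketch and its methods are hypothetical, the "##" continuation marker and the "[UNK]" fallback follow the original BERT convention, and how the real class handles an unmatched candidate is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of greedy longest-match WordPiece splitting.
class LongestMatchSketch {
    // Returns the longest prefix of candidate that is a vocabulary key, or null if none is.
    static String longestPrefixInVocab(NavigableMap<String, Integer> vocab, String candidate) {
        for (int end = candidate.length(); end > 0; end--) {
            String prefix = candidate.substring(0, end);
            if (vocab.containsKey(prefix)) {
                return prefix;
            }
        }
        return null;
    }

    // Splits one word into pieces; every piece after the first carries the "##" marker.
    static List<String> split(NavigableMap<String, Integer> vocab, String word) {
        List<String> pieces = new ArrayList<>();
        String rest = word;
        while (!rest.isEmpty()) {
            int markerLen = pieces.isEmpty() ? 0 : 2;
            String candidate = (markerLen == 0 ? "" : "##") + rest;
            String match = longestPrefixInVocab(vocab, candidate);
            // Characters of the word actually consumed, excluding the marker.
            int consumed = match == null ? 0 : match.length() - markerLen;
            if (consumed <= 0) {
                // Assumption: fall back to a single unknown token for the whole word.
                return new ArrayList<>(Arrays.asList("[UNK]"));
            }
            pieces.add(match);
            rest = rest.substring(consumed);
        }
        return pieces;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> vocab = new TreeMap<>();
        vocab.put("[UNK]", 0);
        vocab.put("un", 1);
        vocab.put("##aff", 2);
        vocab.put("##able", 3);
        System.out.println(split(vocab, "unaffable")); // prints [un, ##aff, ##able]
    }
}
```

The prefix scan here is linear in the candidate length per piece. A sorted map makes it possible to narrow the search with range views such as tailMap, which may be why the signature takes a NavigableMap rather than a plain Map.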
-
-