Run the BPE algorithm. The goal is to match each word against the largest token in the known vocabulary. If no match is possible, the word is split into smaller subwords until all pieces are known.
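The greedy merge loop described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the function name `bpe` and the `ranks` mapping (pair of strings to merge priority) are assumptions.

```python
def bpe(word, ranks):
    """Greedily merge the lowest-ranked adjacent pair until none remain."""
    pieces = list(word)
    while len(pieces) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge left; the remaining pieces are the subwords
        # Merge every occurrence of the best pair.
        merged, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces
```

A word covered by the merge table collapses to one token, while an unknown word falls back to smaller known pieces, down to single characters in the worst case.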
Array of TokenPieces corresponding to the encoded token.
Cache for already encoded tokens.
Rankings for the byte pairs, derived from merges.txt.
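Building the rankings might look like the sketch below: each line of merges.txt names a pair, and its position in the file is its rank (earlier lines merge first). The function name `load_ranks` and the header-skipping logic are assumptions.

```python
def load_ranks(lines):
    """Build a pair -> rank mapping from merges.txt lines."""
    ranks = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the version header
        a, b = line.split()
        ranks[(a, b)] = len(ranks)  # rank = order of appearance
    return ranks
```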
Create the sequence of byte pairs of the word.
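A minimal sketch of this step, assuming the word is held as a list of current pieces; the helper name `get_pairs` is an assumption.

```python
def get_pairs(pieces):
    """Return the set of adjacent symbol pairs in the word's current pieces."""
    return {(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)}
```

The resulting set is what gets looked up against the pair rankings on each merge iteration.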
Special tokens of the model, used during processing.
Split the individual sub-texts on special tokens, e.g. for masking.
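One common way to do this split is with a regex whose alternatives are the escaped special tokens, kept as their own segments via a capture group. This is a hedged sketch, not the project's implementation; the longest-first ordering is an assumption to make overlapping specials match greedily.

```python
import re

def split_on_special(text, special_tokens):
    """Split text so each special token becomes a standalone segment."""
    # Longest-first so tokens that contain other tokens win the match.
    pattern = "|".join(
        re.escape(t) for t in sorted(special_tokens, key=len, reverse=True)
    )
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]  # drop empty segments
```

Ordinary segments then go through the BPE encoder, while special segments map directly to their reserved token ids.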
Tokenize, taking special tokens and the split algorithm into account.
Needs to be implemented