BertIterator is a MultiDataSetIterator for training BERT (Transformer) models in the following ways:
(a) Unsupervised - Masked language model task (no sentence matching task is implemented thus far)
(b) Supervised - For sequence classification (i.e., 1 label per sequence, typically used for fine tuning)
The task can be specified using BertIterator.Task.
Example for unsupervised training:
BertWordPieceTokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab);
BertIterator b = BertIterator.builder()
.tokenizer(t)
.lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 16)
.minibatchSize(2)
.sentenceProvider(<sentence provider here>)
.featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
.vocabMap(t.getVocab())
.task(BertIterator.Task.UNSUPERVISED)
.masker(new BertMaskedLMMasker(new Random(12345), 0.2, 0.5, 0.5))
.unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat.RANK2_IDX)
.maskToken("[MASK]")
.build();
Example for supervised (sequence classification - one label per sequence) training:
BertWordPieceTokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab);
BertIterator b = BertIterator.builder()
.tokenizer(t)
.lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 16)
.minibatchSize(2)
.sentenceProvider(new TestSentenceProvider())
.featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
.vocabMap(t.getVocab())
.task(BertIterator.Task.SEQ_CLASSIFICATION)
.build();
This iterator supports numerous ways of configuring the behaviour with respect to the sequence lengths and data layout.
BertIterator.LengthHandling configuration: determines how variable-length sequences are handled.
FIXED_LENGTH: Always trim longer sequences to the specified length, and always pad shorter sequences to the specified length.
ANY_LENGTH: Output length is determined by the length of the longest sequence in the minibatch. Shorter sequences within the
minibatch are zero padded and masked.
CLIP_ONLY: Sequences longer than the specified maximum are clipped to that length. If the longest sequence in a minibatch is shorter than the specified maximum, no padding up to the maximum occurs; sequences shorter than the longest in the current minibatch are zero padded and masked.
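The trim-and-pad behaviour of FIXED_LENGTH can be sketched with plain Java arrays (the class and method names here, PadDemo and padOrTrim, are illustrative only and not part of the DL4J API; the real iterator produces ND4J INDArrays):

```java
import java.util.Arrays;

// Sketch of FIXED_LENGTH handling: trim longer token-index sequences
// to a fixed length, zero-pad shorter ones, and build a parallel
// 1/0 feature mask marking which positions hold real tokens.
public class PadDemo {

    // Returns {paddedTokens, mask}, both of length fixedLen.
    public static int[][] padOrTrim(int[] tokens, int fixedLen) {
        int[] out = new int[fixedLen];   // zero padding by default
        int[] mask = new int[fixedLen];  // 1 = real token, 0 = padding
        int copy = Math.min(tokens.length, fixedLen); // trims if longer
        for (int i = 0; i < copy; i++) {
            out[i] = tokens[i];
            mask[i] = 1;
        }
        return new int[][]{out, mask};
    }

    public static void main(String[] args) {
        int[][] r = padOrTrim(new int[]{101, 7592, 102}, 5);
        System.out.println(Arrays.toString(r[0])); // [101, 7592, 102, 0, 0]
        System.out.println(Arrays.toString(r[1])); // [1, 1, 1, 0, 0]
    }
}
```

ANY_LENGTH and CLIP_ONLY differ only in how fixedLen is chosen per minibatch (longest sequence in the batch vs. the smaller of that and the configured maximum).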
BertIterator.FeatureArrays configuration: determines which arrays are included.
INDICES_MASK: Indices array and mask array only, no segment ID array. Returns 1 feature array, 1 feature mask array (plus labels).
INDICES_MASK_SEGMENTID: Indices array, mask array and segment ID array (which is all 0s for single segment tasks). Returns 2 feature arrays (indices, segment ID) and 1 feature mask array (plus labels).
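As a rough illustration of what INDICES_MASK_SEGMENTID produces, consider a minibatch of 2 sequences padded to length 4, shown here as plain int arrays rather than the [minibatch, maxLength] INDArrays the iterator actually returns (the class name FeatureArraysDemo and the token index values are illustrative only):

```java
// Illustrative shapes for the INDICES_MASK_SEGMENTID option,
// minibatch of 2, max length 4. Not the DL4J API.
public class FeatureArraysDemo {
    // Token indices per sequence, zero padded to maxLength.
    public static final int[][] INDICES = {
        {101, 7592, 102, 0},    // shorter sequence + 1 padding position
        {101, 7592, 2088, 102}  // full-length sequence
    };
    // Feature mask: 1 where a real token is present, 0 for padding.
    public static final int[][] MASK = {
        {1, 1, 1, 0},
        {1, 1, 1, 1}
    };
    // Segment IDs: all 0s for single-segment tasks. This third array
    // is only returned with INDICES_MASK_SEGMENTID, not INDICES_MASK.
    public static final int[][] SEGMENT_IDS = new int[2][4];
}
```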
BertIterator.UnsupervisedLabelFormat configuration: only relevant when the task is set to BertIterator.Task.UNSUPERVISED. Determines the format of the labels:
RANK2_IDX: return int32 [minibatch, numTokens] array with entries being class numbers. Example use case: with sparse softmax loss functions.
RANK3_NCL: return float32 [minibatch, numClasses, numTokens] array with 1-hot entries along dimension 1. Example use case: RnnOutputLayer, RnnLossLayer
RANK3_LNC: return float32 [numTokens, minibatch, numClasses] array with 1-hot entries along dimension 2. This format is occasionally used by RNN layers in other libraries, such as TensorFlow.
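The relationship between RANK2_IDX and RANK3_NCL can be sketched with plain Java arrays (the class LabelFormatDemo and helper toOneHotNCL are illustrative only; the iterator itself builds INDArrays):

```java
// Sketch: converting RANK2_IDX-style class-index labels into
// RANK3_NCL-style one-hot labels. Illustrative, not the DL4J API.
public class LabelFormatDemo {

    // idx: [minibatch][numTokens] class indices (RANK2_IDX layout).
    // Returns [minibatch][numClasses][numTokens] one-hot (RANK3_NCL layout).
    public static float[][][] toOneHotNCL(int[][] idx, int numClasses) {
        int mb = idx.length, nTok = idx[0].length;
        float[][][] out = new float[mb][numClasses][nTok];
        for (int m = 0; m < mb; m++)
            for (int t = 0; t < nTok; t++)
                out[m][idx[m][t]][t] = 1.0f; // 1-hot along dimension 1
        return out;
    }

    public static void main(String[] args) {
        int[][] rank2 = {{2, 0, 1}};           // 1 example, 3 tokens
        float[][][] rank3 = toOneHotNCL(rank2, 4);
        System.out.println(rank3[0][2][0]);    // token 0 is class 2 -> 1.0
    }
}
```

RANK3_LNC is the same one-hot encoding with the array dimensions permuted to [numTokens, minibatch, numClasses].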