Package org.deeplearning4j.iterator
Class BertIterator.Builder
- java.lang.Object
-
- org.deeplearning4j.iterator.BertIterator.Builder
-
- Enclosing class:
- BertIterator
public static class BertIterator.Builder extends Object
-
-
Field Summary
Fields
Modifier and Type - Field
protected String appendToken
protected BertIterator.FeatureArrays featureArrays
protected BertIterator.LengthHandling lengthHandling
protected BertSequenceMasker masker
protected String maskToken
protected int maxTokens
protected int minibatchSize
protected boolean padMinibatches
protected String prependToken
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor
protected LabeledPairSentenceProvider sentencePairProvider
protected LabeledSentenceProvider sentenceProvider
protected BertIterator.Task task
protected TokenizerFactory tokenizerFactory
protected BertIterator.UnsupervisedLabelFormat unsupervisedLabelFormat
protected Map<String,Integer> vocabMap
-
Constructor Summary
Constructors
Builder()
-
Method Summary
All Methods  Instance Methods  Concrete Methods
Modifier and Type - Method and Description

BertIterator.Builder appendToken(String appendToken)
Append the specified token to the sequences when training on sentence pairs. Generally "[SEP]" is used. No token is appended by default.

BertIterator build()

BertIterator.Builder featureArrays(BertIterator.FeatureArrays featureArrays)
Specify which arrays should be returned.

BertIterator.Builder lengthHandling(@NonNull BertIterator.LengthHandling lengthHandling, int maxLength)
Specifies how the sequence length of the output data should be handled.

BertIterator.Builder masker(BertSequenceMasker masker)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.

BertIterator.Builder maskToken(String maskToken)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.

BertIterator.Builder minibatchSize(int minibatchSize)
Minibatch size to use (the number of examples to train on for each iteration). See also: padMinibatches(boolean).

BertIterator.Builder padMinibatches(boolean padMinibatches)
Default: false (disabled). If the dataset is not an exact multiple of the minibatch size, should we pad the smaller final minibatch? For example, with 100 examples total and a minibatch size of 32, subsequent calls of next() in one epoch return: padMinibatches = false (default): 32, 32, 32, 4; padMinibatches = true: 32, 32, 32, 32 (the last minibatch has 4 real examples and 28 masked-out padding examples). Both options should result in exactly the same model.

BertIterator.Builder prependToken(String prependToken)
Prepend the specified token to the sequences when doing supervised training; i.e., any token sequences will have this added at the start. Some BERT/Transformer models may need this, for example sequences starting with a "[CLS]" token. No token is prepended by default.

BertIterator.Builder preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor)
Set the preprocessor to be used on the MultiDataSets before returning them.

BertIterator.Builder sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider)
Specify the source of the data for classification on sentence pairs.

BertIterator.Builder sentenceProvider(LabeledSentenceProvider sentenceProvider)
Specify the source of the data for classification.

BertIterator.Builder task(BertIterator.Task task)
Specify the BertIterator.Task the iterator should be set up for.

BertIterator.Builder tokenizer(TokenizerFactory tokenizerFactory)
Specify the TokenizerFactory to use.

BertIterator.Builder unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. Used to specify the format that the labels should be returned in.

BertIterator.Builder vocabMap(Map<String,Integer> vocabMap)
Provide the vocabulary as a map.
-
-
-
Field Detail
-
task
protected BertIterator.Task task
-
tokenizerFactory
protected TokenizerFactory tokenizerFactory
-
lengthHandling
protected BertIterator.LengthHandling lengthHandling
-
maxTokens
protected int maxTokens
-
minibatchSize
protected int minibatchSize
-
padMinibatches
protected boolean padMinibatches
-
preProcessor
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor
-
sentenceProvider
protected LabeledSentenceProvider sentenceProvider
-
sentencePairProvider
protected LabeledPairSentenceProvider sentencePairProvider
-
featureArrays
protected BertIterator.FeatureArrays featureArrays
-
masker
protected BertSequenceMasker masker
-
unsupervisedLabelFormat
protected BertIterator.UnsupervisedLabelFormat unsupervisedLabelFormat
-
maskToken
protected String maskToken
-
prependToken
protected String prependToken
-
appendToken
protected String appendToken
-
-
Method Detail
-
task
public BertIterator.Builder task(BertIterator.Task task)
Specify the BertIterator.Task the iterator should be set up for. See BertIterator for more details.
-
tokenizer
public BertIterator.Builder tokenizer(TokenizerFactory tokenizerFactory)
Specify the TokenizerFactory to use. For BERT, BertWordPieceTokenizerFactory is typically used.
-
lengthHandling
public BertIterator.Builder lengthHandling(@NonNull BertIterator.LengthHandling lengthHandling, int maxLength)
Specifies how the sequence length of the output data should be handled. See BertIterator for more details.
Parameters:
lengthHandling - Length handling
maxLength - Not used if lengthHandling is set to BertIterator.LengthHandling.ANY_LENGTH
-
minibatchSize
public BertIterator.Builder minibatchSize(int minibatchSize)
Minibatch size to use (the number of examples to train on for each iteration). See also: padMinibatches(boolean)
Parameters:
minibatchSize - Minibatch size
-
padMinibatches
public BertIterator.Builder padMinibatches(boolean padMinibatches)
Default: false (disabled)
If the dataset is not an exact multiple of the minibatch size, should we pad the smaller final minibatch?
For example, if we have 100 examples total, and 32 minibatch size, the following number of examples will be returned for subsequent calls of next() in the one epoch:
padMinibatches = false (default): 32, 32, 32, 4.
padMinibatches = true: 32, 32, 32, 32 (note: the last minibatch will have 4 real examples, and 28 masked out padding examples).
Both options should result in exactly the same model. However, some BERT implementations may require an exact number of examples in all minibatches to function.
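The minibatch counts described above can be sketched in plain Java. This helper is purely illustrative (it is not part of the BertIterator API); it computes the number of examples yielded by each next() call in one epoch, with and without padding:

```java
import java.util.ArrayList;
import java.util.List;

public class MinibatchSizes {
    // Number of examples yielded by each next() call in one epoch, for a
    // dataset of totalExamples. With padMinibatches = true, the final partial
    // minibatch is padded up to the full minibatch size.
    static List<Integer> sizes(int totalExamples, int minibatchSize, boolean padMinibatches) {
        List<Integer> out = new ArrayList<>();
        for (int remaining = totalExamples; remaining > 0; remaining -= minibatchSize) {
            out.add(padMinibatches ? minibatchSize : Math.min(minibatchSize, remaining));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sizes(100, 32, false)); // [32, 32, 32, 4]
        System.out.println(sizes(100, 32, true));  // [32, 32, 32, 32]
    }
}
```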
-
preProcessor
public BertIterator.Builder preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor)
Set the preprocessor to be used on the MultiDataSets before returning them. Default: none (null)
-
sentenceProvider
public BertIterator.Builder sentenceProvider(LabeledSentenceProvider sentenceProvider)
Specify the source of the data for classification.
-
sentencePairProvider
public BertIterator.Builder sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider)
Specify the source of the data for classification on sentence pairs.
-
featureArrays
public BertIterator.Builder featureArrays(BertIterator.FeatureArrays featureArrays)
Specify which arrays should be returned. See BertIterator for more details.
-
vocabMap
public BertIterator.Builder vocabMap(Map<String,Integer> vocabMap)
Provide the vocabulary as a map. Keys are the words in the vocabulary, and values are the indices of those words. Indices should be in the range 0 to vocabMap.size()-1, inclusive.
If using BertWordPieceTokenizerFactory, this can be obtained using BertWordPieceTokenizerFactory.getVocab()
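To illustrate the expected shape of the map (contiguous indices 0..size-1), here is a small hand-built vocabulary. This is a sketch only; in practice you would use the map produced by your tokenizer factory rather than building one by hand, and the token strings below are just example WordPiece-style entries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VocabMapExample {
    // Builds a vocabulary map whose values are contiguous indices 0..n-1,
    // the form vocabMap(Map) expects. Illustrative only.
    static Map<String, Integer> vocab(String... tokens) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (String t : tokens) {
            map.put(t, map.size()); // index = insertion order
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> v = vocab("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "cat");
        System.out.println(v.get("[MASK]")); // 4
    }
}
```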
-
masker
public BertIterator.Builder masker(BertSequenceMasker masker)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This can be used to customize how the masking is performed.
Default: BertMaskedLMMasker
-
unsupervisedLabelFormat
public BertIterator.Builder unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. Used to specify the format that the labels should be returned in. See BertIterator for more details.
-
maskToken
public BertIterator.Builder maskToken(String maskToken)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This specifies the token (such as "[MASK]") that should be used when a value is masked out. Note that this is passed to the BertSequenceMasker defined by masker(BertSequenceMasker), hence the exact behaviour will depend on what masker is used.
Note that this token must be in the vocabulary map set in vocabMap(Map)
-
prependToken
public BertIterator.Builder prependToken(String prependToken)
Prepend the specified token to the sequences, when doing supervised training.
i.e., any token sequences will have this added at the start.
Some BERT/Transformer models may need this - for example sequences starting with a "[CLS]" token.
No token is prepended by default.
Parameters:
prependToken
- The token to start each sequence with (null: no token will be prepended)
-
appendToken
public BertIterator.Builder appendToken(String appendToken)
Append the specified token to the sequences when doing training on sentence pairs.
Generally "[SEP]" is used. No token is appended by default.
Parameters:
appendToken - Token at the end of each sentence for pairs of sentences (null: no token will be appended)
-
build
public BertIterator build()
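As a configuration sketch, the builder methods above might be wired together for supervised sentence classification roughly as follows. This is not verified against any particular DL4J version: the enum constants (Task.SEQ_CLASSIFICATION, LengthHandling.FIXED_LENGTH, FeatureArrays.INDICES_MASK), the BertWordPieceTokenizerFactory constructor, and the placeholder names vocabFile and mySentenceProvider are assumptions to be checked against the Javadoc for your version.

```java
// Hypothetical configuration sketch, not a verified example.
// Enum constants and the tokenizer-factory constructor are assumed from
// typical DL4J usage; vocabFile and mySentenceProvider are placeholders.
BertWordPieceTokenizerFactory tf =
        new BertWordPieceTokenizerFactory(vocabFile, true, true, StandardCharsets.UTF_8);

BertIterator iter = new BertIterator.Builder()
        .tokenizer(tf)
        .vocabMap(tf.getVocab())                 // vocabulary from the tokenizer factory
        .task(BertIterator.Task.SEQ_CLASSIFICATION)
        .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128)
        .minibatchSize(32)
        .padMinibatches(true)                    // pad the final partial minibatch
        .sentenceProvider(mySentenceProvider)    // a LabeledSentenceProvider over your data
        .featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
        .build();
```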
-
-