Package org.deeplearning4j.iterator
Class BertIterator.Builder
- java.lang.Object
-
- org.deeplearning4j.iterator.BertIterator.Builder
-
- Enclosing class:
- BertIterator
public static class BertIterator.Builder extends Object
-
-
Field Summary
Fields
protected String appendToken
protected BertIterator.FeatureArrays featureArrays
protected BertIterator.LengthHandling lengthHandling
protected BertSequenceMasker masker
protected String maskToken
protected int maxTokens
protected int minibatchSize
protected boolean padMinibatches
protected String prependToken
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor
protected LabeledPairSentenceProvider sentencePairProvider
protected LabeledSentenceProvider sentenceProvider
protected BertIterator.Task task
protected TokenizerFactory tokenizerFactory
protected BertIterator.UnsupervisedLabelFormat unsupervisedLabelFormat
protected Map<String,Integer> vocabMap
-
Constructor Summary
Constructors
Builder()
-
Method Summary
All Methods, Instance Methods, Concrete Methods

BertIterator.Builder appendToken(String appendToken)
Append the specified token to the sequences, when doing training on sentence pairs. Generally "[SEP]" is used. No token is appended by default.

BertIterator build()

BertIterator.Builder featureArrays(BertIterator.FeatureArrays featureArrays)
Specify what arrays should be returned.

BertIterator.Builder lengthHandling(BertIterator.LengthHandling lengthHandling, int maxLength)
Specifies how the sequence length of the output data should be handled.

BertIterator.Builder masker(BertSequenceMasker masker)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.

BertIterator.Builder maskToken(String maskToken)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.

BertIterator.Builder minibatchSize(int minibatchSize)
Minibatch size to use (number of examples to train on for each iteration). See also: padMinibatches.

BertIterator.Builder padMinibatches(boolean padMinibatches)
Default: false (disabled). If the dataset is not an exact multiple of the minibatch size, should we pad the smaller final minibatch?

BertIterator.Builder prependToken(String prependToken)
Prepend the specified token to the sequences, when doing supervised training. No token is prepended by default.

BertIterator.Builder preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor)
Set the preprocessor to be used on the MultiDataSets before returning them.

BertIterator.Builder sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider)
Specify the source of the data for classification on sentence pairs.

BertIterator.Builder sentenceProvider(LabeledSentenceProvider sentenceProvider)
Specify the source of the data for classification.

BertIterator.Builder task(BertIterator.Task task)
Specify the BertIterator.Task the iterator should be set up for.

BertIterator.Builder tokenizer(TokenizerFactory tokenizerFactory)
Specify the TokenizerFactory to use. For BERT, typically BertWordPieceTokenizerFactory is used.

BertIterator.Builder unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat)
Used only for unsupervised training. Specifies the format that the labels should be returned in.

BertIterator.Builder vocabMap(Map<String,Integer> vocabMap)
Provide the vocabulary as a map.
-
-
-
Field Detail
-
task
protected BertIterator.Task task
-
tokenizerFactory
protected TokenizerFactory tokenizerFactory
-
lengthHandling
protected BertIterator.LengthHandling lengthHandling
-
maxTokens
protected int maxTokens
-
minibatchSize
protected int minibatchSize
-
padMinibatches
protected boolean padMinibatches
-
preProcessor
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor
-
sentenceProvider
protected LabeledSentenceProvider sentenceProvider
-
sentencePairProvider
protected LabeledPairSentenceProvider sentencePairProvider
-
featureArrays
protected BertIterator.FeatureArrays featureArrays
-
masker
protected BertSequenceMasker masker
-
unsupervisedLabelFormat
protected BertIterator.UnsupervisedLabelFormat unsupervisedLabelFormat
-
maskToken
protected String maskToken
-
prependToken
protected String prependToken
-
appendToken
protected String appendToken
-
-
Method Detail
-
task
public BertIterator.Builder task(BertIterator.Task task)
Specify the BertIterator.Task the iterator should be set up for. See BertIterator for more details.
-
tokenizer
public BertIterator.Builder tokenizer(TokenizerFactory tokenizerFactory)
Specify the TokenizerFactory to use. For BERT, typically BertWordPieceTokenizerFactory is used.
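For example, a minimal sketch of constructing the tokenizer factory. The vocab file path is a placeholder, and the four-argument constructor shown (vocab file, lower-case flag, strip-accents flag, charset) is assumed from recent DL4J versions; verify against the version in use:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;

    // "vocab.txt" is a placeholder path to a BERT WordPiece vocabulary file
    File vocabFile = new File("vocab.txt");
    // Arguments: vocab file, convert to lower case, strip accents, charset
    BertWordPieceTokenizerFactory tf =
            new BertWordPieceTokenizerFactory(vocabFile, true, true, StandardCharsets.UTF_8);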
-
lengthHandling
public BertIterator.Builder lengthHandling(@NonNull BertIterator.LengthHandling lengthHandling, int maxLength)
Specifies how the sequence length of the output data should be handled. See BertIterator for more details.
Parameters:
lengthHandling - Length handling
maxLength - Not used if LengthHandling is set to BertIterator.LengthHandling.ANY_LENGTH
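As a sketch, fixing every sequence to exactly 128 tokens. FIXED_LENGTH is assumed to be one of the BertIterator.LengthHandling constants, and 128 is an arbitrary illustrative value:

    // Truncate longer sequences and pad shorter ones to exactly 128 tokens
    BertIterator.Builder b = new BertIterator.Builder()
            .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128);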
-
minibatchSize
public BertIterator.Builder minibatchSize(int minibatchSize)
Minibatch size to use (number of examples to train on for each iteration). See also: padMinibatches
Parameters:
minibatchSize - Minibatch size
-
padMinibatches
public BertIterator.Builder padMinibatches(boolean padMinibatches)
Default: false (disabled)
If the dataset is not an exact multiple of the minibatch size, should we pad the smaller final minibatch?
For example, if we have 100 examples total, and a minibatch size of 32, the following numbers of examples will be returned by subsequent calls of next() in one epoch:
padMinibatches = false (default): 32, 32, 32, 4.
padMinibatches = true: 32, 32, 32, 32 (note: the last minibatch will have 4 real examples and 28 masked-out padding examples).
Both options should result in exactly the same model. However, some BERT implementations may require all minibatches to be the same size to function.
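The batch arithmetic from the example above, as a sketch in plain Java:

    int totalExamples = 100, minibatch = 32;
    int fullBatches = totalExamples / minibatch;          // 3 full batches of 32
    int remainder   = totalExamples % minibatch;          // 4 examples left over
    // padMinibatches = false: next() yields 32, 32, 32, 4
    // padMinibatches = true:  next() yields 32, 32, 32, 32 (28 padded slots in the last)
    int paddedSlots = (remainder == 0) ? 0 : minibatch - remainder;   // 28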
-
preProcessor
public BertIterator.Builder preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor)
Set the preprocessor to be used on the MultiDataSets before returning them. Default: none (null)
-
sentenceProvider
public BertIterator.Builder sentenceProvider(LabeledSentenceProvider sentenceProvider)
Specify the source of the data for classification.
-
sentencePairProvider
public BertIterator.Builder sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider)
Specify the source of the data for classification on sentence pairs.
-
featureArrays
public BertIterator.Builder featureArrays(BertIterator.FeatureArrays featureArrays)
Specify what arrays should be returned. See BertIterator for more details.
-
vocabMap
public BertIterator.Builder vocabMap(Map<String,Integer> vocabMap)
Provide the vocabulary as a map. Keys are the words in the vocabulary, and values are the indices of those words. Indices should be in the range 0 to vocabMap.size()-1, inclusive.
If using BertWordPieceTokenizerFactory, this can be obtained using BertWordPieceTokenizerFactory.getVocab()
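For example, a sketch that wires the tokenizer factory's vocabulary into the builder, reusing the tf instance from the tokenizer example above:

    // getVocab() returns the word -> index map, with indices 0..size()-1
    Map<String,Integer> vocab = tf.getVocab();
    BertIterator.Builder b = new BertIterator.Builder()
            .tokenizer(tf)
            .vocabMap(vocab);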
-
masker
public BertIterator.Builder masker(BertSequenceMasker masker)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This can be used to customize how the masking is performed.
Default: BertMaskedLMMasker
-
unsupervisedLabelFormat
public BertIterator.Builder unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. Used to specify the format that the labels should be returned in. See BertIterator for more details.
-
maskToken
public BertIterator.Builder maskToken(String maskToken)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This specifies the token (such as "[MASK]") that should be used when a value is masked out. Note that this token is passed to the BertSequenceMasker defined by masker(BertSequenceMasker), hence the exact behaviour will depend on what masker is used.
Note that this token must be in the vocabulary map set in vocabMap
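Putting the unsupervised settings together, a sketch. The UnsupervisedLabelFormat constant shown is an assumption based on recent DL4J versions, and "[MASK]" must be present in the vocabulary map:

    BertIterator.Builder b = new BertIterator.Builder()
            .task(BertIterator.Task.UNSUPERVISED)
            .masker(new BertMaskedLMMasker())     // the default masker
            .maskToken("[MASK]")                  // must exist in the vocab map
            .unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat.RANK2_IDX);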
-
prependToken
public BertIterator.Builder prependToken(String prependToken)
Prepend the specified token to the sequences, when doing supervised training.
i.e., any token sequences will have this added at the start.
Some BERT/Transformer models may need this - for example sequences starting with a "[CLS]" token.
No token is prepended by default.
Parameters:
prependToken - The token to start each sequence with (null: no token will be prepended)
-
appendToken
public BertIterator.Builder appendToken(String appendToken)
Append the specified token to the sequences, when doing training on sentence pairs.
Generally "[SEP]" is used No token in appended by default.- Parameters:
appendToken- Token at end of each sentence for pairs of sentences (null: no token will be appended)- Returns:
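A sketch of the standard BERT special tokens for sentence-pair training, using the conventions described above:

    BertIterator.Builder b = new BertIterator.Builder()
            .prependToken("[CLS]")   // added at the start of every sequence
            .appendToken("[SEP]");   // added at the end of each sentence in a pair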
-
build
public BertIterator build()
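Finally, an end-to-end sketch for supervised sequence classification. The vocab file path and toy data are placeholders, and CollectionLabeledSentenceProvider plus the enum constants used (SEQ_CLASSIFICATION, FIXED_LENGTH, INDICES_MASK) are assumptions based on recent DL4J versions:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import org.deeplearning4j.iterator.BertIterator;
    import org.deeplearning4j.iterator.provider.CollectionLabeledSentenceProvider;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;
    import org.nd4j.linalg.dataset.api.MultiDataSet;

    public class BertIteratorExample {
        public static void main(String[] args) throws Exception {
            // Placeholder vocab path; point at a real BERT WordPiece vocab file
            BertWordPieceTokenizerFactory tf = new BertWordPieceTokenizerFactory(
                    new File("vocab.txt"), true, true, StandardCharsets.UTF_8);

            // Toy two-class dataset
            CollectionLabeledSentenceProvider provider = new CollectionLabeledSentenceProvider(
                    Arrays.asList("great movie", "terrible movie"),
                    Arrays.asList("positive", "negative"));

            BertIterator iter = new BertIterator.Builder()
                    .tokenizer(tf)
                    .vocabMap(tf.getVocab())
                    .task(BertIterator.Task.SEQ_CLASSIFICATION)
                    .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 16)
                    .minibatchSize(2)
                    .sentenceProvider(provider)
                    .featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
                    .build();

            MultiDataSet mds = iter.next();   // features: token indices + mask array
        }
    }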
-
-