public static class BertIterator.Builder extends Object
Modifier and Type | Field and Description
---|---
protected String | appendToken
protected BertIterator.FeatureArrays | featureArrays
protected BertIterator.LengthHandling | lengthHandling
protected BertSequenceMasker | masker
protected String | maskToken
protected int | maxTokens
protected int | minibatchSize
protected boolean | padMinibatches
protected String | prependToken
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor | preProcessor
protected LabeledPairSentenceProvider | sentencePairProvider
protected LabeledSentenceProvider | sentenceProvider
protected BertIterator.Task | task
protected TokenizerFactory | tokenizerFactory
protected BertIterator.UnsupervisedLabelFormat | unsupervisedLabelFormat
protected Map<String,Integer> | vocabMap
Constructor and Description
---
Builder()
Modifier and Type | Method and Description
---|---
BertIterator.Builder | appendToken(String appendToken): Append the specified token to the sequences, when doing training on sentence pairs. Generally "[SEP]" is used. No token is appended by default.
BertIterator | build()
BertIterator.Builder | featureArrays(BertIterator.FeatureArrays featureArrays): Specify what arrays should be returned.
BertIterator.Builder | lengthHandling(@NonNull BertIterator.LengthHandling lengthHandling, int maxLength): Specifies how the sequence length of the output data should be handled.
BertIterator.Builder | masker(BertSequenceMasker masker): Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.
BertIterator.Builder | maskToken(String maskToken): Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.
BertIterator.Builder | minibatchSize(int minibatchSize): Minibatch size to use (number of examples to train on for each iteration). See also: padMinibatches.
BertIterator.Builder | padMinibatches(boolean padMinibatches): Default: false (disabled). If the dataset is not an exact multiple of the minibatch size, should we pad the smaller final minibatch? For example, with 100 examples total and a minibatch size of 32, subsequent calls of next() in one epoch return the following numbers of examples: padMinibatches = false (default): 32, 32, 32, 4; padMinibatches = true: 32, 32, 32, 32 (the last minibatch has 4 real examples and 28 masked-out padding examples). Both options should result in exactly the same model.
BertIterator.Builder | prependToken(String prependToken): Prepend the specified token to the sequences, when doing supervised training; i.e., any token sequences will have this added at the start. Some BERT/Transformer models may need this - for example, sequences starting with a "[CLS]" token. No token is prepended by default.
BertIterator.Builder | preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor): Set the preprocessor to be used on the MultiDataSets before returning them.
BertIterator.Builder | sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider): Specify the source of the data for classification on sentence pairs.
BertIterator.Builder | sentenceProvider(LabeledSentenceProvider sentenceProvider): Specify the source of the data for classification.
BertIterator.Builder | task(BertIterator.Task task): Specify the BertIterator.Task the iterator should be set up for.
BertIterator.Builder | tokenizer(TokenizerFactory tokenizerFactory): Specify the TokenizerFactory to use.
BertIterator.Builder | unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat): Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model.
BertIterator.Builder | vocabMap(Map<String,Integer> vocabMap): Provide the vocabulary as a map.
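The padMinibatches behaviour described above is simple arithmetic over the dataset size. The helper below is a hypothetical illustration (not part of the BertIterator API) that computes the minibatch sizes next() would return over one epoch:

```java
import java.util.ArrayList;
import java.util.List;

public class MinibatchSplit {
    // Returns the number of examples in each minibatch over one epoch.
    // With padding enabled, the final partial minibatch is reported at the
    // full minibatch size (the extra slots are masked-out padding examples).
    static List<Integer> batchSizes(int totalExamples, int minibatchSize, boolean padMinibatches) {
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = totalExamples; remaining > 0; remaining -= minibatchSize) {
            int actual = Math.min(remaining, minibatchSize);
            sizes.add(padMinibatches ? minibatchSize : actual);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 100 examples, minibatch size 32, as in the example above:
        System.out.println(batchSizes(100, 32, false)); // [32, 32, 32, 4]
        System.out.println(batchSizes(100, 32, true));  // [32, 32, 32, 32]
    }
}
```

Either way, the same 100 real examples are seen once per epoch, which is why both settings should produce the same model.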
protected BertIterator.Task task
protected TokenizerFactory tokenizerFactory
protected BertIterator.LengthHandling lengthHandling
protected int maxTokens
protected int minibatchSize
protected boolean padMinibatches
protected org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor
protected LabeledSentenceProvider sentenceProvider
protected LabeledPairSentenceProvider sentencePairProvider
protected BertIterator.FeatureArrays featureArrays
protected BertSequenceMasker masker
protected BertIterator.UnsupervisedLabelFormat unsupervisedLabelFormat
protected String maskToken
protected String prependToken
protected String appendToken
public BertIterator.Builder task(BertIterator.Task task)
Specify the BertIterator.Task the iterator should be set up for. See BertIterator for more details.

public BertIterator.Builder tokenizer(TokenizerFactory tokenizerFactory)
Specify the TokenizerFactory to use. Typically a BertWordPieceTokenizerFactory is used.

public BertIterator.Builder lengthHandling(@NonNull BertIterator.LengthHandling lengthHandling, int maxLength)
Specifies how the sequence length of the output data should be handled. See BertIterator for more details.
lengthHandling - Length handling
maxLength - Not used if LengthHandling is set to BertIterator.LengthHandling.ANY_LENGTH

public BertIterator.Builder minibatchSize(int minibatchSize)
Minibatch size to use (number of examples to train on for each iteration). See also: padMinibatches.
minibatchSize - Minibatch size

public BertIterator.Builder padMinibatches(boolean padMinibatches)

public BertIterator.Builder preProcessor(org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor preProcessor)

public BertIterator.Builder sentenceProvider(LabeledSentenceProvider sentenceProvider)

public BertIterator.Builder sentencePairProvider(LabeledPairSentenceProvider sentencePairProvider)

public BertIterator.Builder featureArrays(BertIterator.FeatureArrays featureArrays)
Specify what arrays should be returned. See BertIterator for more details.

public BertIterator.Builder vocabMap(Map<String,Integer> vocabMap)
Provide the vocabulary as a map. For a BertWordPieceTokenizerFactory, this can be obtained using BertWordPieceTokenizerFactory.getVocab().

public BertIterator.Builder masker(BertSequenceMasker masker)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This can be used to customize how the masking is performed. Default: BertMaskedLMMasker.

public BertIterator.Builder unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat labelFormat)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. Used to specify the format that the labels should be returned in. See BertIterator for more details.

public BertIterator.Builder maskToken(String maskToken)
Used only for unsupervised training (i.e., when task is set to BertIterator.Task.UNSUPERVISED) for learning a masked language model. This specifies the token (such as "[MASK]") that should be used when a value is masked out. Note that this is passed to the BertSequenceMasker defined by masker(BertSequenceMasker), hence the exact behaviour will depend on what masker is used. This token must be present in the vocabulary map set in vocabMap.

public BertIterator.Builder prependToken(String prependToken)
prependToken - The token to start each sequence with (null: no token will be prepended)

public BertIterator.Builder appendToken(String appendToken)
appendToken - Token at end of each sentence for pairs of sentences (null: no token will be appended)

public BertIterator build()
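Putting the builder methods together, a supervised classification setup might look like the sketch below. This is an illustrative example, not a definitive recipe: it assumes the deeplearning4j-nlp dependency is on the classpath and that a WordPiece vocabulary file ("vocab.txt" here) is available; the sample sentences and labels are placeholders.

```java
import org.deeplearning4j.iterator.BertIterator;
import org.deeplearning4j.iterator.provider.CollectionLabeledSentenceProvider;
import org.deeplearning4j.text.tokenization.tokenizerfactory.BertWordPieceTokenizerFactory;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BertIteratorExample {
    public static void main(String[] args) throws Exception {
        // Tokenizer backed by a WordPiece vocabulary file (path is a placeholder)
        BertWordPieceTokenizerFactory tf = new BertWordPieceTokenizerFactory(
                new File("vocab.txt"), true, true, StandardCharsets.UTF_8);

        BertIterator iter = BertIterator.builder()
                .tokenizer(tf)
                .vocabMap(tf.getVocab())            // token -> index map from the tokenizer
                .task(BertIterator.Task.SEQ_CLASSIFICATION)
                .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128)
                .minibatchSize(32)
                .padMinibatches(true)               // pad the final partial minibatch
                .featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
                .sentenceProvider(new CollectionLabeledSentenceProvider(
                        Arrays.asList("example sentence"), Arrays.asList("positive")))
                .build();
    }
}
```

For unsupervised masked-LM training, one would instead set task(Task.UNSUPERVISED) together with masker(...), maskToken("[MASK]"), and unsupervisedLabelFormat(...), as described in the method documentation above.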
Copyright © 2022. All rights reserved.