BertIterator is a MultiDataSetIterator for training BERT (Transformer) models in the following ways:
(a) Unsupervised - Masked language model task (no sentence matching task is implemented thus far)
(b) Supervised - For sequence classification (i.e., 1 label per sequence, typically used for fine tuning)
The task can be specified using BertIterator.Task.
Example for unsupervised training:
BertWordPieceTokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab);
BertIterator b = BertIterator.builder()
.tokenizer(t)
.lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 16)
.minibatchSize(2)
.sentenceProvider(<sentence provider here>)
.featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
.vocabMap(t.getVocab())
.task(BertIterator.Task.UNSUPERVISED)
.masker(new BertMaskedLMMasker(new Random(12345), 0.2, 0.5, 0.5))
.unsupervisedLabelFormat(BertIterator.UnsupervisedLabelFormat.RANK2_IDX)
.maskToken("[MASK]")
.build();
Example for supervised (sequence classification - one label per sequence) training:
BertWordPieceTokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab);
BertIterator b = BertIterator.builder()
.tokenizer(t)
.lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 16)
.minibatchSize(2)
.sentenceProvider(new TestSentenceProvider())
.featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
.vocabMap(t.getVocab())
.task(BertIterator.Task.SEQ_CLASSIFICATION)
.build();
This iterator supports numerous ways of configuring the behaviour with respect to the sequence lengths and data layout.
BertIterator.LengthHandling configuration: determines how variable-length sequences are handled.
FIXED_LENGTH: Always trim longer sequences to the specified length, and always pad shorter sequences to the specified length.
ANY_LENGTH: Output length is determined by the length of the longest sequence in the minibatch. Shorter sequences within the
minibatch are zero padded and masked.
CLIP_ONLY: Sequences longer than the specified maximum are clipped to that length. If the longest sequence in a minibatch is shorter than the specified maximum, no padding up to the maximum occurs; sequences shorter than the longest in the current minibatch are zero padded and masked.
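The trim-and-pad behaviour of FIXED_LENGTH can be sketched with plain Java arrays (the class and method names here, PadDemo and padOrTrim, are illustrative only and not part of the DL4J API; the real iterator produces ND4J INDArrays):

```java
import java.util.Arrays;

// Sketch of FIXED_LENGTH handling: trim longer token-index sequences
// to a fixed length, zero-pad shorter ones, and build a parallel
// 1/0 feature mask marking which positions hold real tokens.
public class PadDemo {

    // Returns {paddedTokens, mask}, both of length fixedLen.
    public static int[][] padOrTrim(int[] tokens, int fixedLen) {
        int[] out = new int[fixedLen];   // zero padding by default
        int[] mask = new int[fixedLen];  // 1 = real token, 0 = padding
        int copy = Math.min(tokens.length, fixedLen); // trims if longer
        for (int i = 0; i < copy; i++) {
            out[i] = tokens[i];
            mask[i] = 1;
        }
        return new int[][]{out, mask};
    }

    public static void main(String[] args) {
        int[][] r = padOrTrim(new int[]{101, 7592, 102}, 5);
        System.out.println(Arrays.toString(r[0])); // [101, 7592, 102, 0, 0]
        System.out.println(Arrays.toString(r[1])); // [1, 1, 1, 0, 0]
    }
}
```

ANY_LENGTH and CLIP_ONLY differ only in how fixedLen is chosen per minibatch (longest sequence in the batch vs. the smaller of that and the configured maximum).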
BertIterator.FeatureArrays configuration: determines which arrays are included.
INDICES_MASK: Indices array and mask array only, no segment ID array. Returns 1 feature array, 1 feature mask array (plus labels).
INDICES_MASK_SEGMENTID: Indices array, mask array and segment ID array (which is all 0s for single segment tasks). Returns 2 feature arrays (indices, segment ID) and 1 feature mask array (plus labels).
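As a rough illustration of what INDICES_MASK_SEGMENTID produces, consider a minibatch of 2 sequences padded to length 4, shown here as plain int arrays rather than the [minibatch, maxLength] INDArrays the iterator actually returns (the class name FeatureArraysDemo and the token index values are illustrative only):

```java
// Illustrative shapes for the INDICES_MASK_SEGMENTID option,
// minibatch of 2, max length 4. Not the DL4J API.
public class FeatureArraysDemo {
    // Token indices per sequence, zero padded to maxLength.
    public static final int[][] INDICES = {
        {101, 7592, 102, 0},    // shorter sequence + 1 padding position
        {101, 7592, 2088, 102}  // full-length sequence
    };
    // Feature mask: 1 where a real token is present, 0 for padding.
    public static final int[][] MASK = {
        {1, 1, 1, 0},
        {1, 1, 1, 1}
    };
    // Segment IDs: all 0s for single-segment tasks. This third array
    // is only returned with INDICES_MASK_SEGMENTID, not INDICES_MASK.
    public static final int[][] SEGMENT_IDS = new int[2][4];
}
```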
BertIterator.UnsupervisedLabelFormat configuration: only relevant when the task is set to BertIterator.Task.UNSUPERVISED. Determines the format of the labels:
RANK2_IDX: return int32 [minibatch, numTokens] array with entries being class numbers. Example use case: with sparse softmax loss functions.
RANK3_NCL: return float32 [minibatch, numClasses, numTokens] array with 1-hot entries along dimension 1. Example use case: RnnOutputLayer, RnnLossLayer
RANK3_LNC: return float32 [numTokens, minibatch, numClasses] array with 1-hot entries along dimension 2. This format is occasionally used by RNN layers in other libraries, such as TensorFlow.
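The relationship between RANK2_IDX and RANK3_NCL can be sketched with plain Java arrays (the class LabelFormatDemo and helper toOneHotNCL are illustrative only; the iterator itself builds INDArrays):

```java
// Sketch: converting RANK2_IDX-style class-index labels into
// RANK3_NCL-style one-hot labels. Illustrative, not the DL4J API.
public class LabelFormatDemo {

    // idx: [minibatch][numTokens] class indices (RANK2_IDX layout).
    // Returns [minibatch][numClasses][numTokens] one-hot (RANK3_NCL layout).
    public static float[][][] toOneHotNCL(int[][] idx, int numClasses) {
        int mb = idx.length, nTok = idx[0].length;
        float[][][] out = new float[mb][numClasses][nTok];
        for (int m = 0; m < mb; m++)
            for (int t = 0; t < nTok; t++)
                out[m][idx[m][t]][t] = 1.0f; // 1-hot along dimension 1
        return out;
    }

    public static void main(String[] args) {
        int[][] rank2 = {{2, 0, 1}};           // 1 example, 3 tokens
        float[][][] rank3 = toOneHotNCL(rank2, 4);
        System.out.println(rank3[0][2][0]);    // token 0 is class 2 -> 1.0
    }
}
```

RANK3_LNC is the same one-hot encoding with the array dimensions permuted to [numTokens, minibatch, numClasses].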