lamp.nn.TransformerEncoderBlock
See the TransformerEncoderBlock companion object
case class TransformerEncoderBlock(attention: MultiheadAttention, layerNorm1: LayerNorm, layerNorm2: LayerNorm, w1: Constant, b1: Constant, w2: Constant, b2: Constant, scale1: Constant, scale2: Constant, dropout: Double, train: Boolean, gptOrder: Boolean) extends GenericModule[(Variable, Option[STen]), Variable]
A single block of the transformer self-attention encoder using GELU.
Input is (data, maxLength), where data is a (batch, sequence, input dimension) double tensor and maxLength is a 1D or 2D long tensor used for attention masking.
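For intuition, the plain-Scala sketch below shows one common way a 1D length vector is read as a per-batch key mask, where positions at or beyond the given length are excluded from attention. This is only an illustration of the masking idea; the names and the exact interpretation of 1D versus 2D maxLength here are assumptions, and the actual semantics live in MultiheadAttention.

// Illustrative only: a plausible reading of a 1D maxLength vector as a
// per-batch key mask. Not lamp's code.
def lengthsToMask(maxLength: Array[Long], sequenceLength: Int): Array[Array[Boolean]] =
  maxLength.map { len =>
    Array.tabulate(sequenceLength)(pos => pos < len) // true = attended, false = masked
  }

// Batch of 2 sequences padded to length 5, with true lengths 3 and 5:
val mask = lengthsToMask(Array(3L, 5L), sequenceLength = 5)
// mask(0): Array(true, true, true, false, false)
// mask(1): Array(true, true, true, true, true)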
The order of operations depends on the gptOrder parameter; both orderings are sketched in the code after the output description below. If gptOrder is true then:
- y = attention(norm(input)) + input
- result = mlp(norm(y)) + y
- Note that in this case there is no normalization at the end of the transformer. One may want to add one separately. This is how GPT-2 is defined in Hugging Face or in nanoGPT.
- Note that the residual connection has a path which does not flow through the normalization.
- A dimension-wise learnable scale parameter (scale1, scale2) is applied in each residual path.
If gptOrder is false then:
- y = norm(attention(input) + input)
- result = norm(mlp(y) + y)
- This follows chapter 11.7 in d2l.ai v1.0.0-beta0 (the same ordering as in https://arxiv.org/pdf/1706.03762.pdf).
- Note that the residual connection has a path which flows through the normalization.
Output is (batch, sequence, output dimension).
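The two orderings above can be summarized with the sketch below. It uses plain Scala functions as stand-ins for the block's submodules; attention, mlp, norm1 and norm2 are hypothetical placeholders rather than lamp's API, and the learnable per-dimension scales (scale1, scale2) on the residual paths are omitted for brevity.

// Schematic of the two residual orderings with generic stand-ins for the
// block's submodules. Illustration only, not lamp's implementation.
def encoderBlock[T](
    attention: T => T,
    mlp: T => T,
    norm1: T => T,
    norm2: T => T,
    add: (T, T) => T,
    gptOrder: Boolean
)(input: T): T =
  if (gptOrder) {
    // pre-norm ("GPT-2") ordering: the residual path bypasses the normalization
    // and there is no normalization after the second residual sum
    val y = add(attention(norm1(input)), input)
    add(mlp(norm2(y)), y)
  } else {
    // post-norm ordering (as in https://arxiv.org/pdf/1706.03762.pdf):
    // the residual sum flows through the normalization
    val y = norm1(add(attention(input), input))
    norm2(add(mlp(y), y))
  }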
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any