TransformerEncoderBlock

lamp.nn.TransformerEncoderBlock
See the TransformerEncoderBlock companion object
case class TransformerEncoderBlock(attention: MultiheadAttention, layerNorm1: LayerNorm, layerNorm2: LayerNorm, w1: Constant, b1: Constant, w2: Constant, b2: Constant, scale1: Constant, scale2: Constant, dropout: Double, train: Boolean, gptOrder: Boolean) extends GenericModule[(Variable, Option[STen]), Variable]

A single block of the transformer self-attention encoder, using the GELU activation.

Input is (data, maxLength), where data is a (batch, sequence, input dimension) double tensor and maxLength is a 1D or 2D long tensor used for attention masking.

The order of operations depends on the gptOrder parameter. If gptOrder is true, then:

  • y = attention(norm(input)) + input
  • result = mlp(norm(y)) + y
  • Note that in this case there is no normalization at the end of the transformer; one may want to add one separately. This is how GPT-2 is defined in Hugging Face or nanoGPT.
  • Note that the residual connection has a path which does not flow through the normalization (see the pseudocode sketch after this list).
    • There is a dimension-wise learnable scale parameter in each residual path.
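As a reading aid, here is the gptOrder = true data flow written out as Scala-like pseudocode. attention, norm1, norm2, mlp, scale1 and scale2 stand in for the block's submodules and parameters, and the exact placement of the scale factors is an assumption rather than the literal implementation:

  // gptOrder = true ("pre-norm", GPT-2 style); illustrative pseudocode
  def block(input: Variable): Variable = {
    // the residual additions (+ input, + y) bypass the normalizations;
    // each branch is assumed to carry its per-dimension learnable scale
    val y = scale1 * attention(norm1(input)) + input
    val result = scale2 * mlp(norm2(y)) + y
    result // no final LayerNorm here; add one after the last block if needed
  }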

If gptOrder is false then:

  • y = norm(attention(input) + input)
  • result = norm(mlp(y) + y)
  • This follows chapter 11.7 of d2l.ai v1.0.0-beta0 (same as in https://arxiv.org/pdf/1706.03762.pdf).
  • Note that the residual connection has a path which flows through the normalization (see the sketch below).
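The corresponding pseudocode for gptOrder = false; again attention, norm1, norm2 and mlp are stand-ins, not the literal implementation:

  // gptOrder = false ("post-norm", original Transformer); illustrative pseudocode
  def block(input: Variable): Variable = {
    // here the residual sums flow through the normalizations
    val y = norm1(attention(input) + input)
    val result = norm2(mlp(y) + y)
    result
  }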

Output is (batch, sequence, output dimension)
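A minimal forward-pass sketch; it assumes a block: TransformerEncoderBlock built elsewhere (e.g. through the companion object) and treats the tensor construction (STen.rand, the shapes) as illustrative:

  import lamp._
  import lamp.autograd.{Variable, const}

  Scope.root { implicit scope =>
    // (batch = 8, sequence = 32, input dimension = 64) double tensor
    val data: Variable = const(STen.rand(List(8L, 32L, 64L)))
    // None disables attention masking; pass a 1D or 2D long STen to mask
    val out: Variable = block.forward((data, None))
    println(out.shape) // expected: (batch, sequence, output dimension)
  }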

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
trait GenericModule[(Variable, Option[STen]), Variable]
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def forward[S : Sc](x: (Variable, Option[STen])): Variable

The implementation of the function.

In addition to x it can also use all the `state` to compute its value.

Attributes

def state: Seq[(Constant, PTag)]

List of optimizable, or non-optimizable, but stateful parameters

Stateful means that the state is carried over between repeated forward calls.

Attributes
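A small sketch of inspecting the state: each entry pairs a tensor with a PTag describing which submodule it belongs to (the shape accessor used here is an assumption):

  block.state.foreach { case (tensor, tag) =>
    println(s"$tag -> ${tensor.shape}") // e.g. attention weights, norms, MLP
  }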

Inherited methods

def apply[S : Sc](a: (Variable, Option[STen])): Variable

Alias of forward

Attributes

Inherited from:
GenericModule
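Since apply merely delegates to forward, the two call styles below are interchangeable (block and data as in the earlier sketch, inside a Scope):

  val out1 = block.forward((data, None))
  val out2 = block((data, None)) // same computation via apply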
final def gradients(loss: Variable, zeroGrad: Boolean): Seq[Option[STen]]

Computes the gradient of loss with respect to the parameters.

Attributes

Inherited from:
GenericModule
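A hedged sketch of a manual backward step, inside a Scope as in the usage sketch above: derive a scalar loss from the block's output and collect per-parameter gradients (entries are None for parameters that did not take part in the graph). The .sum reduction is a stand-in objective:

  val out: Variable = block.forward((data, None))
  val loss: Variable = out.sum // stand-in scalar loss
  val grads: Seq[Option[STen]] = block.gradients(loss, zeroGrad = true)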
final def learnableParameters: Long

Returns the total number of optimizable parameters.

Attributes

Inherited from:
GenericModule
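For example, to log the model size:

  println(s"optimizable parameters: ${block.learnableParameters}")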
final def parameters: Seq[(Constant, PTag)]

Returns the state variables which need gradient computation.

Attributes

Inherited from:
GenericModule
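In other words, parameters is the subset of state that takes part in gradient computation; a quick comparison sketch:

  val trainable = block.parameters.size
  val all = block.state.size
  println(s"$trainable of $all state entries are optimizable")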
def productElementNames: Iterator[String]

Attributes

Inherited from:
Product
def productIterator: Iterator[Any]

Attributes

Inherited from:
Product
final def zeroGrad(): Unit

Attributes

Inherited from:
GenericModule
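Typically used to clear accumulated gradients between optimization steps (the zeroGrad flag of gradients above covers the common case):

  block.zeroGrad() // resets the gradient buffers of all parameters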