TransformerEncoderBlock

lamp.nn.TransformerEncoderBlock
See the TransformerEncoderBlock companion object
case class TransformerEncoderBlock(attention: MultiheadAttention, layerNorm1: LayerNorm, layerNorm2: LayerNorm, w1: Constant, b1: Constant, w2: Constant, b2: Constant, scale1: Constant, scale2: Constant, dropout: Double, train: Boolean, gptOrder: Boolean) extends GenericModule[(Variable, Option[STen]), Variable]

A single block of the transformer self-attention encoder, using the GELU activation.

Input is (data, maxLength), where data is a (batch, sequence, input dimension) double tensor and maxLength is a 1D or 2D long tensor used for attention masking.

The order of operations depends on the gptOrder parameter. If gptOrder is true, then:

  • y = attention(norm(input)) + input
  • result = mlp(norm(y)) + y
  • Note that in this case there is no normalization at the end of the transformer; one may want to add one separately. This is how GPT-2 is defined in Hugging Face or nanoGPT.
  • Note that the residual connection has a path which does not flow through the normalization (see the pseudocode sketch after this list).
    • There is a dimension-wise learnable scale parameter in each residual path.
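As a reading aid, here is the gptOrder = true data flow written out as Scala-like pseudocode. attention, norm1, norm2, mlp, scale1 and scale2 stand in for the block's submodules and parameters, and the exact placement of the scale factors is an assumption rather than the literal implementation:

  // gptOrder = true ("pre-norm", GPT-2 style); illustrative pseudocode
  def block(input: Variable): Variable = {
    // the residual additions (+ input, + y) bypass the normalizations;
    // each branch is assumed to carry its per-dimension learnable scale
    val y = scale1 * attention(norm1(input)) + input
    val result = scale2 * mlp(norm2(y)) + y
    result // no final LayerNorm here; add one after the last block if needed
  }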

If gptOrder is false then:

  • y = norm(attention(input) + input)
  • result = norm(mlp(y) + y)
  • This follows chapter 11.7 of d2l.ai v1.0.0-beta0 (same as in https://arxiv.org/pdf/1706.03762.pdf).
  • Note that the residual connection has a path which flows through the normalization (see the sketch below).
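The corresponding pseudocode for gptOrder = false; again attention, norm1, norm2 and mlp are stand-ins, not the literal implementation:

  // gptOrder = false ("post-norm", original Transformer); illustrative pseudocode
  def block(input: Variable): Variable = {
    // here the residual sums flow through the normalizations
    val y = norm1(attention(input) + input)
    val result = norm2(mlp(y) + y)
    result
  }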

Output is (batch, sequence, output dimension)
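A minimal forward-pass sketch; it assumes a block: TransformerEncoderBlock built elsewhere (e.g. through the companion object) and treats the tensor construction (STen.rand, the shapes) as illustrative:

  import lamp._
  import lamp.autograd.{Variable, const}

  Scope.root { implicit scope =>
    // (batch = 8, sequence = 32, input dimension = 64) double tensor
    val data: Variable = const(STen.rand(List(8L, 32L, 64L)))
    // None disables attention masking; pass a 1D or 2D long STen to mask
    val out: Variable = block.forward((data, None))
    println(out.shape) // expected: (batch, sequence, output dimension)
  }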

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
trait GenericModule[(Variable, Option[STen]), Variable]
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def forward[S : Sc](x: (Variable, Option[STen])): Variable

The implementation of the function.

In addition to x it can also use all the `state` to compute its value.

Attributes

def state: Seq[(Constant, PTag)]

List of optimizable, or non-optimizable, but stateful parameters

Stateful means that the state is carried over between repeated forward calls.

Attributes
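A small sketch of inspecting the state: each entry pairs a tensor with a PTag describing which submodule it belongs to (the shape accessor used here is an assumption):

  block.state.foreach { case (tensor, tag) =>
    println(s"$tag -> ${tensor.shape}") // e.g. attention weights, norms, MLP
  }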

Inherited methods

def apply[S : Sc](a: (Variable, Option[STen])): Variable

Alias of forward

Attributes

Inherited from:
GenericModule
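Since apply merely delegates to forward, the two call styles below are interchangeable (block and data as in the earlier sketch, inside a Scope):

  val out1 = block.forward((data, None))
  val out2 = block((data, None)) // same computation via apply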
final def gradients(loss: Variable, zeroGrad: Boolean): Seq[Option[STen]]

Computes the gradient of loss with respect to the parameters.

Attributes

Inherited from:
GenericModule
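A hedged sketch of a manual backward step, inside a Scope as in the usage sketch above: derive a scalar loss from the block's output and collect per-parameter gradients (entries are None for parameters that did not take part in the graph). The .sum reduction is a stand-in objective:

  val out: Variable = block.forward((data, None))
  val loss: Variable = out.sum // stand-in scalar loss
  val grads: Seq[Option[STen]] = block.gradients(loss, zeroGrad = true)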
final def learnableParameters: Long

Returns the total number of optimizable parameters.

Attributes

Inherited from:
GenericModule
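For example, to log the model size:

  println(s"optimizable parameters: ${block.learnableParameters}")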
final def parameters: Seq[(Constant, PTag)]

Returns the state variables which need gradient computation.

Attributes

Inherited from:
GenericModule
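In other words, parameters is the subset of state that takes part in gradient computation; a quick comparison sketch:

  val trainable = block.parameters.size
  val all = block.state.size
  println(s"$trainable of $all state entries are optimizable")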
def productElementNames: Iterator[String]

Attributes

Inherited from:
Product
def productIterator: Iterator[Any]

Attributes

Inherited from:
Product
final def zeroGrad(): Unit

Attributes

Inherited from:
GenericModule
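Typically used to clear accumulated gradients between optimization steps (the zeroGrad flag of gradients above covers the common case):

  block.zeroGrad() // resets the gradient buffers of all parameters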