MultiheadAttention

lamp.nn.MultiheadAttention$
See the MultiheadAttention companion class

Attributes

Companion
class
Graph
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
MultiheadAttention.type

Members list

Type members

Classlikes

case object WeightsK extends LeafTag

Attributes

Supertypes
trait Singleton
trait Product
trait Mirror
trait Serializable
trait Product
trait Equals
trait LeafTag
trait PTag
class Object
trait Matchable
class Any
Self type
WeightsK.type
case object WeightsO extends LeafTag

Attributes

Supertypes
trait Singleton
trait Product
trait Mirror
trait Serializable
trait Product
trait Equals
trait LeafTag
trait PTag
class Object
trait Matchable
class Any
Self type
WeightsO.type
case object WeightsQ extends LeafTag

Attributes

Supertypes
trait Singleton
trait Product
trait Mirror
trait Serializable
trait Product
trait Equals
trait LeafTag
trait PTag
class Object
trait Matchable
class Any
Self type
WeightsQ.type
case object WeightsV extends LeafTag

Attributes

Supertypes
trait Singleton
trait Product
trait Mirror
trait Serializable
trait Product
trait Equals
trait LeafTag
trait PTag
class Object
trait Matchable
class Any
Self type
WeightsV.type

Inherited types

type MirroredElemLabels <: Tuple

The names of the product elements

Attributes

Inherited from:
Mirror
type MirroredLabel <: String

The name of the type

Attributes

Inherited from:
Mirror

Value members

Concrete methods

def apply[S : Sc](dQ: Int, dK: Int, dV: Int, hiddenPerHead: Int, out: Int, dropout: Double, numHeads: Int, tOpt: STenOptions, linearized: Boolean, causalMask: Boolean): MultiheadAttention
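
A minimal construction sketch follows; the argument values are illustrative, and Scope.root and STenOptions.f (float32 CPU options) are assumed from lamp's core API rather than taken from this page.

```scala
import lamp._
import lamp.nn._

object MultiheadAttentionExample {
  def main(args: Array[String]): Unit =
    Scope.root { implicit scope =>
      // Four attention heads over 32-dimensional queries, keys and values,
      // 16 hidden dimensions per head, projected to a 32-dimensional output.
      val attention = MultiheadAttention(
        dQ = 32,
        dK = 32,
        dV = 32,
        hiddenPerHead = 16,
        out = 32,
        dropout = 0.1,
        numHeads = 4,
        tOpt = STenOptions.f, // float32 CPU tensor options (assumed)
        linearized = false,   // use scaled dot product attention
        causalMask = false    // no left-to-right masking
      )
      val _ = attention // wire into a larger module as needed
    }
}
```
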
def linearizedAttention[S : Sc](query: Variable, keys: Variable, values: Variable, maxLength: Option[STen], dropout: Double, trainDropout: Boolean): Variable

Linearized dot product attention https://arxiv.org/pdf/2006.16236.pdf

Replaces exp(a dot b) with f(a) dot f(b), where f is any elementwise function; in the paper f(x) = elu(x)+1, here f(x) = swish1(x)+1. Due to this decomposition, a more efficient association of the chained matrix multiplication may be used: (Q Kt) V = Q (Kt V).

Applies masking according to maskedSoftmax.

Value parameters

key

batch x num k-v pairs x key dim

maxLength

batch x num queries OR batch, type long

query

batch x num queries x key dim

value

batch x num k-v pairs x value dim

Attributes

Returns

batch x num queries x value dim
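
The efficiency of the re-association can be seen from a rough operation count. A lamp-independent sketch, with illustrative sizes n (number of queries/keys) and d (feature dimension after applying f):

```scala
@main def linearizedAttentionCost(): Unit = {
  val n = 4096L // number of queries / key-value pairs (illustrative)
  val d = 64L   // feature dimension after applying f (illustrative)
  // (Q Kt) V materializes an n x n matrix: roughly O(n^2 * d) multiplications
  val quadratic = n * n * d + n * n * d
  // Q (Kt V) materializes a d x d matrix: roughly O(n * d^2) multiplications
  val linear = n * d * d + n * d * d
  println(s"(Q Kt) V ~ $quadratic multiplications, Q (Kt V) ~ $linear multiplications")
}
```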

def maskedSoftmax[S : Sc](input: Variable, maxLength: STen): Variable

Value parameters

input

batch x seq x ???

maxLength

batch x seq OR batch, type long

Attributes

Returns

batch x seq x ???
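
A lamp-independent sketch of the masking semantics for a single (batch, query) row of scores, assuming positions at or beyond the valid length receive zero probability:

```scala
// Softmax over one row of scores where positions k >= validLength are set to
// negative infinity before normalization (illustrative, not lamp's API).
def maskedSoftmaxRow(scores: Array[Double], validLength: Int): Array[Double] = {
  val masked = scores.zipWithIndex.map { case (x, k) =>
    if (k >= validLength) Double.NegativeInfinity else x
  }
  val max = masked.max
  val exps = masked.map(x => math.exp(x - max))
  val sum = exps.sum
  exps.map(_ / sum)
}
```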

def multiheadAttention[S : Sc](query: Variable, keys: Variable, values: Variable, maxLength: Option[STen], dropout: Double, trainDropout: Boolean, wQuery: Variable, wKeys: Variable, wValues: Variable, wOutput: Variable, numHeads: Int, linearized: Boolean, causalMask: Boolean): Variable

Multi-head scaled dot product attention

See chapter 11.5 in d2l v1.0.0-beta0

Attention masking is implemented similarly to chapter 11.3.2.1 in d2l.ai v1.0.0-beta0. It supports unmasked attention, attention on variable length input, and left-to-right attention.

Value parameters

key

batch x num k-v pairs x dk

linearized

If true, uses linearized attention; if false, uses scaled dot product attention.

maxLength

batch x num queries OR batch, type long

numHeads

number of attention heads; the hidden dimension must be divisible by numHeads

query

batch x num queries x dq

value

batch x num k-v pairs x dv

wKeys

dk x hidden

wOutput

hidden x po

wQuery

dq x hidden

wValues

dv x hidden

Attributes

Returns

batch x num queries x po
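
A lamp-independent sketch of the shape bookkeeping implied by the parameters above; all sizes are illustrative:

```scala
@main def multiheadAttentionShapes(): Unit = {
  val (batch, numQueries, numKv) = (8, 10, 12)
  val (dq, dk, dv, hidden, po, numHeads) = (32, 32, 32, 64, 48, 4)
  require(hidden % numHeads == 0, "hidden must be divisible by numHeads")

  // Projections: (batch, seq, dIn) x (dIn, hidden) -> (batch, seq, hidden)
  val projectedQ = (batch, numQueries, hidden) // query  x wQuery  (dq x hidden)
  val projectedK = (batch, numKv, hidden)      // keys   x wKeys   (dk x hidden)
  val projectedV = (batch, numKv, hidden)      // values x wValues (dv x hidden)

  // Each head attends over hidden / numHeads features; the heads are
  // concatenated back to `hidden` and projected by wOutput (hidden x po).
  val perHead = hidden / numHeads
  val output = (batch, numQueries, po)
  println(s"per head: $perHead, q/k/v: $projectedQ $projectedK $projectedV, out: $output")
}
```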

def scaledDotProductAttention[S : Sc](query: Variable, keys: Variable, values: Variable, maxLength: Option[STen], dropout: Double, trainDropout: Boolean): Variable

Scaled dot product attention

if maxLength is 2D: (batch, query, key) locations where key >= maxLength(batch, query) are masked (ignored).

if maxLength is 1D: (batch, query, key) locations where key >= maxLength(batch) are masked (ignored).

See chapter 11.3.3 in d2l v1.0.0-beta0

Value parameters

key

batch x num k-v pairs x key dim

maxLength

batch x num queries OR batch, type long

query

batch x num queries x key dim

value

batch x num k-v pairs x value dim

Attributes

Returns

batch x num queries x value dim
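
A lamp-independent sketch of the unmasked computation for a single batch element, softmax(Q Kt / sqrt(key dim)) V, with shapes as in the parameter list above:

```scala
// Unmasked scaled dot product attention for one batch element (illustrative, not lamp's API).
// q: numQueries x keyDim, k: numKv x keyDim, v: numKv x valueDim; returns numQueries x valueDim.
def scaledDotProductAttentionSingle(
    q: Array[Array[Double]],
    k: Array[Array[Double]],
    v: Array[Array[Double]]
): Array[Array[Double]] = {
  val scale = 1.0 / math.sqrt(q.head.length.toDouble)
  q.map { qRow =>
    // Scaled dot product of this query with every key
    val scores = k.map(kRow => qRow.zip(kRow).map { case (a, b) => a * b }.sum * scale)
    val max = scores.max
    val exps = scores.map(s => math.exp(s - max))
    val sum = exps.sum
    val weights = exps.map(_ / sum)
    // Weighted sum of the value rows
    v.head.indices.map(j => weights.zip(v).map { case (w, row) => w * row(j) }.sum).toArray
  }
}
```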

def sequenceMask[S : Sc](maxLength: STen, maskable: Variable, fill: Double): Variable

Masks on the 3rd axis of maskable depending on the dimensions of maxLength

if maxLength is 2D: maskable(batch, seq, k) cells where k >= maxLength(batch, seq) are masked (filled).

if maxLength is 1D: maskable(batch, seq, k) cells where k >= maxLength(batch) are masked (filled).

Attributes

def sequenceMaskValidLength1D[S : Sc](maxLength: STen, maskable: Variable, fill: Double): Variable

Masks the maskable(i,j,k) cell iff k >= maxLength(i)

Value parameters

fill

scalar

maskable

batch x seq x ???

maxLength

batch, type Long

Attributes
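
A lamp-independent sketch of this rule on a nested-array representation, with shapes as in the parameter list above:

```scala
// Fill maskable(i)(j)(k) with `fill` iff k >= maxLength(i) (illustrative, not lamp's API).
def sequenceMaskValidLength1DSketch(
    maxLength: Array[Int],                 // batch
    maskable: Array[Array[Array[Double]]], // batch x seq x ???
    fill: Double
): Array[Array[Array[Double]]] =
  maskable.zip(maxLength).map { case (batchSlice, len) =>
    batchSlice.map(_.zipWithIndex.map { case (x, k) => if (k >= len) fill else x })
  }
```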

def sequenceMaskValidLength2D[S : Sc](maxLength: STen, maskable: Variable, fill: Double): Variable

Masks the maskable(i,j,k) cell iff k >= maxLength(i,j)

Masks some elements on the last (3rd) axis of maskable

Value parameters

fill

scalar

maskable

batch x seq x ???

maxLength

batch x seq, type Long

Attributes
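
The 2D variant, sketched the same way on nested arrays (shapes as in the parameter list above):

```scala
// Fill maskable(i)(j)(k) with `fill` iff k >= maxLength(i)(j) (illustrative, not lamp's API).
def sequenceMaskValidLength2DSketch(
    maxLength: Array[Array[Int]],          // batch x seq
    maskable: Array[Array[Array[Double]]], // batch x seq x ???
    fill: Double
): Array[Array[Array[Double]]] =
  maskable.zip(maxLength).map { case (batchSlice, lens) =>
    batchSlice.zip(lens).map { case (row, len) =>
      row.zipWithIndex.map { case (x, k) => if (k >= len) fill else x }
    }
  }
```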

Implicits

implicit val load: Load[MultiheadAttention]