Value members
Concrete methods
Linearized dot product attention https://arxiv.org/pdf/2006.16236.pdf
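Replaces exp(a dot b) with f(a) dot f(b), where f is any elementwise function. In the paper f(x) = elu(x) + 1; here f(x) = swish1(x) + 1. Because of this decomposition, a more efficient ordering of the chained matrix multiplication may be used: (Q Kt) V = Q (Kt V), where Kt is the transpose of K. The right-hand ordering avoids materializing the num queries x num k-v pairs intermediate matrix.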
(batch,query) locations where tokens(batch,query) == pad are ignored
- Value parameters:
- key
batch x num k-v pairs x key dim
- pad
scalar long
- query
batch x num queries x key dim
- tokens
batch x num queries, type long
- value
batch x num k-v pairs x value dim
- Returns:
batch x num queries x value dim
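The reordering (Q Kt) V = Q (Kt V) can be checked with a minimal sketch on plain Scala arrays for a single batch element. This does not use the library's tensor API; `LinearAttentionSketch`, `quadratic` and `linear` are hypothetical names. The row normalization follows the formulation in the linked paper and is an assumption about what this method does. Pad masking is left out.

```scala
object LinearAttentionSketch {
  type Matrix = Array[Array[Double]]

  // Plain matrix product: (n x m) * (m x p) = (n x p)
  def matmul(a: Matrix, b: Matrix): Matrix =
    a.map(row => b.head.indices.map(j => row.indices.map(k => row(k) * b(k)(j)).sum).toArray)

  def transpose(a: Matrix): Matrix =
    a.head.indices.map(j => a.map(_(j))).toArray

  // f(x) = swish1(x) + 1, with swish1(x) = x * sigmoid(x)
  def featureMap(x: Double): Double = x / (1.0 + math.exp(-x)) + 1.0

  // Quadratic ordering: (f(Q) f(K)t) V, builds a queries x k-v pairs matrix
  def quadratic(q: Matrix, k: Matrix, v: Matrix): Matrix = {
    val fq = q.map(_.map(featureMap))
    val fk = k.map(_.map(featureMap))
    val scores = matmul(fq, transpose(fk))
    // Assumption: normalize each row so the weights sum to one, as in the paper
    val weights = scores.map { row => val s = row.sum; row.map(_ / s) }
    matmul(weights, v)
  }

  // Linear ordering: f(Q) (f(K)t V), builds only a key dim x value dim matrix
  def linear(q: Matrix, k: Matrix, v: Matrix): Matrix = {
    val fq = q.map(_.map(featureMap))
    val fk = k.map(_.map(featureMap))
    val kv = matmul(transpose(fk), v)   // key dim x value dim
    val kSum = transpose(fk).map(_.sum) // per key dimension, summed over k-v pairs
    fq.map { row =>
      val denom = row.indices.map(i => row(i) * kSum(i)).sum
      matmul(Array(row), kv).head.map(_ / denom)
    }
  }
}
```

Both functions take a num queries x key dim query, a num k-v pairs x key dim key and a num k-v pairs x value dim value, and should agree up to floating point error. Only `linear` avoids the intermediate whose size grows with num queries times num k-v pairs.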
- Value parameters:
- input
batch x seq x ???
- mask
scalar long
- tokens
batch x seq, type long
- Returns:
batch x seq x ???
Multi-head scaled dot product attention
(batch,query) locations where tokens(batch,query) == pad are ignored
- Value parameters:
- key
batch x num k-v pairs x dk
- numHeads
number of output heads; hidden must be divisible by numHeads
- pad
scalar long
- query
batch x num queries x dq
- tokens
batch x num queries, type long
- value
batch x num k-v pairs x dv
- wKeys
dk x hidden
- wOutput
hidden x po
- wQuery
dq x hidden
- wValues
dv x hidden
- Returns:
batch x num queries x po
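The shape bookkeeping above can be traced with a minimal sketch on plain Scala arrays for a single batch element. It does not use the library's tensor API; `MultiHeadSketch` and its helpers are hypothetical names. Splitting the hidden dimension into contiguous column blocks per head and scaling by the square root of the per-head width follow the usual multi-head formulation and are assumptions about this method. Pad masking is left out.

```scala
object MultiHeadSketch {
  type Matrix = Array[Array[Double]]

  // Plain matrix product: (n x m) * (m x p) = (n x p)
  def matmul(a: Matrix, b: Matrix): Matrix =
    a.map(row => b.head.indices.map(j => row.indices.map(k => row(k) * b(k)(j)).sum).toArray)

  // Numerically stable softmax over one row of scores
  def softmax(row: Array[Double]): Array[Double] = {
    val m = row.max
    val e = row.map(x => math.exp(x - m))
    val s = e.sum
    e.map(_ / s)
  }

  // Scaled dot product attention for one head
  def head(q: Matrix, k: Matrix, v: Matrix): Matrix = {
    val scale = math.sqrt(k.head.length.toDouble)
    val weights = q.map(qi => softmax(k.map(kj => qi.indices.map(d => qi(d) * kj(d)).sum / scale)))
    matmul(weights, v)
  }

  // Columns [from, until) of a matrix
  def cols(m: Matrix, from: Int, until: Int): Matrix = m.map(_.slice(from, until))

  def multiHead(
      query: Matrix, key: Matrix, value: Matrix,
      wQuery: Matrix, wKeys: Matrix, wValues: Matrix, wOutput: Matrix,
      numHeads: Int
  ): Matrix = {
    val hidden = wQuery.head.length
    require(hidden % numHeads == 0, "hidden must be divisible by numHeads")
    val headDim = hidden / numHeads
    val q = matmul(query, wQuery)  // num queries x hidden
    val k = matmul(key, wKeys)     // num k-v pairs x hidden
    val v = matmul(value, wValues) // num k-v pairs x hidden
    // Assumption: each head sees one contiguous block of headDim columns
    val perHead = (0 until numHeads).map { h =>
      val start = h * headDim
      val end = start + headDim
      head(cols(q, start, end), cols(k, start, end), cols(v, start, end))
    }
    // Concatenate the heads back to num queries x hidden, then project to po
    val concat = q.indices.map(i => perHead.flatMap(m => m(i).toSeq).toArray).toArray
    matmul(concat, wOutput)
  }
}
```

The `require` call makes the divisibility constraint on numHeads explicit: with hidden = 4, numHeads = 2 gives two heads of width 2, while numHeads = 3 is rejected.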
Scaled dot product attention
(batch,query) locations where tokens(batch,query) == pad are ignored
- Value parameters:
- key
batch x num k-v pairs x key dim
- pad
scalar long
- query
batch x num queries x key dim
- tokens
batch x num queries, type long
- value
batch x num k-v pairs x value dim
- Returns:
batch x num queries x value dim
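A minimal sketch of scaled dot product attention on plain Scala arrays, for a single batch element, may help make the shapes concrete. It does not use the library's tensor API; `ScaledDotProductSketch` and `attention` are hypothetical names. How the library treats ignored positions is not stated here, so the sketch assumes that a query whose token equals pad produces an all-zero output row.

```scala
object ScaledDotProductSketch {
  type Matrix = Array[Array[Double]]

  // Numerically stable softmax over one row of scores
  def softmax(row: Array[Double]): Array[Double] = {
    val m = row.max
    val e = row.map(x => math.exp(x - m))
    val s = e.sum
    e.map(_ / s)
  }

  // query: num queries x key dim, key: num k-v pairs x key dim,
  // value: num k-v pairs x value dim, tokens: one entry per query
  def attention(
      query: Matrix, key: Matrix, value: Matrix,
      tokens: Array[Long], pad: Long
  ): Matrix = {
    val scale = math.sqrt(key.head.length.toDouble)
    val valueDim = value.head.length
    query.indices.map { i =>
      if (tokens(i) == pad) {
        // Assumption: an ignored query position yields a zero row
        Array.fill(valueDim)(0.0)
      } else {
        // Scores of query i against every key, scaled by sqrt(key dim)
        val scores = key.map(kj => kj.indices.map(d => query(i)(d) * kj(d)).sum / scale)
        val weights = softmax(scores)
        // Weighted sum of the value rows
        (0 until valueDim).map(c => weights.indices.map(j => weights(j) * value(j)(c)).sum).toArray
      }
    }.toArray
  }
}
```

The result has one row per query and one column per value dimension, matching the batch x num queries x value dim return shape once a batch axis is added.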