
What is: Dense Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Dense Synthesized Attention, introduced with the Synthesizer architecture, is a type of synthetic attention mechanism that replaces the notion of query-key-values in the self-attention module and directly synthesizes the alignment matrix instead. Dense attention is conditioned on each input token. The method accepts an input $X \in \mathbb{R}^{l \times d}$ and produces an output $Y \in \mathbb{R}^{l \times d}$, where $l$ refers to the sequence length and $d$ to the dimensionality of the model. We first adopt a parameterized function $F(\cdot)$ for projecting the input $X_i$ from $d$ dimensions to $l$ dimensions:

$$B_i = F(X_i)$$

where $F(\cdot)$ is a parameterized function mapping $\mathbb{R}^{d}$ to $\mathbb{R}^{l}$ and $i$ denotes the $i$-th token of $X$. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length $l$: each token predicts an attention weight for every token in the input sequence. In practice, a simple two-layer feed-forward network with a ReLU activation is adopted for $F(\cdot)$:

$$F(X) = W_2\left(\sigma_R\left(W_1(X) + b_1\right)\right) + b_2$$

where $\sigma_R$ is the ReLU activation function. Hence, $B$ is of $\mathbb{R}^{l \times l}$, holding one row of synthesized attention logits per token. Given $B$, we now compute:
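As a minimal sketch of this projection (assuming a hidden width equal to $d$ and distinct weights for the two layers; the concrete sizes are illustrative, not prescribed by the description above):

```python
import torch
import torch.nn as nn

l, d = 16, 64  # sequence length and model dimension (illustrative values)

# F: a two-layer feed-forward network with a ReLU in between, mapping each
# d-dimensional token representation to l synthesized attention logits.
F = nn.Sequential(
    nn.Linear(d, d),   # first layer (hidden width = d is an assumption)
    nn.ReLU(),         # sigma_R
    nn.Linear(d, l),   # second layer: project to the sequence length l
)

X = torch.randn(l, d)  # input X in R^{l x d}
B = F(X)               # B in R^{l x l}: one row of logits per token
print(B.shape)         # torch.Size([16, 16])
```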

$$Y = \text{Softmax}(B)G(X)$$

where $G(\cdot)$ is another parameterized function of $X$ that is analogous to $V$ (values) in the standard Transformer model. This approach eliminates the dot product altogether, replacing $QK^{T}$ in the standard Transformer with the synthesizing function $F(\cdot)$.
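Putting the pieces together, a minimal single-head sketch of the full layer could look as follows (treating $G$ as a single linear projection and omitting masking, dropout, and multiple heads; the class name and sizes are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class DenseSynthesizedAttention(nn.Module):
    """Illustrative single-head dense synthesized attention.

    The alignment matrix B is predicted directly from each token by F,
    replacing the QK^T dot product of standard self-attention.
    """

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # F: R^d -> R^l, a two-layer ReLU network (hidden width = d_model is an assumption)
        self.f = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),
        )
        # G: analogous to the value projection V in a standard Transformer
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (l, d) with l == max_len in this simplified sketch
        b = self.f(x)                    # (l, l) synthesized alignment logits
        attn = torch.softmax(b, dim=-1)  # row-wise softmax over the sequence
        return attn @ self.g(x)          # Y = Softmax(B) G(X), shape (l, d)

layer = DenseSynthesizedAttention(d_model=64, max_len=16)
Y = layer(torch.randn(16, 64))
print(Y.shape)  # torch.Size([16, 64])
```

Because $F(\cdot)$ projects into the sequence length $l$, the synthesized logits are tied to a fixed maximum length, which is a notable difference from dot-product attention.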