
What is: Factorized Random Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Factorized Random Synthesized Attention, introduced with the Synthesizer architecture, is similar to factorized dense synthesized attention but applies to random synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low-rank matrices $R_1, R_2 \in \mathbb{R}^{l \times k}$ in the attention function:

$$Y = \text{Softmax}\left(R_1 R_2^{T}\right) G\left(X\right).$$

Here $G(\cdot)$ is a parameterized function that is equivalent to $V$ in Scaled Dot-Product Attention.

For each head, the factorization reduces the parameter cost from $l^2$ to $2lk$, where $k \ll l$, and hence helps prevent overfitting. In practice, a small value of $k = 8$ is used.
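To make the mechanism concrete, below is a minimal single-head PyTorch sketch, not the authors' reference implementation; the module name, the `trainable` flag, and the example dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedRandomSynthesizerAttention(nn.Module):
    """Single-head sketch of factorized random synthesized attention.

    The attention matrix is synthesized from two randomly initialized
    low-rank factors R1, R2 of shape (l, k), so it does not depend on
    pairwise token interactions. Hypothetical module; dimensions and the
    `trainable` flag are illustrative assumptions.
    """

    def __init__(self, seq_len: int, d_model: int, k: int = 8, trainable: bool = True):
        super().__init__()
        # R1, R2 replace the usual QK^T interaction; k << seq_len.
        self.r1 = nn.Parameter(torch.randn(seq_len, k), requires_grad=trainable)
        self.r2 = nn.Parameter(torch.randn(seq_len, k), requires_grad=trainable)
        # G(.) plays the role of the value projection V.
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Input-independent attention map of shape (seq_len, seq_len).
        attn = F.softmax(self.r1 @ self.r2.T, dim=-1)
        # Y = Softmax(R1 R2^T) G(X)
        return attn @ self.g(x)


# Usage: with l = 512 and k = 8, the synthesized attention map costs
# 2 * 512 * 8 = 8,192 parameters per head instead of 512**2 = 262,144.
x = torch.randn(2, 512, 64)
layer = FactorizedRandomSynthesizerAttention(seq_len=512, d_model=64, k=8)
y = layer(x)
print(y.shape)  # torch.Size([2, 512, 64])
```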

The basic idea of a Random Synthesizer is not to rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples.