What is: Factorized Random Synthesized Attention?
Source | Synthesizer: Rethinking Self-Attention in Transformer Models |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
Factorized Random Synthesized Attention, introduced with the Synthesizer architecture, is similar to factorized dense synthesized attention, but for random synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low-rank matrices $R_1, R_2 \in \mathbb{R}^{l \times k}$ in the attention function:

$$Y = \text{Softmax}\left(R_1 R_2^{T}\right)G(X)$$
Here $G(X)$ is a parameterized function that is equivalent to $V$ in Scaled Dot-Product Attention.
For each head, the factorization reduces the parameter count from $l^2$ to $2lk$, where $k \ll l$, and hence helps prevent overfitting. In practice, a small value such as $k = 8$ is used.
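To make the shapes concrete, below is a minimal single-head PyTorch sketch of factorized random synthesized attention. The class name, argument names, and example dimensions (sequence length 64, model dimension 128) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class FactorizedRandomSynthesizerAttention(nn.Module):
    """Minimal single-head sketch (hypothetical implementation, not the paper's code)."""

    def __init__(self, seq_len: int, d_model: int, k: int = 8):
        super().__init__()
        # Low-rank factors R1, R2 of shape (l, k); their product R1 @ R2^T
        # replaces the full l x l randomly initialized attention matrix R.
        self.R1 = nn.Parameter(torch.randn(seq_len, k))
        self.R2 = nn.Parameter(torch.randn(seq_len, k))
        # G(X): value projection, equivalent to V in scaled dot-product attention.
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # The attention weights are input-independent: no queries or keys are computed.
        attn = torch.softmax(self.R1 @ self.R2.T, dim=-1)  # (l, l)
        return attn @ self.g(x)  # Y = Softmax(R1 R2^T) G(X)


# Example usage with assumed dimensions.
x = torch.randn(2, 64, 128)
layer = FactorizedRandomSynthesizerAttention(seq_len=64, d_model=128, k=8)
y = layer(x)
print(y.shape)  # torch.Size([2, 64, 128])
```

As a worked example of the parameter saving, with $l = 64$ and $k = 8$ the factorization stores $2 \cdot 64 \cdot 8 = 1024$ attention parameters per head instead of $64^2 = 4096$ for the full random matrix.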
The basic idea of a Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples.