What is: Dense Synthesized Attention?
Source | Synthesizer: Rethinking Self-Attention in Transformer Models |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
Dense Synthesized Attention, introduced with the Synthesizer architecture, is a type of synthetic attention mechanism that replaces the query-key-value interactions of the self-attention module and directly synthesizes the alignment matrix instead. Dense attention is conditioned on each input token. The method accepts an input $X \in \mathbb{R}^{l \times d}$ and produces an output $Y \in \mathbb{R}^{l \times d}$, where $l$ refers to the sequence length and $d$ refers to the dimensionality of the model. We first adopt $F(\cdot)$, a parameterized function, for projecting the input $X_i$ from $d$ dimensions to $l$ dimensions:

$$B_i = F(X_i)$$

where $F(\cdot)$ is a parameterized function that maps $\mathbb{R}^{d}$ to $\mathbb{R}^{l}$ and $X_i$ is the $i$-th token of $X$. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length $l$. Essentially, with this model, each token predicts attention weights for every token in the input sequence. In practice, a simple two-layer feed-forward network with ReLU activations is adopted for $F(\cdot)$:

$$F(X) = W_2\,\sigma_R\!\left(W_1 X + b_1\right) + b_2$$

where $\sigma_R$ is the ReLU activation function. Hence $B$ is now of shape $\mathbb{R}^{l \times l}$. Given $B$, we compute:

$$Y = \operatorname{Softmax}(B)\,G(X)$$

where $G(\cdot)$ is another parameterized function of $X$ that is analogous to $V$ (the values) in the standard Transformer model. This approach eliminates the dot product altogether, replacing $QK^{\top}$ in the standard Transformer with the synthesizing function $F(\cdot)$.
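To make the shapes above concrete, here is a minimal single-head sketch in PyTorch. It is an illustration under stated assumptions: the class name, the fixed `max_len` used to size $W_2$, the slicing to the actual sequence length, and the use of a plain linear layer for $G(\cdot)$ are choices made for this example, not details taken from the source.

```python
import torch
import torch.nn as nn


class DenseSynthesizedAttention(nn.Module):
    """Single-head dense synthesizer sketch: the l x l alignment matrix is
    synthesized from each token individually, with no query-key dot product."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # F(.): two-layer feed-forward net mapping each token from d dims to (at most) l dims
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_len)
        # G(.): analogous to the value projection V in standard attention (assumed linear here)
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d) with l <= max_len
        l = x.size(1)
        # B = F(X): each token predicts a weight for every position -> (batch, l, l)
        b = self.w2(torch.relu(self.w1(x)))[..., :l]  # keep only the first l positions
        attn = torch.softmax(b, dim=-1)               # Softmax(B), rows sum to 1
        return attn @ self.g(x)                       # Y = Softmax(B) G(X) -> (batch, l, d)


# Example: a batch of 2 sequences, 16 tokens each, model width 64
layer = DenseSynthesizedAttention(d_model=64, max_len=128)
y = layer(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

Because the alignment weights come from $F(\cdot)$ rather than a $QK^{\top}$ dot product, the output dimension of $W_2$ must be fixed to a maximum sequence length in advance, which is why the sketch sizes $W_2$ with `max_len` and slices to the actual length.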