
What is: Factorized Dense Synthesized Attention?

Source: Synthesizer: Rethinking Self-Attention in Transformer Models
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Factorized Dense Synthesized Attention is a synthesized attention mechanism, similar to dense synthesized attention, but with the outputs factorized to reduce the parameter count and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:

$$A, B = F_A(X_i), F_B(X_i)$$

where $F_A(\cdot)$ projects input $X_i$ into $a$ dimensions, $F_B(\cdot)$ projects $X_i$ to $b$ dimensions, and $a \times b = l$. The output of the factorized module is now written as:

$$Y = \text{Softmax}(C)\,G(X)$$

where $C = H_A(A) * H_B(B)$, with $H_A$ and $H_B$ being tiling functions and $C \in \mathbb{R}^{l \times l}$. A tiling function simply duplicates a vector $k$ times, i.e., $\mathbb{R}^{l} \rightarrow \mathbb{R}^{lk}$. In this case, $H_A(\cdot)$ is a projection of $\mathbb{R}^{a} \rightarrow \mathbb{R}^{ab}$ and $H_B(\cdot)$ is a projection of $\mathbb{R}^{b} \rightarrow \mathbb{R}^{ba}$. To avoid having similar values within the same block, the outputs of $H_A$ and $H_B$ are composed.
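A minimal PyTorch sketch may help make the shapes concrete. The module and parameter names (`FactorizedDenseSynthesizer`, `d_model`, `a`, `b`), the choice of two-layer MLPs for $F_A$ and $F_B$, and the exact interleaving used to compose the two tilings are illustrative assumptions rather than details fixed by the paper:

```python
import torch
import torch.nn as nn


class FactorizedDenseSynthesizer(nn.Module):
    """Sketch of factorized dense synthesized attention for a fixed
    sequence length l = a * b (single head, no masking)."""

    def __init__(self, d_model: int, a: int, b: int):
        super().__init__()
        self.a, self.b = a, b
        self.seq_len = a * b  # l = a * b
        # F_A and F_B: per-token projections to a and b dimensions
        # (two-layer MLPs, assumed by analogy with the dense synthesizer)
        self.f_a = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, a))
        self.f_b = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, b))
        # G: value projection applied to the input X
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d_model) with l == a * b
        batch, l, _ = x.shape
        assert l == self.seq_len, "sequence length must equal a * b"
        A = self.f_a(x)  # (batch, l, a)
        B = self.f_b(x)  # (batch, l, b)
        # Tile A and B up to length l = a * b. Interleaving one side and
        # tiling the other composes the two outputs so that every pair
        # (A_i, B_j) contributes one distinct logit (an assumed choice).
        h_a = A.repeat_interleave(self.b, dim=-1)  # R^a -> R^{ab}
        h_b = B.repeat(1, 1, self.a)               # R^b -> R^{ba}
        C = h_a * h_b                              # (batch, l, l) logits
        # Attend with the synthesized weights: Y = Softmax(C) G(X)
        return torch.softmax(C, dim=-1) @ self.g(x)
```

Compared to the unfactorized dense synthesizer, which predicts all $l$ attention logits per token directly, each token here only produces $a + b$ values; for example, with $l = 64$ one could take $a = b = 8$, so each token emits $16$ values instead of $64$, which is where the parameter reduction comes from.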