
What is: Attention Free Transformer?

Source: An Attention Free Transformer
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Attention Free Transformer, or AFT, is an efficient variant of a multi-head attention module that eschews dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes.

Given the input $X$, AFT first linearly transforms it into $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, then performs the following operation:

$$
Y = f(X); \quad Y_{t} = \sigma_{q}\left(Q_{t}\right) \odot \frac{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right)}
$$

where $\odot$ is the element-wise product; $\sigma_{q}$ is the nonlinearity applied to the query, with the default being sigmoid; and $w \in \mathbb{R}^{T \times T}$ is the learned pair-wise position bias.
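The following is a minimal PyTorch sketch of this operation (the "AFT-full" form), written as a direct transcription of the formula above. The class and parameter names (`AFTFull`, `dim`, `max_len`) are illustrative and not taken from the paper's code; for clarity the pairwise term is materialized explicitly, whereas a practical implementation would compute it in a numerically stable, memory-efficient way.

```python
import torch
import torch.nn as nn


class AFTFull(nn.Module):
    """Sketch of an AFT-full layer: Y_t = sigmoid(Q_t) * weighted average of V."""

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        # Learned pair-wise position biases w in R^{T x T}
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim)
        B, T, _ = x.shape
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        w = self.pos_bias[:T, :T]  # (T, T), bias w_{t, t'}

        # exp(K_{t'} + w_{t, t'}) for every (target t, source t') pair;
        # broadcasts to shape (batch, T_target, T_source, dim)
        weights = torch.exp(k.unsqueeze(1) + w.unsqueeze(0).unsqueeze(-1))

        num = (weights * v.unsqueeze(1)).sum(dim=2)  # weighted sum of values
        den = weights.sum(dim=2)                     # normalization term

        # sigma_q(Q_t) elementwise-multiplied with the weighted average
        return torch.sigmoid(q) * num / den
```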

Explained in words: for each target position $t$, AFT performs a weighted average of the values, and the result is combined with the query by element-wise multiplication. In particular, the weighting is composed simply of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining global interactions between the query and values as MHA does.
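A quick shape check of the sketch above (all sizes here are hypothetical) shows that the output has the same shape as the input, and no per-head $T \times T$ attention map needs to be kept for the query-value interaction:

```python
layer = AFTFull(dim=64, max_len=128)
x = torch.randn(2, 100, 64)   # (batch, context length T, feature dim)
y = layer(x)
print(y.shape)                # torch.Size([2, 100, 64])
```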