What is: Fast Attention Via Positive Orthogonal Random Features?

Source: Rethinking Attention with Performers
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

FAVOR+, or Fast Attention Via Positive Orthogonal Random Features, is an efficient attention mechanism used in the Performer architecture. It leverages kernel methods and random feature approximation to approximate the softmax and Gaussian kernels, allowing attention to be computed in time and space that scale linearly with the sequence length.

FAVOR+ works for attention blocks using matrices $\mathbf{A} \in \mathbb{R}^{L \times L}$ of the form $\mathbf{A}(i, j) = K(\mathbf{q}_i^T, \mathbf{k}_j^T)$, with $\mathbf{q}_i/\mathbf{k}_j$ standing for the $i^{th}/j^{th}$ query/key row-vector in $\mathbf{Q}/\mathbf{K}$, and the kernel $K : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}_+$ defined for the (usually randomized) mapping $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^r_+$ (for some $r > 0$) as:

$$K(\mathbf{x}, \mathbf{y}) = \mathbb{E}[\phi(\mathbf{x})^T \phi(\mathbf{y})]$$
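
To make the identity concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) that estimates the softmax kernel $SM(\mathbf{x}, \mathbf{y}) = \exp(\mathbf{x}^T\mathbf{y})$ with the positive random features proposed in the Performer paper; the function name and the choice of $d$ and $r$ are ours:

```python
import numpy as np

def softmax_kernel_features(x, W):
    """Positive random features for the softmax kernel SM(x, y) = exp(x^T y):
    phi(x) = exp(W x - ||x||^2 / 2) / sqrt(r), with the rows of W drawn i.i.d.
    from N(0, I_d), following the PRF construction in the Performer paper."""
    r = W.shape[0]
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(r)

rng = np.random.default_rng(0)
d, r = 16, 10_000                       # input dimension, number of random features
x = 0.3 * rng.normal(size=d)
y = 0.3 * rng.normal(size=d)
W = rng.normal(size=(r, d))             # i.i.d. Gaussian projection matrix

estimate = softmax_kernel_features(x, W) @ softmax_kernel_features(y, W)
exact = np.exp(x @ y)
print(estimate, exact)                  # the Monte Carlo estimate approaches exact as r grows
```

Because $\phi$ maps into $\mathbb{R}^r_+$, every term of the Monte Carlo average is positive, which is what keeps the estimator stable where the kernel values are small.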

We call $\phi(\mathbf{u})$ a random feature map for $\mathbf{u} \in \mathbb{R}^d$. For $\mathbf{Q}', \mathbf{K}' \in \mathbb{R}^{L \times r}$ with rows given as $\phi(\mathbf{q}_i^T)^T$ and $\phi(\mathbf{k}_i^T)^T$ respectively, this leads directly to the efficient attention mechanism of the form:

$$\widehat{\mathrm{Att}_{\leftrightarrow}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \hat{\mathbf{D}}^{-1}(\mathbf{Q}'((\mathbf{K}')^{T}\mathbf{V}))$$

where

$$\hat{\mathbf{D}} = \text{diag}(\mathbf{Q}'((\mathbf{K}')^{T}\mathbf{1}_L))$$
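
Parenthesizing the product as $\mathbf{Q}'((\mathbf{K}')^T\mathbf{V})$ avoids ever materializing the $L \times L$ attention matrix, giving $O(Lrd)$ rather than $O(L^2 d)$ cost. Below is a minimal NumPy sketch of this bidirectional (noncausal) variant, the one denoted by $\leftrightarrow$ above; the function and variable names are ours:

```python
import numpy as np

def favor_attention(Q_prime, K_prime, V):
    """Noncausal FAVOR-style attention: D_hat^{-1} (Q' ((K')^T V)).
    Q_prime, K_prime: (L, r) random-feature matrices; V: (L, d) values."""
    L = K_prime.shape[0]
    KV = K_prime.T @ V                               # (r, d): contract keys with values first
    numerator = Q_prime @ KV                         # (L, d), never forms the L x L matrix
    normalizer = Q_prime @ (K_prime.T @ np.ones(L))  # (L,), i.e. Q'((K')^T 1_L)
    return numerator / normalizer[:, None]           # row-wise division applies diag(...)^{-1}
```

Applying $\hat{\mathbf{D}}^{-1}$ reduces to dividing each output row by its normalizer, so the whole computation stays linear in $L$.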

The above scheme constitutes the FA-part of the FAVOR+ mechanism. The other parts are achieved by:

  • The R+ part : rather than approximating the softmax kernel through trigonometric functions, whose estimates can be negative and unstable, it is approximated with positive random features (PRFs), which also give rise to a regularized softmax-kernel, SMREG.
  • The OR part : to reduce the variance of the estimator, so that a smaller number of random features suffices, the different random samples are entangled to be exactly orthogonal using the Gram-Schmidt orthogonalization procedure (a sketch follows below).
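
Here is a minimal sketch of the orthogonalization step (assuming the number of features $r$ does not exceed the dimensionality $d$; for $r > d$ the paper stacks independent orthogonal blocks, which is omitted here). `np.linalg.qr` stands in for explicit Gram-Schmidt, and the function name is ours:

```python
import numpy as np

def orthogonal_gaussian_projections(r, d, rng):
    """Draw an (r, d) projection matrix whose rows are exactly orthogonal.
    Assumes r <= d; QR factorization is equivalent to Gram-Schmidt here."""
    G = rng.normal(size=(d, r))
    Q, _ = np.linalg.qr(G)                   # (d, r) with orthonormal columns
    # Rescale each row so its norm matches that of an i.i.d. Gaussian vector,
    # preserving the marginal distribution of projection lengths, as in the paper.
    norms = np.linalg.norm(rng.normal(size=(r, d)), axis=1)
    return Q.T * norms[:, None]

rng = np.random.default_rng(0)
W = orthogonal_gaussian_projections(8, 16, rng)
print(np.round(W @ W.T, 6))                  # off-diagonal entries are exactly zero
```

This `W` can be dropped into the earlier `softmax_kernel_features` sketch in place of the i.i.d. Gaussian matrix, reducing the variance of the kernel estimates.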

The details are quite technical, so it is recommended that you read the paper for further information on these steps.