What is: Cross-Covariance Attention?

Source: XCiT: Cross-Covariance Image Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Cross-Covariance Attention, or XCA, is an attention mechanism that operates along the feature dimension rather than the token dimension used in conventional transformers.

Using the definitions of queries, keys and values from conventional attention, the cross-covariance attention function is defined as:

$$\text{XC-Attention}(Q, K, V)=V \mathcal{A}_{\mathrm{XC}}(K, Q), \quad \mathcal{A}_{\mathrm{XC}}(K, Q)=\operatorname{Softmax}\left(\hat{K}^{\top} \hat{Q} / \tau\right)$$

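where each output token embedding is a convex combination of the $d_v$ features of its corresponding token embedding in $V$. The attention weights $\mathcal{A}$ are computed from the cross-covariance matrix $\hat{K}^{\top} \hat{Q}$, where $\hat{K}$ and $\hat{Q}$ are the $\ell_2$-normalized key and query matrices and $\tau$ is a temperature parameter.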
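
The definition above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the XCiT reference implementation. The function names `softmax` and `xc_attention`, the `eps` guard, and the fixed scalar `tau` are assumptions made for the example. It also assumes that $Q$, $K$ and $V$ are stored token-major with shape `(N, d)`, that the $\ell_2$ normalization runs along the token axis, and that the softmax is taken so that each column of the attention map sums to 1, which is what makes every output feature a convex combination of the features of the same token in $V$.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def xc_attention(Q, K, V, tau=1.0, eps=1e-12):
    """Single-head cross-covariance attention (illustrative sketch).

    Q, K, V: arrays of shape (N, d), with N tokens and d features.
    Returns the output of shape (N, d) and the (d, d) attention map.
    """
    # L2-normalize each feature column over the token axis (assumed convention).
    Q_hat = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + eps)
    K_hat = K / (np.linalg.norm(K, axis=0, keepdims=True) + eps)

    # (d, d) cross-covariance between key and query features, scaled by tau.
    # Softmax over axis 0 makes every column of A sum to 1.
    A = softmax(K_hat.T @ Q_hat / tau, axis=0)

    # Each output token mixes the features of the same token in V.
    return V @ A, A


rng = np.random.default_rng(0)
N, d = 6, 4  # hypothetical sizes chosen for the example
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, A = xc_attention(Q, K, V, tau=0.5)
print(out.shape, A.shape)
```

Note that the attention map has shape `(d, d)` regardless of the number of tokens `N`, whereas token self-attention would produce an `(N, N)` map for the same inputs.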