
What is: Class Attention?

Source: Going deeper with Image Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

A Class Attention layer, or CA layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.

Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_q, W_k, W_v, W_o \in \mathbf{R}^{d \times d}$, and the corresponding biases $b_q, b_k, b_v, b_o \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = [x_{\text{class}}, x_{\text{patches}}]$. We then perform the projections:

$$Q = W_q \, x_{\text{class}} + b_q$$

$$K = W_k \, z + b_k$$

$$V = W_v \, z + b_v$$

The class-attention weights are given by

$$A = \operatorname{Softmax}\left(Q K^{T} / \sqrt{d/h}\right)$$

where $Q K^{T} \in \mathbf{R}^{h \times 1 \times p}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector

$$\operatorname{out}_{\mathrm{CA}} = W_o \, A V + b_o$$

which is in turn added to $x_{\text{class}}$ for subsequent processing.
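
As a concrete reference, below is a minimal PyTorch sketch of the class-attention residual block following the equations above. It is an illustrative implementation, not the exact CaiT code: the module name `ClassAttention`, the argument names, and the assumption of a batched `(batch, tokens, dim)` tensor layout are all choices made for this example.

```python
import torch
import torch.nn as nn


class ClassAttention(nn.Module):
    """Multi-head class attention: the class token attends to itself plus the patches."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "embedding size must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1 / sqrt(d / h)

        # Projection matrices W_q, W_k, W_v, W_o with biases b_q, b_k, b_v, b_o
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_class: torch.Tensor, x_patches: torch.Tensor) -> torch.Tensor:
        # x_class:   (B, 1, d)  class embedding
        # x_patches: (B, p, d)  patch embeddings (left unchanged by this block)
        B, _, d = x_class.shape

        # z = [x_class, x_patches]  ->  (B, 1 + p, d)
        z = torch.cat([x_class, x_patches], dim=1)

        # Q is computed from the class token only; K and V are computed from z.
        q = self.q(x_class).reshape(B, 1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k(z).reshape(B, z.shape[1], self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(z).reshape(B, z.shape[1], self.num_heads, self.head_dim).transpose(1, 2)

        # A = Softmax(Q K^T / sqrt(d / h)), one row of attention weights per head
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        # out_CA = W_o (A V) + b_o, reshaped back to (B, 1, d)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, d)
        out = self.proj(out)

        # Residual connection: only the class embedding is updated.
        return x_class + out
```

A quick usage example with toy sizes (16 patch tokens of dimension 192, 4 heads):

```python
layer = ClassAttention(dim=192, num_heads=4)
cls_tok = torch.randn(2, 1, 192)
patches = torch.randn(2, 16, 192)
updated_cls = layer(cls_tok, patches)  # shape (2, 1, 192)
```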