What is: Class Attention?
Source | Going deeper with Image Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
A Class Attention layer, or CA layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.
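To make the contrast with self-attention concrete, here is a minimal single-head sketch in PyTorch; the tensor shapes, patch count, and use of `nn.Linear` are illustrative assumptions, not the CaiT reference implementation. The query is computed from the class embedding alone, while keys and values are computed over the class embedding concatenated with the patch embeddings.

```python
import torch

# Single-head class-attention sketch (shapes and names are assumptions).
d = 64                                  # embedding size
x_class = torch.randn(1, 1, d)          # class embedding (1 token)
x_patches = torch.randn(1, 196, d)      # frozen patch embeddings

z = torch.cat([x_class, x_patches], dim=1)   # z = [x_class, x_patches]

Wq, Wk, Wv = (torch.nn.Linear(d, d) for _ in range(3))

q = Wq(x_class)                         # query from the class token only
k = Wk(z)                               # keys over class + patches
v = Wv(z)                               # values over class + patches

attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (1, 1, 197)
out = attn @ v                          # updates only the class embedding
```

In a plain self-attention layer the query would instead be `Wq(z)`, so every token would be updated; here only the class embedding receives an update, which is what allows the patch embeddings to stay frozen during the class-attention stage.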
Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_q, W_k, W_v, W_o \in \mathbb{R}^{d \times d}$, and the corresponding biases $b_q, b_k, b_v, b_o \in \mathbb{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = [x_{\text{class}}, x_{\text{patches}}]$. We then perform the projections:

$$Q = W_q \, x_{\text{class}} + b_q,$$
$$K = W_k \, z + b_k,$$
$$V = W_v \, z + b_v.$$

The class-attention weights are given by

$$A = \text{Softmax}\left(Q K^{T} / \sqrt{d/h}\right)$$

where $Q K^{T} \in \mathbb{R}^{h \times 1 \times p}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector

$$\text{out}_{\text{CA}} = W_o \, A V + b_o,$$

which is in turn added to $x_{\text{class}}$ for subsequent processing.
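Putting the equations above together, the following is a sketch of the full multi-head CA residual block in PyTorch; the module structure, hyperparameters, and initialization are assumptions for illustration, not the official CaiT code.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Multi-head class-attention residual block (illustrative sketch)."""

    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0
        self.h, self.d_head = h, d // h
        # Projection matrices W_q, W_k, W_v, W_o with biases b_q, b_k, b_v, b_o
        self.W_q = nn.Linear(d, d)
        self.W_k = nn.Linear(d, d)
        self.W_v = nn.Linear(d, d)
        self.W_o = nn.Linear(d, d)

    def forward(self, x_class, x_patches):
        # x_class: (B, 1, d); x_patches: (B, p, d)
        B, p, d = x_patches.shape
        z = torch.cat([x_class, x_patches], dim=1)        # z = [x_class, x_patches]

        # Q = W_q x_class + b_q ; K = W_k z + b_k ; V = W_v z + b_v
        Q = self.W_q(x_class).view(B, 1, self.h, self.d_head).transpose(1, 2)
        K = self.W_k(z).view(B, p + 1, self.h, self.d_head).transpose(1, 2)
        V = self.W_v(z).view(B, p + 1, self.h, self.d_head).transpose(1, 2)

        # A = Softmax(Q K^T / sqrt(d/h)); Q K^T has shape (B, h, 1, p+1) here
        A = torch.softmax(Q @ K.transpose(-2, -1) / (d / self.h) ** 0.5, dim=-1)

        # out_CA = W_o (A V) + b_o, then added to x_class as the residual update
        out = (A @ V).transpose(1, 2).reshape(B, 1, d)
        return x_class + self.W_o(out)


# Usage: one class token attending over 196 patch embeddings.
ca = ClassAttention(d=192, h=4)
x_cls = torch.randn(2, 1, 192)
x_patch = torch.randn(2, 196, 192)
print(ca(x_cls, x_patch).shape)   # torch.Size([2, 1, 192])
```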