What is: Semantic Cross Attention?
Source | SCAM! Transferring humans between images with Semantic Cross Attention Modulation |
Year | 2022 |
Data Source | CC BY-SA - https://paperswithcode.com |
Semantic Cross Attention (SCA) is based on cross attention, which we restrict with respect to a semantic mask.
The goal of SCA is two-fold, depending on which input provides the queries and which provides the keys: it either allows the feature map to gather information from a semantically restricted set of latents or, conversely, allows a set of latents to retrieve information from a semantically restricted region of the feature map.
SCA is defined as:
\begin{equation} \text{SCA}(I_{1}, I_{2}, I_{3}) = \sigma\left(\frac{QK^T\odot I_{3} +\tau \left(1-I_{3}\right)}{\sqrt{d_{in}}}\right)V \quad , \end{equation}
where $I_1$ and $I_2$ are the inputs, with $I_1$ attending $I_2$, and $I_3 \in \{0,1\}^{|I_1| \times |I_2|}$ is the mask that forces tokens from $I_1$ to attend only specific tokens from $I_2$. The attention values requiring masking are filled with $\tau = -\infty$ before the softmax (in practice, a large negative value). $Q = W_q I_1$, $K = W_k I_2$, and $V = W_v I_2$ are the queries, keys and values, and $d_{in}$ is the internal attention dimension. $\sigma$ is the softmax operation.
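As a minimal sketch of the equation above (not the authors' released implementation; the class, parameter names, and single-head setup are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCrossAttention(nn.Module):
    """Single-head SCA sketch: I1 attends I2, restricted by a binary
    mask I3 of shape (|I1|, |I2|)."""

    def __init__(self, dim_1, dim_2, dim_in):
        super().__init__()
        self.w_q = nn.Linear(dim_1, dim_in, bias=False)  # queries from I1
        self.w_k = nn.Linear(dim_2, dim_in, bias=False)  # keys from I2
        self.w_v = nn.Linear(dim_2, dim_in, bias=False)  # values from I2
        self.dim_in = dim_in

    def forward(self, i1, i2, i3):
        # i1: (n1, dim_1), i2: (n2, dim_2), i3: (n1, n2) binary mask
        q, k, v = self.w_q(i1), self.w_k(i2), self.w_v(i2)
        logits = q @ k.transpose(-1, -2)            # (n1, n2) attention logits
        tau = -1e9                                   # large negative value standing in for -inf
        logits = logits * i3 + tau * (1.0 - i3)      # mask before the softmax
        attn = F.softmax(logits / self.dim_in ** 0.5, dim=-1)
        return attn @ v                              # (n1, dim_in)
```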
Let $X \in \mathbb{R}^{n \times C}$ be the feature map, with $n$ the number of pixels and $C$ the number of channels. Let $Z \in \mathbb{R}^{N_l \times d}$ be a set of $N_l$ latents of dimension $d$, and let $N_s$ be the number of semantic labels. Each semantic label is attributed $k$ latents, such that $N_l = k N_s$. Each semantic label mask $S \in \{0,1\}^{n \times N_s}$ is assigned $k$ copies in $S_k \in \{0,1\}^{n \times N_l}$.
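To make the latent bookkeeping concrete, here is a small helper that expands $S$ to $S_k$; the label-major layout (the $k$ copies of each label's column kept adjacent) is an assumption of this sketch, not something the source specifies:

```python
import torch

def expand_semantic_mask(s, k):
    """Expand the per-label mask S of shape (n, N_s) into S_k of shape
    (n, N_l = k * N_s) by assigning k copies of each label's column,
    one per latent."""
    # repeat_interleave keeps the k latents of a label adjacent:
    # columns [0..k-1] belong to label 0, [k..2k-1] to label 1, etc.
    return s.repeat_interleave(k, dim=1)
```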
We can differentiate 3 types of SCA (a combined sketch follows the list):
(a) SCA with pixels attending latents: $\text{SCA}(X, Z, S_k)$, where $Q$ is computed from $X$ and $K$, $V$ are computed from $Z$. The idea is to force the pixels from a semantic region to attend latents that are associated with the same label.
(b) SCA with latents attending pixels: $\text{SCA}(Z, X, S_k^\top)$, where $Q$ is computed from $Z$ and $K$, $V$ are computed from $X$. The idea is to semantically mask attention values to enforce latents to attend semantically corresponding pixels.
(c) SCA with latents attending themselves: $\text{SCA}(Z, Z, M_Z)$. We denote this mask $M_Z \in \{0,1\}^{N_l \times N_l}$, with $M_Z[i,j] = 1$ if the semantic label of latent $i$ is the same as the one of latent $j$, and $M_Z[i,j] = 0$ otherwise. The idea is to let the latents only attend latents that share the same semantic label.
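Putting the pieces together, the following usage sketch builds the three masks and calls the three SCA variants, reusing the `SemanticCrossAttention` module and `expand_semantic_mask` helper sketched above; all sizes are toy values chosen for illustration:

```python
import torch
import torch.nn.functional as F

n, c, d = 64, 32, 48           # pixels, channels, latent dim (toy sizes)
n_s, k = 4, 8                  # semantic labels, latents per label
n_l = k * n_s                  # total number of latents

x = torch.randn(n, c)                                     # feature map X
z = torch.randn(n_l, d)                                   # latents Z
s = F.one_hot(torch.randint(0, n_s, (n,)), n_s).float()   # per-pixel label mask S
s_k = expand_semantic_mask(s, k)                          # S_k, shape (n, n_l)

# latent-to-latent mask M_Z: 1 iff two latents share a semantic label
labels = torch.arange(n_s).repeat_interleave(k)           # label of each latent
m_z = (labels[:, None] == labels[None, :]).float()        # (n_l, n_l)

sca_px = SemanticCrossAttention(c, d, d)   # (a) pixels attend latents
sca_lt = SemanticCrossAttention(d, c, d)   # (b) latents attend pixels
sca_zz = SemanticCrossAttention(d, d, d)   # (c) latents attend themselves

out_a = sca_px(x, z, s_k)        # (n, d):   each pixel sees only its label's latents
out_b = sca_lt(z, x, s_k.t())    # (n_l, d): each latent sees only its label's pixels
out_c = sca_zz(z, z, m_z)        # (n_l, d): latents see only same-label latents
```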