What is: Spatially Separable Self-Attention?
Source | Twins: Revisiting the Design of Spatial Attention in Vision Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that aims to reduce the computational complexity of vision transformers for dense prediction tasks (given high-resolution inputs). SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
Formally, spatially separable self-attention (SSSA) can be written as:
$$\hat{\mathbf{z}}_{ij}^{l}=\mathrm{LSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}_{ij}^{l-1}\right)\right)+\mathbf{z}_{ij}^{l-1}$$
$$\mathbf{z}_{ij}^{l}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}_{ij}^{l}\right)\right)+\hat{\mathbf{z}}_{ij}^{l}$$
$$\hat{\mathbf{z}}^{l+1}=\mathrm{GSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l}$$
$$\mathbf{z}^{l+1}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1}$$
$$i \in \{1,2,\ldots,m\}, \quad j \in \{1,2,\ldots,n\}$$
where LSA denotes locally-grouped self-attention within a sub-window, and GSA denotes global sub-sampled attention, which interacts with the representative keys (generated by the sub-sampling function) from each sub-window. Both LSA and GSA use multiple heads, as in standard self-attention.
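Below is a minimal PyTorch sketch of one SSSA block following the four equations above: an LSA sub-block with its FFN, followed by a GSA sub-block with its FFN, each wrapped with LayerNorm and a residual connection. It assumes square k×k sub-windows and a strided convolution as the sub-sampling function; the names `SSSABlock`, `window_size`, and `sr_ratio` are illustrative choices, not the official Twins-SVT implementation.

```python
import torch
import torch.nn as nn


class SSSABlock(nn.Module):
    """Sketch of spatially separable self-attention: LSA + FFN, then GSA + FFN."""

    def __init__(self, dim, num_heads, window_size=7, sr_ratio=8, mlp_ratio=4):
        super().__init__()
        self.window_size = window_size
        # Locally-grouped self-attention (LSA) sub-block.
        self.norm1 = nn.LayerNorm(dim)
        self.lsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        # Global sub-sampled attention (GSA) sub-block: keys/values come from
        # a spatially sub-sampled copy of the feature map (strided conv here).
        self.norm3 = nn.LayerNorm(dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.gsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm4 = nn.LayerNorm(dim)
        self.ffn2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, H, W):
        # x: (B, H*W, C) tokens on an H x W grid; H, W divisible by window_size and sr_ratio.
        B, N, C = x.shape
        k = self.window_size

        # LSA: self-attention restricted to non-overlapping k x k sub-windows.
        z = self.norm1(x).view(B, H // k, k, W // k, k, C)
        z = z.permute(0, 1, 3, 2, 4, 5).reshape(B * (H // k) * (W // k), k * k, C)
        z, _ = self.lsa(z, z, z)
        z = z.reshape(B, H // k, W // k, k, k, C).permute(0, 1, 3, 2, 4, 5)
        x = x + z.reshape(B, N, C)
        x = x + self.ffn1(self.norm2(x))

        # GSA: every token attends to the sub-sampled "representative" tokens.
        q = self.norm3(x)
        kv = q.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, H*W / sr_ratio^2, C)
        z, _ = self.gsa(q, kv, kv)
        x = x + z
        x = x + self.ffn2(self.norm4(x))
        return x


# Usage: 56x56 feature map, 64 channels, 7x7 sub-windows.
block = SSSABlock(dim=64, num_heads=2, window_size=7, sr_ratio=8)
tokens = torch.randn(1, 56 * 56, 64)
out = block(tokens, H=56, W=56)
print(out.shape)  # torch.Size([1, 3136, 64])
```

Because LSA only attends within each k×k sub-window and GSA only attends to the sub-sampled key set, the quadratic cost of full self-attention over all H×W tokens is avoided, which is what makes the block practical for high-resolution dense prediction inputs.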