What is: Global Sub-Sampled Attention?
Source | Twins: Revisiting the Design of Spatial Attention in Vision Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Global Sub-Sampled Attention, or GSA, is a global attention mechanism used in the Twins-SVT architecture, where it is interleaved with locally-grouped self-attention (LSA).
A single representative is used to summarize the key information for each of the $m \times n$ sub-windows (each sub-window being of size $k_1 \times k_2$, so that $m = H/k_1$ and $n = W/k_2$ for an $H \times W$ feature map with $d$ dimensions), and the representative is used to communicate with other sub-windows (serving as the key in self-attention), which can reduce the cost to $\mathcal{O}(mnHWd) = \mathcal{O}\!\left(\tfrac{H^2W^2d}{k_1 k_2}\right)$. This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations, and thus it is termed global sub-sampled attention (GSA).
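Below is a minimal PyTorch-style sketch of this idea, assuming the sub-window summaries are produced by a strided convolution (one representative per $k_1 \times k_2$ window) that feeds the keys and values, while the queries come from the full-resolution map. The class and argument names (`GlobalSubSampledAttention`, `sr_ratio`, etc.) are illustrative and not the official Twins implementation:

```python
import torch
import torch.nn as nn


class GlobalSubSampledAttention(nn.Module):
    """Sketch of GSA: queries come from the full feature map, while keys and
    values come from a sub-sampled (summarized) map with one token per
    k1 x k2 sub-window. Assumes H and W are divisible by sr_ratio."""

    def __init__(self, dim, num_heads=8, sr_ratio=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # One "representative" per sub-window, implemented here as a strided
        # convolution with kernel = stride = sr_ratio (i.e. k1 = k2 = sr_ratio).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        # Sub-sample the feature map to obtain the representatives.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, HW / (k1*k2), C)
        x_ = self.norm(x_)

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                                  # each (B, heads, HW/(k1*k2), C/heads)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, N, HW/(k1*k2))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    gsa = GlobalSubSampledAttention(dim=64, num_heads=8, sr_ratio=7)
    x = torch.randn(2, 56 * 56, 64)       # a 56 x 56 feature map, 64 channels
    out = gsa(x, H=56, W=56)              # -> (2, 3136, 64)
```

With `sr_ratio = 7`, a $56 \times 56$ feature map is summarized into $8 \times 8 = 64$ keys, so each of the 3136 query tokens attends to 64 summaries instead of all 3136 tokens.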
If LSA and GSA are used alternately, similar to separable convolutions (depth-wise + point-wise), the total computation cost is $\mathcal{O}\!\left(\tfrac{H^2W^2d}{k_1 k_2} + k_1 k_2 HWd\right)$. We have $\tfrac{H^2W^2d}{k_1 k_2} + k_1 k_2 HWd \ge 2HWd\sqrt{HW}$.
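The bound is the AM-GM inequality applied to the two cost terms; written out:

$$
\frac{H^2W^2d}{k_1 k_2} + k_1 k_2 HWd \;\ge\; 2\sqrt{\frac{H^2W^2d}{k_1 k_2} \cdot k_1 k_2 HWd} \;=\; 2HWd\sqrt{HW},
$$

with equality exactly when $\tfrac{H^2W^2d}{k_1 k_2} = k_1 k_2 HWd$, i.e. $(k_1 k_2)^2 = HW$, which gives the condition quoted next.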
The minimum is obtained when $k_1 k_2 = \sqrt{HW}$. Note that $H = W = 224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k_1 = k_2$. Therefore, $k_1 = k_2 = 15$ is close to the global minimum for $H = W = 224$. However, the network is designed to include several stages with variable resolutions. Stage 1 has feature maps of $56 \times 56$, for which the minimum is obtained when $k_1 = k_2 = \sqrt{56} \approx 7.48$. Theoretically, the optimal $k_1$ and $k_2$ could be calibrated for each of the stages; for simplicity, $k_1 = k_2 = 7$ is used everywhere. As for stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys. Specifically, sizes of 4, 2 and 1 are used for the last three stages, respectively.
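As a quick numeric check, the snippet below prints the theoretically optimal square sub-window side $k = (HW)^{1/4}$ for each resolution. Only the $224 \times 224$ input and the $56 \times 56$ first stage are quoted above; the later resolutions (28, 14, 7) are assumed from the usual 2x downsampling between stages.

```python
# Optimal square sub-window side k (k1 = k2 = k) from k1 * k2 = sqrt(H * W),
# i.e. k = (H * W) ** 0.25.
for side in (224, 56, 28, 14, 7):
    k_opt = (side * side) ** 0.25
    print(f"H = W = {side:3d}  ->  optimal k ~ {k_opt:.2f}")
```

This recovers the value of about 15 for $224 \times 224$ and about 7.48 for the $56 \times 56$ first stage, so the fixed $k_1 = k_2 = 7$ is a close and simpler compromise.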