**EsViT** proposes two techniques for developing efficient self-supervised vision transformers for visual representation leaning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. The new pretraining task allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations.

**Grid R-CNN** is an object detection framework, where the traditional regression
formulation is replaced by a grid point guided localization mechanism.

Grid R-CNN divides the object bounding box region into grids and employs a fully convolutional network ([FCN](https://paperswithcode.com/method/fcn)) to predict the locations of grid points. Owing to the position sensitive property of fully convolutional architecture, Grid R-CNN maintains the explicit spatial information and grid points locations can be obtained in pixel level. When a certain number of grid points at specified location are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine more accurate object bounding box than regression method which lacks the guidance of explicit spatial information.

Grid R-CNN

EsViT

Efficient Self-supervised Vision Transformers for Representation Learning

The **Re-Attention Module** is an attention layer used in the [DeepViT](https://paperswithcode.com/method/deepvit) architecture which mixes the attention map with a learnable matrix before multiplying with the values. The motivation is to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The authors note that traditional self-attention fails to learn effective concepts for representation learning in deeper layers of ViT -- attention maps become more similar and less diverse in deeper layers (attention collapse) - and this hinders the model from getting expected performance gain. Re-attention is implemented by:

$$
\operatorname{Re}-\operatorname{Attention}(Q, K, V)=\operatorname{Norm}\left(\Theta^{\top}\left(\operatorname{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right)\right)\right) V
$$

where transformation matrix $\Theta$ is multiplied to the self-attention map $\textbf{A}$ along the head dimension.

Source	Efficient Self-supervised Vision Transformers for Representation Learning
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com