What is: EsViT?
Source | Efficient Self-supervised Vision Transformers for Representation Learning |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
EsViT proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. The new pre-training task allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned visual representations.
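
The region-matching idea can be illustrated with a small NumPy sketch. This is not the authors' implementation; it assumes a DINO-style student/teacher setup in which each region (patch) of one augmented view is paired with its most similar region in the other view by cosine similarity, and the student's prediction for that region is trained to agree with the teacher's prediction for the matched region. The function name `region_matching_loss`, the argument names, and the temperature values are hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(x, temperature):
    """Numerically stable log-softmax over the last axis."""
    x = x / temperature
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def region_matching_loss(student_feats, teacher_feats,
                         student_logits, teacher_logits,
                         student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between matched region predictions (illustrative only).

    student_feats:  (Ns, D) region features from view A (student network)
    teacher_feats:  (Nt, D) region features from view B (teacher network)
    student_logits: (Ns, K) projection-head outputs for the student regions
    teacher_logits: (Nt, K) projection-head outputs for the teacher regions
    Temperatures are assumed values, not taken from the source.
    """
    # Cosine similarity between every student region and every teacher region.
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    similarity = s @ t.T                      # shape (Ns, Nt)

    # For each student region, pick the best-matching teacher region.
    match = similarity.argmax(axis=1)         # shape (Ns,)

    # Teacher distribution of the matched region vs. student log-probabilities.
    p_teacher = np.exp(log_softmax(teacher_logits[match], teacher_temp))
    log_p_student = log_softmax(student_logits, student_temp)

    loss = -(p_teacher * log_p_student).sum(axis=-1).mean()
    return float(loss), match


# Toy usage: 5 regions in view A, 7 regions in view B, 8-dim features, 16 outputs.
rng = np.random.default_rng(0)
feats_a, feats_b = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
logits_a, logits_b = rng.normal(size=(5, 16)), rng.normal(size=(7, 16))
loss, match = region_matching_loss(feats_a, feats_b, logits_a, logits_b)
print("loss:", loss, "matched teacher regions:", match)
```

In a real training loop the teacher outputs would be treated as constants (no gradient), and a region-level term of this kind would typically be combined with a view-level objective; both of those details are assumptions here rather than statements from the source.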