What is: EsViT?

Source: Efficient Self-supervised Vision Transformers for Representation Learning
Data Source: CC BY-SA

EsViT proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. The new pre-training task allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations.
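
To make the region-matching idea more concrete, here is a minimal sketch in plain Python. The description above does not say how regions are paired or scored, so the details are assumptions made for illustration: each region of one augmented view (the "student" side) is paired with its most similar region in another view (the "teacher" side) by cosine similarity, and the loss is the cross-entropy between the two regions' softmax outputs. The function names (`match_regions`, `region_matching_loss`) and the toy inputs are hypothetical and are not taken from the EsViT code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_regions(student_feats, teacher_feats):
    """For each student region, return the index of the most similar teacher region.

    Assumption: similarity is measured with cosine similarity on region features.
    """
    return [
        max(range(len(teacher_feats)), key=lambda j: cosine(s, teacher_feats[j]))
        for s in student_feats
    ]


def region_matching_loss(student_feats, teacher_feats, student_logits, teacher_logits):
    """Mean cross-entropy between each student region and its matched teacher region.

    Assumption: the teacher's softmax output is the target distribution.
    """
    matches = match_regions(student_feats, teacher_feats)
    loss = 0.0
    for i, j in enumerate(matches):
        p_teacher = softmax(teacher_logits[j])
        p_student = softmax(student_logits[i])
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return loss / len(matches)


# Toy example: two regions per view, 2-dimensional features (hypothetical values).
student_feats = [[1.0, 0.0], [0.0, 1.0]]
teacher_feats = [[0.1, 0.9], [0.9, 0.1]]
matches = match_regions(student_feats, teacher_feats)  # [1, 0]
```

In this toy case the regions appear in swapped order across the two views, and the matching step recovers the correspondence without any position information. That is the kind of fine-grained region dependency the pre-training task is meant to capture.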