What is: MoCo v3?
Source | An Empirical Study of Training Self-Supervised Vision Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
MoCo v3 aims to stabilize the training of self-supervised Vision Transformers (ViTs). It is an incremental improvement over MoCo v1/2. Two crops of each image are taken under random data augmentation. They are encoded by two encoders, $f_q$ and $f_k$, with output vectors $q$ and $k$. Intuitively, $q$ behaves like a "query", and the goal of learning is to retrieve the corresponding "key". The objective is to minimize a contrastive loss function of the following form:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k^{+}/\tau)}{\exp(q \cdot k^{+}/\tau) + \sum_{k^{-}} \exp(q \cdot k^{-}/\tau)}$$

Here $k^{+}$ is $f_k$'s output on the same image as $q$ (the positive key), $k^{-}$ denotes $f_k$'s outputs on other images (negative keys), and $\tau$ is a temperature hyper-parameter.
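This contrastive objective is the standard InfoNCE loss; a minimal NumPy sketch (batch-internal negatives, as in MoCo v3's in-batch formulation, with the temperature `tau` as an assumed hyper-parameter value) is:

```python
import numpy as np

def info_nce_loss(q, k, tau=0.2):
    """InfoNCE contrastive loss: row i of q matches row i of k (positive);
    all other rows of k act as negatives."""
    # L2-normalize so the dot product is a cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                          # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # positives sit on the diagonal -> cross-entropy with labels 0..N-1
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(q)), np.arange(len(q))].mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
print(info_nce_loss(q, q))   # identical views -> small loss
```

With perfectly aligned query/key pairs the diagonal logits dominate and the loss approaches zero; mismatched pairs drive it up, which is what the encoders are trained to avoid.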
This approach trains the Transformer in the contrastive/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., ResNet or ViT), a projection head, and an extra prediction head; the encoder $f_k$ has the backbone and the projection head, but not the prediction head. $f_k$ is updated by a moving average of $f_q$, excluding the prediction head.