
What is: MoCo v3?

Source: An Empirical Study of Training Self-Supervised Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

MoCo v3 aims to stabilize the training of self-supervised Vision Transformers (ViTs) and is an incremental improvement over MoCo v1/v2. Two crops of each image are taken under random data augmentation and encoded by two encoders, $f_q$ and $f_k$, producing output vectors $q$ and $k$. Intuitively, $q$ behaves like a "query", and the goal of learning is to retrieve the corresponding "key". The objective is to minimize a contrastive loss function of the following form:

$$\mathcal{L}_q = -\log \frac{\exp\left(q \cdot k^{+} / \tau\right)}{\exp\left(q \cdot k^{+} / \tau\right) + \sum_{k^{-}} \exp\left(q \cdot k^{-} / \tau\right)}$$
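In code, this is the standard InfoNCE formulation: a cross-entropy over similarity scores, where the positive pair sits on the diagonal. Below is a minimal PyTorch sketch, assuming the negatives $k^{-}$ come from other images in the same batch and that the temperature value shown is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, k: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives (a sketch, not the paper's exact code).

    q: (N, D) query vectors from f_q.
    k: (N, D) key vectors from f_k; k[i] is the positive k^+ for q[i],
       and the remaining rows serve as the negatives k^-.
    """
    q = F.normalize(q, dim=1)  # use cosine similarity via plain dot products
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau   # (N, N) similarity matrix, scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

Since both crops can serve as queries in turn, the loss is typically symmetrized in practice, e.g. `contrastive_loss(q1, k2) + contrastive_loss(q2, k1)`.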

This approach trains the Transformer in the contrastive/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., a ResNet or ViT), a projection head, and an extra prediction head. The encoder $f_k$ has the backbone and projection head, but not the prediction head. $f_k$ is updated as a moving average of $f_q$, excluding the prediction head.
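The moving-average update is a simple exponential moving average (EMA) over the shared parameters. Here is a sketch assuming hypothetical module names: `base_encoder` stands for the backbone plus projection head (the structure shared by $f_q$ and $f_k$), `predictor` for the extra prediction head that only $f_q$ has, and the momentum coefficient `m` is illustrative:

```python
import copy
import torch
import torch.nn as nn

# Hypothetical stand-ins: `base_encoder` is the backbone + projection head;
# `predictor` is the extra head that only f_q has.
base_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

momentum_encoder = copy.deepcopy(base_encoder)  # f_k starts as a copy of f_q
for p in momentum_encoder.parameters():
    p.requires_grad = False  # f_k receives no gradients, only EMA updates

@torch.no_grad()
def momentum_update(m: float = 0.99):
    # EMA over the backbone + projection parameters only; `predictor`
    # is excluded simply because f_k has no prediction head.
    for p_q, p_k in zip(base_encoder.parameters(), momentum_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)
```

Because `momentum_encoder` is built by copying `base_encoder` alone, the prediction head's parameters never enter the EMA, matching the description above.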