
What is: MoBY?

Source: Self-Supervised Learning with Swin Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

MoBY is a self-supervised learning approach for Vision Transformers. The approach is essentially a combination of MoCo v2 and BYOL: it inherits the momentum design, the key queue, and the contrastive loss from MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and momentum scheduler from BYOL. The name MoBY is formed from the first two letters of each method.

MoBY uses two encoders: an online encoder and a target encoder. Both encoders consist of a backbone and a projector head (a 2-layer MLP), and the online encoder adds a prediction head (another 2-layer MLP), which makes the two encoders asymmetric. The online encoder is updated by gradients, while the target encoder is a moving average of the online encoder, updated by momentum at each training iteration. A gradually increasing momentum strategy is applied to the target encoder: the momentum value is gradually increased to 1 during the course of training, with a default starting value of 0.99.
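As a concrete illustration, here is a minimal PyTorch-style sketch of the momentum (EMA) update and an increasing momentum schedule. The function names are placeholders, and the cosine form of the schedule is an assumption borrowed from BYOL, from which MoBY inherits its momentum scheduler; this is not the authors' released code.

```python
import math

import torch


@torch.no_grad()
def update_target_encoder(online_encoder, target_encoder, momentum):
    """EMA update: target = momentum * target + (1 - momentum) * online."""
    for p_online, p_target in zip(online_encoder.parameters(),
                                  target_encoder.parameters()):
        p_target.data.mul_(momentum).add_(p_online.data, alpha=1.0 - momentum)


def momentum_schedule(step, total_steps, base_momentum=0.99):
    """Raise the momentum from its starting value (0.99 by default) toward 1.0
    over the course of training, following a BYOL-style cosine rule."""
    return 1.0 - (1.0 - base_momentum) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```

At step 0 the schedule returns the starting value 0.99, and it reaches 1.0 at the end of training, matching the "gradually increased to 1" behaviour described above.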

A contrastive loss is applied to learn the representations. Specifically, for an online view $q$, its contrastive loss is computed as

$$\mathcal{L}_{q} = -\log \frac{\exp\left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp\left(q \cdot k_{i} / \tau\right)}$$

where $k_{+}$ is the target feature for the other view of the same image; $k_{i}$ is a target feature in the key queue; $\tau$ is a temperature term; and $K$ is the size of the key queue (4096 by default).
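This loss is a cross-entropy over one positive and $K$ queued negatives. Below is a minimal PyTorch sketch, assuming L2-normalized feature tensors `q` (online predictions), `k_pos` (target features of the other view), and `queue` (the key queue); the function name and the default temperature value are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q, k_pos, queue, tau=0.2):
    """InfoNCE-style loss. q: [N, C] online features, k_pos: [N, C] target
    features of the other view, queue: [K, C] queued target features.
    All inputs are assumed to be L2-normalized."""
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)   # [N, 1]
    l_neg = torch.einsum('nc,kc->nk', q, queue)                # [N, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau            # [N, 1 + K]
    # The positive key sits at index 0 of every row.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```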

In training, like most Transformer-based methods, the AdamW optimizer is used, in contrast to previous self-supervised learning approaches built on ResNet backbones, where SGD or LARS [4, 8, 19] is usually used. The authors also use an asymmetric drop path regularization, which proves important for the final performance.
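Drop path (stochastic depth) randomly skips residual branches per sample during training. A standard implementation is sketched below; the asymmetry in MoBY amounts to using different drop path rates for the online and target backbones (for example, a higher rate on the online side), and how that is wired up here is an assumption for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class DropPath(nn.Module):
    """Stochastic depth: randomly drops the residual branch per sample."""

    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli keep/drop decision per sample, broadcast over the rest.
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = x.new_empty(shape).bernoulli_(keep_prob)
        return x * mask / keep_prob
```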

In the experiments, the authors adopt a fixed learning rate of 0.001 and a fixed weight decay of 0.05, which performs stably well. The tuned hyper-parameters are the key queue size $K$, the starting momentum value of the target branch, the temperature $\tau$, and the drop path rates.
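For completeness, a short sketch of the reported optimizer settings; the `online_encoder` module here is a dummy placeholder for whatever backbone, projector, and predictor a given setup uses.

```python
import torch
import torch.nn as nn

# Placeholder for the gradient-updated branch (backbone + projector + predictor).
online_encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256))

# Settings reported in the text: AdamW with a fixed learning rate of 0.001
# and a fixed weight decay of 0.05; the key queue size K defaults to 4096.
K = 4096
optimizer = torch.optim.AdamW(online_encoder.parameters(), lr=1e-3, weight_decay=0.05)
```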