Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained on aligned image-text pairs, that performs targeted detection from human-understandable natural-language text queries. It uses multi-scale image features and deformable convolutions with late multi-modal fusion. The authors show that MAVL works well as a class-agnostic object detector when queried with general natural-language commands such as "all objects" or "all entities".

**Shake-Shake Regularization** aims to improve the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination. A typical pre-activation [ResNet](https://paperswithcode.com/method/resnet) with two residual branches follows this equation:

$$x\_{i+1} = x\_{i} + \mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(1\right)}\right) + \mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(2\right)}\right) $$

During training, Shake-Shake regularization introduces a random variable $\alpha\_{i}$ drawn from a uniform distribution between 0 and 1, which weights the two branches:

$$x\_{i+1} = x\_{i} + \alpha\_{i}\mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(1\right)}\right) + \left(1-\alpha\_{i}\right)\mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(2\right)}\right) $$

Following the same logic as for [dropout](https://paperswithcode.com/method/dropout), all $\alpha\_{i}$ are set to the expected value of $0.5$ at test time.
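
The forward-pass rule above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `shake_shake_combine` and its arguments are hypothetical, the two branch outputs are assumed to be precomputed arrays of the same shape as the input, and a single scalar $\alpha\_{i}$ is assumed to be drawn per call. Only the forward combination described in this section is covered.

```python
import numpy as np


def shake_shake_combine(x, branch1, branch2, training, rng=None):
    """Combine two residual-branch outputs with the skip connection x.

    Hypothetical helper for illustration only.
    x, branch1, branch2: arrays of identical shape.
    training: if True, draw alpha ~ U(0, 1); otherwise use alpha = 0.5.
    """
    if training:
        if rng is None:
            rng = np.random.default_rng()
        # alpha_i is redrawn on every call during training
        alpha = rng.uniform(0.0, 1.0)
    else:
        # test time: use the expected value of alpha_i
        alpha = 0.5
    return x + alpha * branch1 + (1.0 - alpha) * branch2


# Toy stand-ins for x_i and the two branch outputs F(x_i, W_i^(1)), F(x_i, W_i^(2))
x = np.zeros(3)
f1 = np.ones(3)
f2 = 3.0 * np.ones(3)

train_out = shake_shake_combine(x, f1, f2, training=True, rng=np.random.default_rng(0))
eval_out = shake_shake_combine(x, f1, f2, training=False)
```

Because the combination is affine (the weights sum to 1), the training output always lies between the two branch outputs added to the skip connection, and the test-time output is their plain average.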

**gCANS**, or **Global Coupled Adaptive Number of Shots**, is a stochastic gradient descent optimizer for variational quantum algorithms. At each iteration it adaptively allocates measurement shots to each gradient component. Its allocation criterion incorporates information about the overall scale of the shot cost for the iteration.

Source	Class-agnostic Object Detection with Multi-modal Transformer
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Multiscale Attention ViT with Late fusion?
