**Multiscale Vision Transformer**, or **MViT**, is a [transformer](https://paperswithcode.com/method/transformer) architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features: early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate on spatially coarse but complex, high-dimensional features.
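
The channel-resolution trade-off across stages can be illustrated with a small schedule calculation. This is only a sketch, not the MViT implementation: the function name `multiscale_stages`, the doubling of channels, the halving of each spatial dimension per stage, and the example input sizes are all assumptions chosen for illustration.

```python
def multiscale_stages(height, width, channels, num_stages,
                      channel_mult=2, spatial_stride=2):
    """Return the (height, width, channels) seen by each scale stage.

    Hypothetical helper: the multiplier and stride values are assumed,
    not taken from the source description.
    """
    stages = []
    for stage in range(1, num_stages + 1):
        stages.append({"stage": stage, "height": height,
                       "width": width, "channels": channels})
        # Moving to the next stage: coarser spatial grid, wider channels.
        height = max(1, height // spatial_stride)
        width = max(1, width // spatial_stride)
        channels *= channel_mult
    return stages


if __name__ == "__main__":
    for s in multiscale_stages(56, 56, 96, num_stages=4):
        print(s)
```

Under these assumed settings, the first stage works on a fine grid with few channels and the last stage on a coarse grid with many channels, which is the pyramid shape described above.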

**RMS Pooling** is a pooling operation that calculates the root mean square of each patch of a feature map and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer.

$$ z_{j} = \sqrt{\frac{1}{M}\sum^{M}_{i=1}u_{ij}^{2}} $$
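
As a rough illustration of the formula, the sketch below pools a 2D feature map over non-overlapping square patches. It reads $u_{ij}$ as the $i$-th value in patch $j$ and $M$ as the number of values per patch, which is an interpretation rather than something the text states. The function name `rms_pool2d`, the `pool_size` parameter, and the choice of non-overlapping patches are assumptions.

```python
import math


def rms_pool2d(feature_map, pool_size=2):
    """Root-mean-square pooling over non-overlapping pool_size x pool_size patches.

    feature_map is a list of equal-length rows of numbers. Rows and columns
    that do not fill a complete patch are dropped (an assumed convention).
    """
    rows = len(feature_map) // pool_size
    cols = len(feature_map[0]) // pool_size
    pooled = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect the M = pool_size**2 values u_ij of patch j.
            patch = [feature_map[r * pool_size + i][c * pool_size + j]
                     for i in range(pool_size)
                     for j in range(pool_size)]
            # z_j = sqrt((1/M) * sum_i u_ij^2)
            out_row.append(math.sqrt(sum(u * u for u in patch) / len(patch)))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    x = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 4, 0, 0],
         [3, 4, 0, 0]]
    print(rms_pool2d(x))
```

Because the values are squared before averaging, large-magnitude activations of either sign dominate the pooled output more than they would under average pooling.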

Source	Multiscale Vision Transformers
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com

What is: Multiscale Vision Transformer?
