**ConViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses gated positional self-attention ([GPSA](https://paperswithcode.com/method/gpsa)), a form of positional self-attention that can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers; each attention head can then escape locality by adjusting a gating parameter that regulates how much attention is paid to position versus content information.
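
The gating idea can be illustrated with a small sketch. It assumes one head blends a content-based attention map and a position-based attention map through a sigmoid of a scalar gating parameter. The function names, the 1-D patch layout, and the distance-based positional scores are illustrative assumptions, not the reference implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_position_scores(n, strength):
    # Hypothetical 1-D stand-in for positional scores: the score falls off
    # with the squared distance between patch indices, so nearby patches win.
    return [[-strength * (i - j) ** 2 for j in range(n)] for i in range(n)]

def gated_attention(content_scores, position_scores, gate_param):
    # Blend the two attention maps for a single head.
    # gate near 1 -> mostly positional (local); gate near 0 -> mostly content.
    gate = sigmoid(gate_param)
    blended = []
    for c_row, p_row in zip(content_scores, position_scores):
        c = softmax(c_row)
        p = softmax(p_row)
        blended.append([(1.0 - gate) * ci + gate * pi for ci, pi in zip(c, p)])
    return blended

# Toy usage: 5 patches, uninformative content scores.
n = 5
content = [[0.0] * n for _ in range(n)]
position = local_position_scores(n, strength=2.0)
local_head = gated_attention(content, position, gate_param=10.0)   # stays local
free_head = gated_attention(content, position, gate_param=-10.0)   # ignores position
```

With a large positive gating parameter the head attends mostly to neighbouring patches, which is the convolution-like starting point; as the parameter decreases, the content term takes over.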

**VQ-VAE-2** is a type of variational autoencoder that combines a two-level hierarchical VQ-[VAE](https://paperswithcode.com/method/vae) with a self-attention autoregressive model ([PixelCNN](https://paperswithcode.com/method/pixelcnn)) as a prior. The encoder and decoder architectures are kept simple and light-weight, as in the original [VQ-VAE](https://paperswithcode.com/method/vq-vae); the only difference is that hierarchical multi-scale latent maps are used for increased resolution.
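
As a rough illustration of the quantization step at two scales, the sketch below maps each latent vector to the index of its nearest codebook entry, once for a coarse (top) map and once for a finer (bottom) map. The codebooks, latent values, and function names are made-up toy assumptions; the encoder, decoder, and autoregressive prior are omitted entirely.

```python
def nearest_code(vector, codebook):
    # Return the index of the codebook entry closest to `vector`
    # under squared Euclidean distance.
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def quantize(latents, codebook):
    # Quantize every position of a (flattened) latent map.
    return [nearest_code(v, codebook) for v in latents]

# Toy two-level setup: the top map has fewer positions than the bottom map.
top_codebook = [[0.0, 0.0], [1.0, 1.0]]
bottom_codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

top_latents = [[0.1, -0.2], [0.9, 1.2]]
bottom_latents = [[0.1, 0.9], [0.8, 0.1], [0.4, 0.6], [1.1, -0.1]]

top_codes = quantize(top_latents, top_codebook)
bottom_codes = quantize(bottom_latents, bottom_codebook)
```

The resulting index maps are the discrete codes that a prior such as the one described above would model.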

VQ-VAE-2

Generating Diverse High-Fidelity Images with VQ-VAE-2

ConViT

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

**DeLighT** is a [transformer](https://paperswithcode.com/method/transformer) architecture that improves parameter efficiency in two ways: (1) within each Transformer block, it uses [DExTra](https://paperswithcode.com/method/dextra), a deep and light-weight transformation, which allows for [single-headed attention](https://paperswithcode.com/method/single-headed-attention) and bottleneck FFN layers; and (2) across blocks, it uses block-wise scaling, which allows for shallower and narrower [DeLighT blocks](https://paperswithcode.com/method/delight-block) near the input and wider and deeper DeLighT blocks near the output.
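
Block-wise scaling can be sketched as a per-block schedule of depth and width. The version below assumes a simple linear interpolation from the block nearest the input to the block nearest the output; the linear rule, the function names, and the example sizes are assumptions chosen for illustration, not values taken from the source.

```python
def linear_schedule(num_blocks, start, end):
    # Interpolate a value linearly from `start` (first block, near the input)
    # to `end` (last block, near the output).
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [float(start)]
    step = (end - start) / (num_blocks - 1)
    return [start + step * b for b in range(num_blocks)]

def blockwise_config(num_blocks, min_depth, max_depth, min_width, max_width):
    # Return one (depth, width) pair per block, growing toward the output.
    depths = [round(d) for d in linear_schedule(num_blocks, min_depth, max_depth)]
    widths = [round(w) for w in linear_schedule(num_blocks, min_width, max_width)]
    return list(zip(depths, widths))

# Hypothetical sizes: 4 blocks, depth 4 to 8, width 128 to 256.
config = blockwise_config(4, 4, 8, 128, 256)
```

Each pair in `config` would size one block, so early blocks stay shallow and narrow while later blocks get more capacity.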

Source	ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: ConViT?
