What is: Visformer?
Source | Visformer: The Vision-friendly Transformer |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Visformer, or Vision-friendly Transformer, is an architecture that combines Transformer-based design elements with those of convolutional neural networks. Visformer adopts a stage-wise design for higher base performance, but self-attention is only used in the last two stages, since self-attention in the high-resolution stage is relatively inefficient even when FLOPs are balanced. In the first stage, Visformer employs bottleneck blocks that use grouped 3 × 3 convolutions, inspired by ResNeXt. It also introduces BatchNorm into the patch embedding modules, as in CNNs.
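
A minimal PyTorch sketch of the stage layout described above is shown below. Module names, channel widths, block counts, and patch sizes are illustrative assumptions, not the authors' reference implementation; the point is only to show convolutional bottlenecks in the first (high-resolution) stage, self-attention in the last two stages, and BatchNorm after the patch embedding.

```python
# Illustrative sketch of the Visformer-style stage layout (not the official code).
import torch
import torch.nn as nn


class GroupedBottleneck(nn.Module):
    """Bottleneck block with a grouped 3x3 convolution (ResNeXt-style)."""
    def __init__(self, dim, groups=8, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection


class AttentionBlock(nn.Module):
    """Self-attention block applied to the flattened feature map."""
    def __init__(self, dim, heads=6):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1),
            nn.GELU(),
            nn.Conv2d(dim * 4, dim, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.norm1(x).flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        y, _ = self.attn(y, y, y)
        x = x + y.transpose(1, 2).reshape(b, c, h, w)  # residual attention
        x = x + self.mlp(self.norm2(x))                # residual MLP
        return x


class PatchEmbed(nn.Module):
    """Strided-conv patch embedding followed by BatchNorm, as in CNNs."""
    def __init__(self, in_ch, out_ch, patch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=patch, stride=patch)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.norm(self.proj(x))


class VisformerSketch(nn.Module):
    """Three stages: conv bottlenecks first, self-attention in the last two."""
    def __init__(self, num_classes=1000, dims=(96, 192, 384), depths=(2, 2, 2)):
        super().__init__()
        self.stem = PatchEmbed(3, dims[0], patch=8)    # high-resolution stage input
        self.stage1 = nn.Sequential(*[GroupedBottleneck(dims[0]) for _ in range(depths[0])])
        self.down1 = PatchEmbed(dims[0], dims[1], patch=2)
        self.stage2 = nn.Sequential(*[AttentionBlock(dims[1]) for _ in range(depths[1])])
        self.down2 = PatchEmbed(dims[1], dims[2], patch=2)
        self.stage3 = nn.Sequential(*[AttentionBlock(dims[2]) for _ in range(depths[2])])
        self.head = nn.Linear(dims[2], num_classes)

    def forward(self, x):
        x = self.stage1(self.stem(x))
        x = self.stage2(self.down1(x))
        x = self.stage3(self.down2(x))
        return self.head(x.mean(dim=(2, 3)))           # global average pooling


if __name__ == "__main__":
    model = VisformerSketch(num_classes=10)
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 10])
```

Keeping self-attention out of the first stage means attention only ever runs on the downsampled 14 × 14 and 7 × 7 feature maps in this sketch, which is where the quadratic cost of attention is affordable.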