What is: ConViT?
Source | ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
ConViT is a type of vision transformer that uses a gated positional self-attention module (GPSA), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.
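The gating described above can be sketched for a single query patch and a single attention head. This is a minimal illustration, not the authors' implementation: it assumes the gate is a scalar passed through a sigmoid and used to blend two softmax-normalized attention rows, one from content scores and one from position scores. The function names, the 1-D patch layout, the squared-distance locality prior, and all numeric values are hypothetical choices made for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_row(content_scores, position_scores, gate):
    """Blend content and positional attention for one query and one head.

    `gate` stands in for the learnable gating parameter (assumed scalar);
    sigmoid(gate) is the weight given to positional attention.
    """
    g = sigmoid(gate)
    content = softmax(content_scores)
    position = softmax(position_scores)
    return [(1.0 - g) * c + g * p for c, p in zip(content, position)]


def local_position_scores(query_index, num_patches, locality):
    """Hypothetical 1-D locality prior: scores decay with squared distance."""
    return [-locality * (j - query_index) ** 2 for j in range(num_patches)]


# Made-up content logits for a query at index 1 over five patches.
content_scores = [0.1, 0.2, 0.0, 0.3, 2.0]
position_scores = local_position_scores(query_index=1, num_patches=5, locality=1.0)

# A large positive gate keeps the head local, like a convolution.
local_row = gated_attention_row(content_scores, position_scores, gate=4.0)

# A large negative gate lets the head escape locality and follow content.
content_row = gated_attention_row(content_scores, position_scores, gate=-4.0)
```

With the gate at 4.0 the attention row peaks at the query's own position, while at -4.0 it peaks at the patch with the highest content score. In a trained model the gate would be learned per head rather than set by hand.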