**Dense Prediction Transformers** (DPT) are a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) for dense prediction tasks.

The input image is transformed into tokens (orange) either by extracting non-overlapping patches and linearly projecting their flattened representations (DPT-Base and DPT-Large) or by applying a [ResNet](https://paperswithcode.com/method/resnet)-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token (red) is added. The tokens are passed through multiple [transformer](https://paperswithcode.com/method/transformer) stages, and tokens from different stages are then reassembled into image-like representations at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample these representations to generate a fine-grained prediction.
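The patch-based tokenization used by DPT-Base and DPT-Large can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patch_tokens`, the toy sizes, the random weights, and the choice to prepend the readout token before adding the positional embedding are all assumptions made for the example.

```python
import numpy as np

def patch_tokens(image, patch, w_proj, pos_emb, readout):
    """Turn an (H, W, C) image into a (1 + N, D) token sequence (hypothetical helper)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch
    # Cut the image into non-overlapping patches and flatten each: (N, patch*patch*C).
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(gh * gw, patch * patch * c)
    # Linear projection of the flattened patches: (N, D).
    tokens = flat @ w_proj
    # Assumption: the readout token is prepended, then positions are added to all tokens.
    tokens = np.concatenate([readout[None, :], tokens], axis=0)
    return tokens + pos_emb

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
H = W = 32
P, C, D = 16, 3, 8
n_patches = (H // P) * (W // P)

image = rng.normal(size=(H, W, C))
w_proj = rng.normal(size=(P * P * C, D))
pos_emb = rng.normal(size=(1 + n_patches, D))
readout = rng.normal(size=(D,))

tokens = patch_tokens(image, P, w_proj, pos_emb, readout)
print(tokens.shape)  # (5, 8): one readout token plus four patch tokens
```

In a trained model the projection, positional embedding, and readout token are learned parameters; here they are random placeholders so the shapes can be checked.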

A **Convolutional Gated Recurrent Unit** is a type of [GRU](https://paperswithcode.com/method/gru) that combines GRUs with the [convolution](https://paperswithcode.com/method/convolution) operation. The update rule for input $x\_{t}$ and the previous output $h\_{t-1}$ is given by the following:

$$ r = \sigma\left(W\_{r} \star\_{n}\left[h\_{t-1};x\_{t}\right] + b\_{r}\right) $$

$$ u = \sigma\left(W\_{u} \star\_{n}\left[h\_{t-1};x\_{t}\right] + b\_{u} \right) $$

$$ c = \rho\left(W\_{c} \star\_{n}\left[x\_{t}; r \odot h\_{t-1}\right] + b\_{c} \right) $$

$$ h\_{t} = u \odot h\_{t-1} + \left(1-u\right) \odot c $$

In these equations, $\sigma$ and $\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions, respectively, $\odot$ denotes elementwise multiplication, and $\star\_{n}$ represents a convolution with a kernel of size $n \times n$. Brackets denote feature concatenation.
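The update rule above can be written out as a single step in NumPy. This is a sketch under stated assumptions: the helper names (`conv2d_same`, `conv_gru_step`), the parameter dictionary layout, zero "same" padding, an odd kernel size, and the toy tensor sizes are all illustrative choices and do not come from the source.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded 'same' convolution (cross-correlation, as in most DL libraries).

    x: (C_in, H, W), w: (C_out, C_in, n, n) with odd n, b: (C_out,).
    """
    c_out, c_in, n, _ = w.shape
    pad = n // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(n):
        for j in range(n):
            # Contribution of kernel tap (i, j) at every spatial position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_gru_step(x_t, h_prev, p):
    """One ConvGRU update; p is a dict of weights and biases (hypothetical layout)."""
    hx = np.concatenate([h_prev, x_t], axis=0)                  # [h_{t-1}; x_t]
    r = sigmoid(conv2d_same(hx, p["W_r"], p["b_r"]))            # reset gate
    u = sigmoid(conv2d_same(hx, p["W_u"], p["b_u"]))            # update gate
    xrh = np.concatenate([x_t, r * h_prev], axis=0)             # [x_t; r ⊙ h_{t-1}]
    c = np.maximum(0.0, conv2d_same(xrh, p["W_c"], p["b_c"]))   # ReLU candidate
    return u * h_prev + (1.0 - u) * c

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
C_x, C_h, n, H, W = 2, 3, 3, 4, 4
params = {
    "W_r": rng.normal(size=(C_h, C_h + C_x, n, n)), "b_r": np.zeros(C_h),
    "W_u": rng.normal(size=(C_h, C_h + C_x, n, n)), "b_u": np.zeros(C_h),
    "W_c": rng.normal(size=(C_h, C_x + C_h, n, n)), "b_c": np.zeros(C_h),
}
x_t = rng.normal(size=(C_x, H, W))
h_prev = rng.normal(size=(C_h, H, W))
h_t = conv_gru_step(x_t, h_prev, params)
print(h_t.shape)  # (3, 4, 4)
```

Because every gate is produced by a convolution, the hidden state keeps its spatial layout, which is the main difference from a fully connected GRU.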

CGRU

Delving Deeper into Convolutional Networks for Learning Video Representations

Vision Transformers for Dense Prediction

In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pushes the embeddings of matched image-text pairs together and those of non-matched image-text pairs apart. The resulting representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.
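A contrastive loss of this kind can be sketched in NumPy as follows. This is an illustrative approximation rather than ALIGN's actual training code: the function names, the fixed `temperature` value, the symmetric sum of image-to-text and text-to-image terms, and the random toy embeddings are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Normalized-softmax loss over a batch where row i of each input is a matched pair.

    The temperature is fixed here for simplicity (an assumption of this sketch).
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature            # (B, B) scaled cosine similarities
    idx = np.arange(len(img))
    i2t = -log_softmax(logits)[idx, idx].mean()   # image-to-text classification term
    t2i = -log_softmax(logits.T)[idx, idx].mean() # text-to-image classification term
    return i2t + t2i

# Toy embeddings: each text embedding is a slightly noisy copy of its image embedding.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 16))
txt_emb = img_emb + 0.01 * rng.normal(size=(4, 16))

matched = contrastive_loss(img_emb, txt_emb)
shuffled = contrastive_loss(img_emb, np.roll(txt_emb, 1, axis=0))
print(matched < shuffled)  # True: aligned pairs give the lower loss
```

The same similarity matrix also supports retrieval: taking the row-wise argmax of `img @ txt.T` on normalized embeddings picks the best-matching text for each image.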

Source	Vision Transformers for Dense Prediction
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com

What is: Dense Prediction Transformer?

Viet-Anh on Software