What is: Dense Prediction Transformer?
Source | Vision Transformers for Dense Prediction |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
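Dense Prediction Transformers (DPT) are vision transformers designed for dense prediction tasks, i.e. tasks that produce an output for every pixel, such as monocular depth estimation and semantic segmentation.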
The input image is first transformed into tokens, either by extracting non-overlapping patches and linearly projecting their flattened representation (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token is added to the token set. The tokens are then passed through multiple transformer stages. Tokens from different stages are reassembled into image-like representations at multiple resolutions. Finally, fusion modules progressively fuse and upsample these representations to generate a fine-grained prediction.
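
The data flow described above can be illustrated with a small NumPy sketch. It is not the reference implementation: the transformer stages are omitted, the weights are random, and all function names (`patchify`, `embed`, `reassemble`, `upsample2x`, `fuse`) are hypothetical. Three simplifications are assumed: the readout token is placed at index 0 and simply dropped during reassembly, nearest-neighbour upsampling stands in for the learned resampling layers, and element-wise addition stands in for the learned fusion modules.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c), (gh, gw)

def embed(image, patch, dim, rng):
    """Linear projection of patches, plus readout token and positional embedding."""
    patches, grid = patchify(image, patch)
    w_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02  # random stand-in weights
    tokens = patches @ w_proj                           # (N, dim)
    readout = np.zeros((1, dim))                        # patch-independent readout token
    tokens = np.concatenate([readout, tokens], axis=0)  # (N + 1, dim), readout at index 0 (assumption)
    pos = rng.standard_normal(tokens.shape) * 0.02      # stand-in for a learned positional embedding
    return tokens + pos, grid

def reassemble(tokens, grid):
    """Drop the readout token and reshape the rest into an image-like map."""
    gh, gw = grid
    return tokens[1:].reshape(gh, gw, -1)

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling (stand-in for learned resampling)."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def fuse(coarse, fine):
    """Upsample the coarse map and add it to the finer one (simplified fusion)."""
    return upsample2x(coarse) + fine

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
tokens, grid = embed(image, patch=16, dim=8, rng=rng)
coarse = reassemble(tokens, grid)   # map at the patch-grid resolution
fine = upsample2x(coarse)           # pretend this came from an earlier stage at 2x resolution
fused = fuse(coarse, fine)
print(tokens.shape, coarse.shape, fused.shape)  # (5, 8) (2, 2, 8) (4, 4, 8)
```

With a 32x32 image and 16x16 patches, the sketch yields four patch tokens plus one readout token, a 2x2 image-like map after reassembly, and a 4x4 map after one fusion step. In a real model each fusion step would be a learned module, and the maps being fused would come from different transformer stages rather than from the same tokens.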