In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs apart. The resulting representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.
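The contrastive objective described above can be illustrated with a minimal sketch. It assumes a batch in which `image_embs[i]` and `text_embs[i]` form a matched pair and every other combination counts as non-matched. The function names and the fixed `temperature` value are illustrative assumptions, not details taken from the ALIGN implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric normalized-softmax loss over a batch of matched pairs.

    Assumption: image_embs[i] and text_embs[i] are a matched pair;
    all other (i, j) combinations are treated as non-matched.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity of every image with every text, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_nll_of_diagonal(rows):
        # Mean negative log-softmax probability of the matched (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = mean_nll_of_diagonal(logits)
    text_to_image = mean_nll_of_diagonal([list(col) for col in zip(*logits)])
    return image_to_text + text_to_image
```

The loss is small when each image is most similar to its own text and grows when matched pairs are less similar than non-matched ones, which is the "push together, push apart" behaviour the paragraph describes.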

**Dense Prediction Transformers** (DPT) are a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) for dense prediction tasks.

The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a [ResNet](https://paperswithcode.com/method/resnet)-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding and a patch-independent readout token (red) is added. The tokens are passed through multiple [transformer](https://paperswithcode.com/method/transformer) stages. The tokens are reassembled from different stages into an image-like representation at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample the representations to generate a fine-grained prediction.
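The patch-based tokenization step (the DPT-Base and DPT-Large route) can be sketched as follows. This is a simplified illustration with plain nested lists: `patchify`, `linear`, and `embed` are hypothetical helper names, the weights are passed in rather than learned, and the sketch follows the order in the text by adding the positional embedding to the patch tokens before prepending the readout token. Whether the readout token also receives a positional embedding is not settled here.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Rows and columns that do not fill a whole patch are dropped.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - h % patch, patch):
        for left in range(0, w - w % patch, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append all channels of this pixel
            tokens.append(flat)
    return tokens


def linear(x, weight):
    """Project vector x with a weight matrix of shape [in_dim][out_dim]."""
    out_dim = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(out_dim)]


def embed(image, patch, weight, pos_emb, readout):
    """Patches -> linear projection -> add positional embedding -> prepend readout token."""
    tokens = [linear(p, weight) for p in patchify(image, patch)]
    tokens = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_emb)]
    return [list(readout)] + tokens
```

For a 4 x 4 single-channel image with 2 x 2 patches, `embed` returns five tokens: one readout token followed by four patch tokens, which is the sequence the transformer stages would then process.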

Vision Transformers for Dense Prediction

ALIGN

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

LINE is a novel network embedding method which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures.
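In the cited paper, local structure is modeled through first-order proximity (directly connected vertices should have similar embeddings) and global structure through second-order proximity (vertices with similar neighborhoods should have similar embeddings). The sketch below writes both objectives for a small weighted edge list. The function names are hypothetical, and the second-order loss uses a full softmax over all vertices, which is only practical for tiny graphs; the paper relies on negative sampling instead.

```python
import math


def first_order_loss(edges, emb):
    """-sum over edges (i, j, w) of w * log sigmoid(u_i . u_j)."""
    loss = 0.0
    for i, j, w in edges:
        dot = sum(a * b for a, b in zip(emb[i], emb[j]))
        loss -= w * math.log(1.0 / (1.0 + math.exp(-dot)))
    return loss


def second_order_loss(edges, emb, ctx):
    """-sum over edges (i, j, w) of w * log p(j | i).

    p(j | i) is a softmax of u_i . c_k over all context embeddings c_k.
    """
    loss = 0.0
    for i, j, w in edges:
        scores = [sum(a * b for a, b in zip(emb[i], c)) for c in ctx]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss -= w * (scores[j] - log_z)
    return loss
```

Both losses weight each edge by `w`, so the same code covers unweighted graphs (all weights 1) and weighted ones. For a directed graph the edge list holds ordered pairs; for an undirected graph each edge can be listed in both directions.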

Source: [Tang et al.](https://arxiv.org/pdf/1503.03578v1.pdf)


Source	Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com