What is: Contrastive Language-Image Pre-training?
Source | Learning Transferable Visual Models From Natural Language Supervision |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Contrastive Language-Image Pre-training (CLIP), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
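The zero-shot step can be sketched as follows. This is a minimal illustration, not the actual CLIP API: `image_encoder`, `text_encoder`, and `tokenize` are hypothetical stand-ins for the trained encoders and tokenizer, and the prompt template is an assumption.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Embed a natural-language prompt for every class in the target dataset.
    # (Prompt wording and the encoder/tokenizer callables are illustrative.)
    prompts = [f"a photo of a {name}" for name in class_names]
    text_features = F.normalize(text_encoder(tokenize(prompts)), dim=-1)      # [C, d]

    # Embed the query image and score it against every class embedding
    # by cosine similarity; the best-matching prompt gives the prediction.
    image_features = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # [1, d]
    similarities = image_features @ text_features.t()                         # [1, C]
    return class_names[similarities.argmax(dim=-1).item()]
```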
For pre-training, CLIP is trained to predict which of the possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training the image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the real pairs in the batch while minimizing the cosine similarity of the embeddings of the incorrect pairings. A symmetric cross-entropy loss is optimized over these similarity scores.
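A minimal PyTorch sketch of this symmetric objective is shown below. The loss structure follows the description above; the temperature value and the shapes of the feature tensors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_features, text_features: [N, d] embeddings of N paired examples."""
    # L2-normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities between every image and every text,
    # scaled by the temperature (an assumed hyperparameter here).
    logits_per_image = image_features @ text_features.t() / temperature
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text, so the correct "class" is the diagonal.
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy over image-to-text and text-to-image similarities.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```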
Image credit: Learning Transferable Visual Models From Natural Language Supervision