What is: Crossmodal Contrastive Learning?
Source | UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
CMCL, or Crossmodal Contrastive Learning, is a method for aligning visual and textual representations and unifying them in a single semantic space. It is trained on a large-scale corpus of image collections, text corpora, and image-text pairs, with the image-text pairs providing the alignment signal. As shown in the figure, to facilitate different levels of semantic alignment between vision and language, a series of text rewriting techniques is used to increase the diversity of cross-modal information. Specifically, for an image-text pair, various positive examples and hard negative examples are obtained by rewriting the original caption at different levels. To incorporate more background information from single-modal data, text and image retrieval are also applied to augment each image-text pair with related texts and images. CMCL learns jointly from the positive pairs, the negative pairs, and the related images and texts. In this way, the model can unify different levels of visual and textual representations in the same semantic space and incorporate single-modal knowledge so that the two modalities enhance each other.
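The idea of pulling an image toward its positive captions and away from hard negatives can be illustrated with a small InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the exact UNIMO objective: the function names `cosine` and `contrastive_loss`, the cosine similarity, the temperature value, and the toy 2-D embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image, positives, negatives, temperature=0.1):
    """InfoNCE-style loss with several positives and several negatives.

    Assumption: the loss is -log(sum of positive scores / sum of all scores),
    where each score is exp(similarity / temperature). The real CMCL loss
    also draws on retrieved images and texts, which are omitted here.
    """
    pos = [math.exp(cosine(image, t) / temperature) for t in positives]
    neg = [math.exp(cosine(image, t) / temperature) for t in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))


# Toy embeddings (hypothetical): the image points along the first axis.
image = [1.0, 0.0]
positives = [[0.9, 0.1], [0.8, 0.2]]       # stand-ins for rewritten captions with the same meaning
hard_negatives = [[0.1, 0.9], [0.0, 1.0]]  # stand-ins for rewritten captions with altered meaning

# Loss when the positives are close to the image (well aligned).
aligned = contrastive_loss(image, positives, hard_negatives)
# Loss when the roles are swapped (badly aligned).
misaligned = contrastive_loss(image, hard_negatives, positives)

print(f"aligned loss:    {aligned:.4f}")
print(f"misaligned loss: {misaligned:.4f}")
```

The loss is small when the positive captions sit close to the image in the shared space and grows when the hard negatives are closer. Minimizing it is what drives the two modalities toward a common semantic space.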