What is: Vision-and-Language BERT?
Source | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
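The core idea of the co-attentional layer is that each stream's queries attend over the *other* stream's keys and values, so image regions are conditioned on words and vice versa. Below is a minimal PyTorch sketch of that exchange, not the paper's implementation: the class name `CoAttentionLayer`, the dimensions, and the omission of the feed-forward sublayers that the full ViLBERT block also contains are all simplifications for illustration.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Illustrative co-attentional transformer layer (simplified):
    each modality's queries attend over the other modality's
    keys/values, exchanging information between the two streams."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # Visual stream: queries are image-region features,
        # keys/values come from the text stream (and vice versa).
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)
        # Residual connection + layer norm for each stream.
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

# Toy usage: a batch of 2 images with 36 region features and
# 2 sentences with 20 token embeddings, both projected to 768-d.
vis = torch.randn(2, 36, 768)
txt = torch.randn(2, 20, 768)
vis, txt = CoAttentionLayer()(vis, txt)
print(vis.shape, txt.shape)  # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```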