What is: InterBERT?
Source | InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Year | 2020
Data Source | CC BY-SA - https://paperswithcode.com |
InterBERT aims to model the interaction between information flows from different modalities. The architecture builds multi-modal interaction while preserving the independence of single-modal representations. InterBERT consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.
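A minimal PyTorch sketch of this layout is shown below: an embedding layer per modality, a shared single-stream Transformer for cross-modal interaction, and separate per-modality Transformer stacks for extraction. All hyperparameters (hidden size, layer counts, region feature dimension) and the class name `InterBERTSketch` are illustrative assumptions, not the paper's values; image input is assumed to arrive as pre-extracted region features.

```python
import torch
import torch.nn as nn

class InterBERTSketch(nn.Module):
    """Sketch of the InterBERT layout: modality embeddings, a
    single-stream interaction module, and a two-stream extraction
    module. Hyperparameters are illustrative, not the paper's."""

    def __init__(self, vocab_size=30522, hidden=768, n_heads=12,
                 n_interaction_layers=6, n_extraction_layers=2,
                 region_feat_dim=2048, max_text_len=64):
        super().__init__()
        # Text embedding layer: token + position embeddings.
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_text_len, hidden)
        # Image embedding layer: project detector region features
        # into the shared hidden space.
        self.img_proj = nn.Linear(region_feat_dim, hidden)

        # Single-stream interaction module: one Transformer encoder
        # over the concatenated image and text sequences.
        self.interaction = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True),
            n_interaction_layers)

        # Two-stream extraction module: separate Transformer stacks
        # that keep modality-specific representations independent.
        self.text_stream = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True),
            n_extraction_layers)
        self.image_stream = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True),
            n_extraction_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (batch, text_len)
        # region_feats: (batch, n_regions, region_feat_dim)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        text = self.tok_emb(token_ids) + self.pos_emb(positions)
        image = self.img_proj(region_feats)

        # Joint single-stream encoding, then split back into the
        # two modality streams for independent extraction.
        fused = self.interaction(torch.cat([image, text], dim=1))
        img_part = fused[:, :image.size(1)]
        txt_part = fused[:, image.size(1):]
        return self.image_stream(img_part), self.text_stream(txt_part)

# Shape check: 2 captions of 16 tokens, 36 regions per image.
model = InterBERTSketch()
img_out, txt_out = model(torch.randint(0, 30522, (2, 16)),
                         torch.randn(2, 36, 2048))
print(img_out.shape, txt_out.shape)  # (2, 36, 768) and (2, 16, 768)
```

The per-modality outputs from the extraction streams are what the three pre-training heads (masked segment modeling, masked region modeling, and image-text matching) would attach to; those heads are omitted here.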