What is: Vision-and-Language Transformer?
Source | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
ViLT is a minimal vision-and-language pre-training transformer model in which the processing of visual inputs is drastically simplified to the same convolution-free manner used for text inputs: images are embedded with a lightweight patch projection rather than a CNN backbone or region detector. The modality-specific components of ViLT require less computation than the transformer component that handles multimodal interactions. The model is pre-trained on the following objectives: image-text matching, masked language modeling, and word-patch alignment.
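The sketch below illustrates the convolution-free visual embedding idea in PyTorch: image patches are flattened and passed through a single linear projection, then concatenated with text token embeddings so one transformer can process both modalities. This is a minimal illustration, not the authors' implementation; the class name, dimensions, and modality-type embeddings are illustrative assumptions, and positional embeddings and special tokens are omitted for brevity.

```python
import torch
import torch.nn as nn

class MinimalViLTEmbedding(nn.Module):
    """Illustrative sketch (hypothetical class, not ViLT's actual code):
    embeds image patches with one linear projection instead of a CNN or
    region detector, then concatenates them with text embeddings."""

    def __init__(self, vocab_size=30522, dim=768, patch_size=32, channels=3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Convolution-free visual embedding: flatten each patch and
        # apply a single linear layer (ViT-style patch projection).
        self.patch_proj = nn.Linear(channels * patch_size * patch_size, dim)
        self.patch_size = patch_size
        # Modality-type embeddings distinguish text (0) from image (1).
        self.modality_embed = nn.Embedding(2, dim)

    def forward(self, token_ids, image):
        # token_ids: (batch, seq_len); image: (batch, C, H, W)
        b, c, h, w = image.shape
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten.
        patches = image.unfold(2, p, p).unfold(3, p, p)  # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        text = self.text_embed(token_ids) + self.modality_embed(
            torch.zeros_like(token_ids))
        visual = self.patch_proj(patches) + self.modality_embed(
            torch.ones(b, patches.size(1), dtype=torch.long))
        # The concatenated sequence is processed jointly by one transformer,
        # where the bulk of the computation for multimodal interaction occurs.
        return torch.cat([text, visual], dim=1)

# Usage: a 224x224 image yields (224/32)^2 = 49 patch embeddings.
embed = MinimalViLTEmbedding()
fused = embed(torch.randint(0, 30522, (1, 16)), torch.randn(1, 3, 224, 224))
print(fused.shape)  # torch.Size([1, 65, 768]) -> 16 text + 49 image tokens
```

Because the visual embedder is just a linear layer, almost all of the model's parameters and compute sit in the shared transformer, which is the design point the description above emphasizes.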