What is: Vision-and-Language Transformer?

Source: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

ViLT is a minimal vision-and-language pre-training transformer model in which visual inputs are processed in the same convolution-free manner as text inputs. The model-specific components of ViLT require less computation than the transformer component that handles multimodal interactions. The model is pre-trained on the following objectives: image-text matching, masked language modeling, and word-patch alignment.
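
To make the "convolution-free" idea concrete, here is a minimal sketch of how an image could be turned into a token-like sequence and joined with text embeddings before reaching a shared transformer. It is an illustration only, not the ViLT implementation: the function names (`patchify`, `linear`, `build_input_sequence`), the toy sizes, and the random weights are all assumptions, and position embeddings, modality-type embeddings, and special tokens are left out for brevity. Pure Python lists are used so the example runs without dependencies; a real model would use a tensor library.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


def linear(vector, weights):
    """Project a vector with a (len(vector) x hidden) weight matrix. No convolution is involved."""
    return [sum(v * w for v, w in zip(vector, column)) for column in zip(*weights)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_input_sequence(token_ids, image, patch_size, word_table, patch_weights):
    """Embed text by table lookup and image patches by one linear projection, then concatenate."""
    text_embeddings = [word_table[token_id] for token_id in token_ids]
    patch_embeddings = [linear(patch, patch_weights) for patch in patchify(image, patch_size)]
    return text_embeddings + patch_embeddings


rng = random.Random(0)
hidden = 8        # toy hidden size (assumption)
patch_size = 4    # toy patch size (assumption)

# An 8 x 8 RGB image with random pixel values.
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
token_ids = [2, 5, 1]  # three made-up token ids from a 10-word toy vocabulary

word_table = random_matrix(10, hidden, rng)
patch_weights = random_matrix(patch_size * patch_size * 3, hidden, rng)

sequence = build_input_sequence(token_ids, image, patch_size, word_table, patch_weights)
print(len(sequence), len(sequence[0]))  # 7 8  (3 text tokens + 4 image patches, each of size 8)
```

The point of the sketch is that the image side needs nothing heavier than slicing and one matrix multiply, so both modalities reach the transformer as rows of the same width and most of the computation is left to the shared transformer layers.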