What is: Unified VLP?
Source | Unified Vision-Language Pre-Training for Image Captioning and VQA |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Unified VLP is a unified encoder-decoder model for general vision-language pre-training. The model uses a shared multi-layer Transformer network for both encoding and decoding, and it is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.

Model architecture for pre-training: the input comprises an image, a sentence, and three special tokens ([CLS], [SEP], [STOP]). The image is processed into Regions of Interest (RoIs), and region features are extracted for each RoI. The sentence is tokenized, and some tokens are replaced with [MASK] tokens for the masked language modeling task. The model consists of 12 Transformer blocks, each containing a masked self-attention layer and a feed-forward module, where the self-attention mask controls which input context the prediction conditions on. Two self-attention masks are implemented, depending on whether the objective is bidirectional or seq2seq (a sketch of both is given below). The pre-trained model is then fine-tuned for image captioning and visual question answering.
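The difference between the two objectives comes down to the self-attention mask. Below is a minimal sketch, not the released VLP code, of how the bidirectional and seq2seq masks could be built for an input laid out as [CLS] + region features + [SEP] + caption tokens + [STOP]; the helper name, sequence layout, and the convention that 1 means "may attend" and 0 means "blocked" are illustrative assumptions.

```python
import torch

def build_attention_mask(num_regions: int, num_tokens: int, mode: str) -> torch.Tensor:
    """Return an (L, L) mask where 1 = may attend, 0 = blocked.

    L = 1 ([CLS]) + num_regions + 1 ([SEP]) + num_tokens + 1 ([STOP]).
    """
    visual_len = 1 + num_regions + 1   # [CLS] + image regions + [SEP]
    text_len = num_tokens + 1          # caption tokens + [STOP]
    L = visual_len + text_len

    if mode == "bidirectional":
        # Every position can condition on the full visual and textual context.
        return torch.ones(L, L)

    if mode == "seq2seq":
        mask = torch.zeros(L, L)
        # Visual positions attend bidirectionally among themselves only.
        mask[:visual_len, :visual_len] = 1
        # Each text position attends to all visual positions plus the text
        # positions up to and including itself (left-to-right generation).
        mask[visual_len:, :visual_len] = 1
        mask[visual_len:, visual_len:] = torch.tril(torch.ones(text_len, text_len))
        return mask

    raise ValueError(f"unknown mode: {mode}")

# Example: 2 image regions and 3 caption tokens.
print(build_attention_mask(num_regions=2, num_tokens=3, mode="seq2seq"))
```

In the seq2seq case the mask trains the text side left-to-right while it still attends to the full visual context, which is why the same shared Transformer can later be fine-tuned for generative image captioning as well as for VQA.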