What is: VATT?
Source | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Video-Audio-Text Transformer, or VATT, is a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, it takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture of BERT and ViT, except for the tokenization and linear projection layers, which are reserved for each modality separately. The design follows the same spirit as ViT: make minimal changes to the architecture so that the learned model can transfer its weights to various frameworks and tasks.
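A minimal sketch of this layout is shown below, assuming placeholder patch sizes, hidden dimensions, and vocabulary size (not the paper's exact values): each modality gets its own tokenization and linear projection, while the Transformer encoder itself is architecturally identical across modalities.

```python
import torch
import torch.nn as nn

class VATTSketch(nn.Module):
    """Illustrative sketch of a VATT-style model: per-modality tokenization and
    linear projection feed an otherwise identical Transformer encoder.
    All sizes below are assumptions for illustration only."""

    def __init__(self, d_model=768, depth=12, heads=12, vocab_size=30522):
        super().__init__()
        # Modality-specific tokenization + linear projection layers.
        self.video_proj = nn.Conv3d(3, d_model, kernel_size=(4, 16, 16),
                                    stride=(4, 16, 16))                 # video -> 3D patches
        self.audio_proj = nn.Conv1d(1, d_model, kernel_size=128,
                                    stride=128)                         # raw waveform -> segments
        self.text_embed = nn.Embedding(vocab_size, d_model)             # word ids -> vectors

        # The Transformer backbone has the same architecture for every modality.
        layer = nn.TransformerEncoderLayer(d_model, heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video, audio, text_ids):
        # video: (B, 3, T, H, W), audio: (B, 1, L), text_ids: (B, N)
        v = self.video_proj(video).flatten(2).transpose(1, 2)   # (B, Nv, d_model)
        a = self.audio_proj(audio).transpose(1, 2)              # (B, Na, d_model)
        t = self.text_embed(text_ids)                           # (B, Nt, d_model)
        return self.encoder(v), self.encoder(a), self.encoder(t)
```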
VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the different granularities of the modalities, and noise contrastive estimation (NCE) is employed to train the model.
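The sketch below illustrates one way such a hierarchical common space and a batch-wise NCE objective could be wired up; the projection dimensions and temperature are assumptions, and the paper additionally uses a MIL-NCE-style variant for the noisier video-text pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalCommonSpace(nn.Module):
    """Sketch of a semantically hierarchical common space: video and audio are
    compared in a fine-grained space, while video and text are compared in a
    coarser space reached by a further projection. Dimensions are placeholders."""

    def __init__(self, d_model=768, d_va=512, d_vt=256):
        super().__init__()
        self.video_to_va = nn.Linear(d_model, d_va)   # video -> video-audio space
        self.audio_to_va = nn.Linear(d_model, d_va)   # audio -> video-audio space
        self.va_to_vt = nn.Linear(d_va, d_vt)         # video, one level coarser
        self.text_to_vt = nn.Linear(d_model, d_vt)    # text -> video-text space

    def forward(self, video_feat, audio_feat, text_feat):
        z_v_va = self.video_to_va(video_feat)
        z_a_va = self.audio_to_va(audio_feat)
        z_v_vt = self.va_to_vt(z_v_va)
        z_t_vt = self.text_to_vt(text_feat)
        return z_v_va, z_a_va, z_v_vt, z_t_vt


def nce_loss(z_anchor, z_positive, temperature=0.07):
    """Batch-wise noise contrastive estimation: the matching pair is the
    positive, and all other pairings within the batch act as negatives."""
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    logits = z_anchor @ z_positive.t() / temperature          # (B, B) similarities
    targets = torch.arange(z_anchor.size(0), device=z_anchor.device)
    return F.cross_entropy(logits, targets)
```

A total training objective could then combine the two levels, e.g. `nce_loss(z_v_va, z_a_va) + nce_loss(z_v_vt, z_t_vt)`, reflecting that video-audio alignment is contrasted at a finer semantic level than video-text alignment.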