What is: Vokenization?
Source | Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to related images ("vokens") via retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (the "vokenizer"). The vokenizer then generates vokens for large language corpora (e.g., English Wikipedia), and the visually supervised language model takes its supervision from these large corpora, bridging the gap between the different data sources.
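The retrieval step can be sketched as nearest-neighbor matching between contextualized token embeddings and a fixed pool of image embeddings. The sketch below is illustrative only: the function name, the cosine-similarity scoring, and the toy vectors are assumptions, not the paper's actual architecture.

```python
import numpy as np

def assign_vokens(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Illustrative voken retrieval (names and scoring are assumptions).

    token_embs: (num_tokens, d) contextualized token embeddings.
    image_embs: (num_images, d) embeddings of the candidate image pool.
    Returns the index of the highest-scoring image (the voken) per token.
    """
    # Normalize so the dot product equals cosine similarity.
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = t @ v.T              # (num_tokens, num_images) similarity matrix
    return scores.argmax(axis=1)  # one voken id per token

# Toy example: 3 tokens and 4 candidate images in a 2-d embedding space.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
images = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0]])
print(assign_vokens(tokens, images))  # → [0 1 2]
```

Once such a retriever is trained on the small grounded dataset, it can be run over a large text-only corpus to produce a voken for every token, giving the language model a visual supervision signal at scale.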