What is: Vokenization?
Source | Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to related images ("vokens") via retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (the "vokenizer"). The vokenizer then generates vokens for large language corpora (e.g., English Wikipedia), and the visually supervised language model takes its supervision from these large corpora, bridging the gap between the different data sources.
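The retrieval step can be sketched as nearest-neighbor matching between contextualized token embeddings and a fixed pool of image embeddings. The sketch below is illustrative only: the function name, the cosine-similarity scoring, and the toy vectors are assumptions, not the paper's actual architecture.

```python
import numpy as np

def assign_vokens(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Illustrative voken retrieval (names and scoring are assumptions).

    token_embs: (num_tokens, d) contextualized token embeddings.
    image_embs: (num_images, d) embeddings of the candidate image pool.
    Returns the index of the highest-scoring image (the voken) per token.
    """
    # Normalize so the dot product equals cosine similarity.
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = t @ v.T              # (num_tokens, num_images) similarity matrix
    return scores.argmax(axis=1)  # one voken id per token

# Toy example: 3 tokens and 4 candidate images in a 2-d embedding space.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
images = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0]])
print(assign_vokens(tokens, images))  # → [0 1 2]
```

Once such a retriever is trained on the small grounded dataset, it can be run over a large text-only corpus to produce a voken for every token, giving the language model a visual supervision signal at scale.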