What is: Talking-Heads Attention?
| Source | Talking-Heads Attention |
| Year | 2020 |
| Data Source | CC BY-SA - https://paperswithcode.com |
Talking-Heads Attention is a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation: two additional learned linear projections, $P_l$ and $P_w$, are inserted, which transform the attention logits and the attention weights respectively, moving information across attention heads. Instead of one "heads" dimension $h$ across the whole computation, there are now three separate heads dimensions, $h_k$, $h$, and $h_v$, which can optionally differ in size (number of "heads"): $h_k$ is the number of attention heads for the keys and the queries, $h$ is the number of attention heads for the logits and the weights, and $h_v$ is the number of attention heads for the values.
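To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the core operation for a single example (no batch dimension; the input/output projections that produce $Q$, $K$, $V$ and combine the heads are omitted). The function name and argument layout are illustrative assumptions, not the paper's reference pseudocode; only the two cross-head projections $P_l$ and $P_w$ and the three head dimensions $h_k$, $h$, $h_v$ follow the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(q, k, v, P_l, P_w):
    """Sketch of talking-heads attention (single example, no batch dim).

    q:   [n, h_k, d_k]  queries, h_k heads
    k:   [m, h_k, d_k]  keys,    h_k heads
    v:   [m, h_v, d_v]  values,  h_v heads
    P_l: [h_k, h]       learned projection across heads, applied to the logits
    P_w: [h, h_v]       learned projection across heads, applied to the weights
    """
    # Scaled dot-product logits per key/query head: [h_k, n, m]
    logits = np.einsum('nhd,mhd->hnm', q, k) / np.sqrt(q.shape[-1])
    # Mix information across heads immediately BEFORE the softmax: [h, n, m]
    logits = np.einsum('hnm,hj->jnm', logits, P_l)
    # Softmax over the key positions
    weights = softmax(logits, axis=-1)
    # Mix information across heads immediately AFTER the softmax: [h_v, n, m]
    weights = np.einsum('jnm,jv->vnm', weights, P_w)
    # Weighted sum of values: [n, h_v, d_v]
    return np.einsum('vnm,mvd->nvd', weights, v)
```

A quick usage example with heads of different sizes ($h_k = 2$, $h = 3$, $h_v = 2$):

```python
n, m, h_k, h, h_v, d_k, d_v = 4, 6, 2, 3, 2, 8, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(n, h_k, d_k))
k = rng.normal(size=(m, h_k, d_k))
v = rng.normal(size=(m, h_v, d_v))
P_l = rng.normal(size=(h_k, h))
P_w = rng.normal(size=(h, h_v))
out = talking_heads_attention(q, k, v, P_l, P_w)  # shape [n, h_v, d_v]
```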