**Transformer-XL** (meaning extra long) is a [Transformer](https://paperswithcode.com/method/transformer) architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds a recurrent connection between the segments. As a result, modeling very long-term dependencies becomes possible because information can be propagated through the recurrent connections. As an additional contribution, Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
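
The segment-level recurrence described above can be illustrated with a minimal sketch. It assumes a single attention head and a single layer, and it leaves out the relative positional encoding entirely. The function names (`matvec`, `attend_with_memory`, `run_segments`) and the list-of-rows weight matrices are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend_with_memory(segment, memory, Wq, Wk, Wv):
    """Single-head causal attention over [memory ; segment].

    segment: hidden-state vectors of the current segment
    memory:  cached hidden-state vectors from the previous segment
    """
    context = memory + segment                 # keys/values see memory and segment
    keys = [matvec(Wk, x) for x in context]
    values = [matvec(Wv, x) for x in context]
    outputs = []
    for i, x in enumerate(segment):
        q = matvec(Wq, x)                      # queries come from the current segment only
        visible = len(memory) + i + 1          # every memory slot, plus positions <= i
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
            for k in keys[:visible]
        ]
        top = max(scores)                      # subtract the max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs

def run_segments(segments, Wq, Wk, Wv):
    """Process segments in order, reusing the previous segment as memory."""
    memory = []                                # nothing is cached before the first segment
    outputs = []
    for segment in segments:
        outputs.append(attend_with_memory(segment, memory, Wq, Wk, Wv))
        # Cache this layer's inputs for the next segment. In a trained model
        # the cached states would be treated as constants (no gradient).
        memory = segment
    return outputs
```

With more than one layer, each layer would keep its own cache of the previous segment's hidden states, so the effective context grows with depth.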

A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https://paperswithcode.com/method/lstm), but has only two gates, a reset gate and an update gate, and notably lacks an output gate. Having fewer parameters means GRUs are generally easier and faster to train than their LSTM counterparts.
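
To make the two gates concrete, here is a minimal sketch of a single GRU step in plain Python. The parameter names (`Wz`, `Uz`, `bz`, and so on) and the dictionary layout are assumptions made for illustration. Published formulations also differ on whether the update gate weights the old state or the candidate; this sketch assumes the convention in which an update gate of 1 keeps the old state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-of-rows matrices W and U."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]

def gru_step(x, h, p):
    """One GRU step. p is a dict of (hypothetically named) parameters."""
    # Update gate: how much of the old state to keep.
    z = [sigmoid(a) for a in affine(p["Wz"], p["Uz"], p["bz"], x, h)]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a) for a in affine(p["Wr"], p["Ur"], p["br"], x, h)]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    # Candidate state computed from the input and the reset-gated old state.
    h_tilde = [math.tanh(a) for a in affine(p["Wh"], p["Uh"], p["bh"], x, gated_h)]
    # Assumed convention: h_new = z * h + (1 - z) * h_tilde.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, h_tilde)]
```

The step uses three sets of weights (update gate, reset gate, candidate), where an LSTM uses four, which is where the parameter saving comes from.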

Image Source: [here](https://commons.wikimedia.org/wiki/File:Gated_Recurrent_Unit,_type_1.svg)

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Transformer-XL

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

**MultiGrain** is a type of image model that learns a single embedding for classes, instances and copies. In other words, it is a convolutional neural network that is suitable for both image classification and instance retrieval. MultiGrain is learned by jointly training an image embedding for multiple tasks. The resulting representation is compact and can outperform narrowly-trained embeddings, and the learned embedding incorporates different levels of granularity.

Source	Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Transformer-XL?
