**Low-Rank Factorization-based Multi-head Attention Mechanism**, or **LAMA**, is a type of attention module that uses low-rank factorization to reduce computational complexity. It uses low-rank bilinear pooling to construct a structured sentence representation that attends to multiple aspects of a sentence.
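
The idea of low-rank bilinear attention can be illustrated with a small NumPy sketch. It is not the paper's exact formulation: the global context vector `c`, the factor matrices `P` and `Q`, and the function name are assumptions made for illustration. Each head `j` scores a hidden state `h_t` with a rank-1 bilinear form `h_t^T (p_j q_j^T) c`, so `m` heads need `2dm` parameters rather than `m` full `d × d` matrices.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_multihead_attention(H, c, P, Q):
    """Hypothetical helper, not the reference implementation.

    H: (T, d) hidden states of a sentence with T tokens
    c: (d,)   global context vector (assumed learned)
    P, Q: (d, m) low-rank factors, one column pair per head
    """
    # Column j of the score matrix equals H @ outer(P[:, j], Q[:, j]) @ c,
    # computed without ever forming the d x d matrix.
    scores = (H @ P) * (c @ Q)        # (T, m)
    A = softmax(scores, axis=0)       # each head is normalized over the T tokens
    M = A.T @ H                       # (m, d): one sentence vector per head
    return A, M

rng = np.random.default_rng(0)
T, d, m = 5, 8, 3
H = rng.normal(size=(T, d))
c = rng.normal(size=d)
P = rng.normal(size=(d, m))
Q = rng.normal(size=(d, m))
A, M = low_rank_multihead_attention(H, c, P, Q)
```

Each row of `M` is a weighted sum of the hidden states under a different head, which is one way to read "attending to multiple aspects of a sentence".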

**PanGu-$α$** is an autoregressive language model (ALM) with up to 200 billion parameters, pretrained on a large corpus of mostly Chinese text. Its architecture is based on the Transformer, which has been used extensively as the backbone of pretrained language models such as [BERT](https://paperswithcode.com/method/bert) and [GPT](https://paperswithcode.com/method/gpt). Unlike those models, PanGu-$α$ adds a query layer on top of the Transformer layers, which aims to explicitly induce the expected output.
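
One way to picture such a query layer is as an attention layer whose query comes from an embedding of the position to be predicted instead of from the hidden state. The sketch below assumes that design, along with a single head, causal masking, and random weights; the names `P_next`, `Wq`, `Wk`, and `Wv` are hypothetical and nothing here is taken from the model's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_layer_attention(H, P_next, Wq, Wk, Wv):
    """Hypothetical single-head query layer.

    H:      (T, d) outputs of the last Transformer layer
    P_next: (T, d) embeddings of the *next* position for each time step (assumed)
    """
    T, d = H.shape
    q = P_next @ Wq                    # queries come from positions, not from H
    k = H @ Wk
    v = H @ Wv
    scores = q @ k.T / np.sqrt(d)      # (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # step t sees only steps <= t
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 4, 6
H = rng.normal(size=(T, d))
P_next = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = query_layer_attention(H, P_next, Wq, Wk, Wv)
```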

PanGu-$α$

PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

LAMA

Low Rank Factorization for Compact Multi-Head Self-Attention

**CBHG** is a building block used in the [Tacotron](https://paperswithcode.com/method/tacotron) text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit ([BiGRU](https://paperswithcode.com/method/bigru)). 

The module is used to extract representations from sequences. The input sequence is first convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C_{k}$ filters of width $k$ (i.e. $k = 1, 2, \dots, K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to $K$-grams). The [convolution](https://paperswithcode.com/method/convolution) outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is then passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. [Batch normalization](https://paperswithcode.com/method/batch-normalization) is used for all convolutional layers. The convolution outputs are fed into a multi-layer [highway network](https://paperswithcode.com/method/highway-network) to extract high-level features. Finally, a bidirectional [GRU](https://paperswithcode.com/method/gru) RNN is stacked on top to extract sequential features from both forward and backward context.
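
The first two steps, the convolution bank and the stride-1 max pooling, can be sketched in NumPy as follows. This is a toy illustration under stated assumptions: zero padding, a ReLU after each convolution, a pooling width of 2, and random weights are all choices made here, and batch normalization, the projection convolutions, the highway network, and the GRU are left out.

```python
import numpy as np

def conv1d_same(x, w):
    """Stride-1 1-D convolution with zero padding so the output keeps length T.

    x: (T, C_in) input sequence; w: (k, C_in, C_out) filters of width k.
    """
    k = w.shape[0]
    left = (k - 1) // 2
    right = k - 1 - left
    xp = np.pad(x, ((left, right), (0, 0)))
    T = x.shape[0]
    return np.stack([np.einsum("kc,kco->o", xp[t:t + k], w) for t in range(T)])

def conv_bank(x, filters):
    # filters[k - 1] holds the k-th set (width k); outputs are stacked on the channel axis.
    # The ReLU is an assumption of this sketch.
    return np.concatenate([np.maximum(conv1d_same(x, w), 0.0) for w in filters], axis=1)

def max_pool_time(x, width=2):
    # Max pooling along time with stride 1; padding with -inf keeps the length at T.
    T = x.shape[0]
    xp = np.pad(x, ((0, width - 1), (0, 0)), constant_values=-np.inf)
    return np.stack([xp[t:t + width].max(axis=0) for t in range(T)])

rng = np.random.default_rng(0)
T, C_in, K, C_k = 6, 4, 3, 5
x = rng.normal(size=(T, C_in))
filters = [rng.normal(size=(k, C_in, C_k)) for k in range(1, K + 1)]
bank = conv_bank(x, filters)      # shape (T, K * C_k)
pooled = max_pool_time(bank)      # same shape: time resolution is preserved
```

Because both the convolutions and the pooling use a stride of 1, `pooled` has one row per input time step, which is what allows the later residual connection back to the input sequence.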

Source	Low Rank Factorization for Compact Multi-Head Self-Attention
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Low-Rank Factorization-based Multi-Head Attention?
