What is: PanGu-α?
Source | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
PanGu-α is an autoregressive language model (ALM) with up to 200 billion parameters, pretrained on a large corpus of text, mostly in Chinese. Its architecture is based on the Transformer, which has been widely used as the backbone of pretrained language models such as BERT and GPT. Unlike them, PanGu-α adds a query layer on top of the Transformer layers, designed to explicitly induce the expected output.
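To make the query-layer idea concrete, below is a minimal PyTorch sketch. It assumes, per the paper's description, that the query in this top layer comes from a learned positional embedding (standing in for the position of the token to be predicted) rather than from the previous layer's hidden states, which serve only as keys and values. All class and variable names here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class QueryLayer(nn.Module):
    """Sketch of a PanGu-α-style query layer: attention whose query is a
    learned embedding of the output position, attending over the hidden
    states produced by the top Transformer layer."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        # One learned query vector per position (assumption: position-indexed).
        self.query_emb = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) output of the last Transformer layer.
        batch, seq_len, _ = hidden.shape
        positions = torch.arange(seq_len, device=hidden.device)
        # Positional query embeddings, broadcast across the batch.
        q = self.query_emb(positions).expand(batch, -1, -1)
        # Causal mask: the query at position n may only attend to states <= n.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=hidden.device),
            diagonal=1,
        )
        out, _ = self.attn(q, hidden, hidden, attn_mask=causal)
        return self.norm(out + q)
```

The key design point is that the query is decoupled from the token representations: the layer asks "what should appear at this position?" directly, instead of deriving the question from the running hidden state as an ordinary self-attention layer would.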