What is: Gradient Sparsification?
Source | Gradient Sparsification for Communication-Efficient Distributed Optimization |
Year | 2018 |
Data Source | CC BY-SA - https://paperswithcode.com |
Gradient Sparsification is a technique for distributed training that sparsifies stochastic gradients to reduce communication cost, at the expense of a minor increase in the number of iterations. The key idea behind this sparsification technique is to randomly drop coordinates of the stochastic gradient and appropriately amplify the remaining coordinates so that the sparsified gradient stays unbiased: if coordinate i is kept with probability p_i, it is rescaled to g_i / p_i, so its expectation is p_i * (g_i / p_i) = g_i. This approach can significantly reduce the coding length of the stochastic gradient while only slightly increasing its variance.
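As a concrete illustration, below is a minimal NumPy sketch of the drop-and-amplify step. The magnitude-proportional keep probabilities and the `kappa` budget parameter here are illustrative simplifications, not the paper's exact optimization-derived probabilities; the essential property the sketch demonstrates is that a coordinate kept with probability p_i is rescaled by 1/p_i, which makes the sparsified gradient an unbiased estimate of the original.

```python
import numpy as np

def sparsify_gradient(grad, kappa=0.25, rng=None):
    """Unbiased gradient sparsification (sketch).

    Each coordinate i is kept with probability p_i and rescaled to
    grad[i] / p_i, so the output has expectation equal to grad.
    The probability rule p_i = min(1, kappa * |g_i| / mean|g|) is a
    hypothetical stand-in for the paper's optimal probabilities;
    `kappa` controls the expected fraction of coordinates kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    abs_g = np.abs(grad)
    # Keep large-magnitude coordinates with higher probability, capped at 1.
    p = np.minimum(1.0, kappa * abs_g / (abs_g.mean() + 1e-12))
    # Bernoulli mask: coordinate i survives with probability p[i].
    mask = rng.random(grad.shape) < p
    # Amplify survivors by 1/p[i]; drop the rest to exact zeros.
    return np.where(mask, grad / np.maximum(p, 1e-12), 0.0)

# Usage: most coordinates become zero, so the vector encodes cheaply,
# while repeated calls average back to the original gradient.
g = np.array([0.5, -0.01, 2.0, 0.003, -1.2])
print(sparsify_gradient(g))
```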