Many communication-efficient variants of [SGD](https://paperswithcode.com/method/sgd) use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.
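
To make the idea of a gradient quantization scheme concrete, the following is a minimal sketch of generic unbiased stochastic quantization onto a set of levels. It is not the ALQ or AMQ update rule, which the summary above does not spell out. The function name `quantize` and the two example level sets are assumptions made for illustration; an adaptive scheme would replace the fixed level list with one re-estimated from gradient statistics during training.

```python
import bisect
import math
import random


def quantize(grad, levels, rng):
    """Stochastically round normalized gradient magnitudes to quantization levels.

    `levels` is a sorted list with at least two entries, starting at 0.0 and
    ending at 1.0. Each coordinate is rounded up or down to a neighboring level
    with probabilities chosen so the result is unbiased in expectation.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    quantized = []
    for g in grad:
        r = min(abs(g) / norm, 1.0)  # normalized magnitude in [0, 1]
        # Index of the first level strictly above r, clipped to the last level.
        hi = min(bisect.bisect_right(levels, r), len(levels) - 1)
        lo = hi - 1
        # Probability of rounding up, proportional to the distance from the lower level.
        p_up = (r - levels[lo]) / (levels[hi] - levels[lo])
        level = levels[hi] if rng.random() < p_up else levels[lo]
        quantized.append(math.copysign(level * norm, g))
    return quantized


# Two hypothetical fixed level sets: uniformly and exponentially spaced.
UNIFORM_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]
EXPONENTIAL_LEVELS = [0.0, 0.125, 0.25, 0.5, 1.0]
```

Calling `quantize([3.0, -4.0], UNIFORM_LEVELS, random.Random(0))` returns a vector whose coordinates each sit on one of the two levels neighboring the true normalized magnitude, scaled back by the gradient norm. Only the norm, the signs, and the level indices would need to be communicated.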

**SRGAN** is a generative adversarial network for single image super-resolution. It uses a perceptual loss function that consists of an adversarial loss and a content loss. The adversarial loss pushes the solution toward the natural image manifold using a discriminator network that is trained to differentiate between super-resolved images and original photo-realistic images. In addition, the authors use a content loss motivated by perceptual similarity instead of similarity in pixel space. The actual networks, depicted in the Figure to the right, consist mainly of residual blocks for feature extraction.

Formally, we write the perceptual loss function as a weighted sum of a [VGG](https://paperswithcode.com/method/vgg) content loss $l^{SR}\_{X}$ and an adversarial loss component $l^{SR}\_{Gen}$:

$$ l^{SR} = l^{SR}\_{X} + 10^{-3}l^{SR}\_{Gen} $$
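
The weighted sum above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the content loss is assumed to be a mean squared error between feature vectors (plain lists of floats here, standing in for VGG feature maps), and the adversarial loss is assumed to be the sum of $-\log D(G(x))$ over discriminator outputs. All function names are hypothetical.

```python
import math

ADVERSARIAL_WEIGHT = 1e-3  # the 10^{-3} factor in the formula above


def content_loss(sr_features, hr_features):
    """Mean squared error between two equally sized feature vectors (assumed form)."""
    if len(sr_features) != len(hr_features):
        raise ValueError("feature vectors must have the same length")
    squared = ((s - h) ** 2 for s, h in zip(sr_features, hr_features))
    return sum(squared) / len(sr_features)


def adversarial_loss(disc_probs):
    """Sum of -log(p) over discriminator probabilities p in (0, 1] (assumed form)."""
    return sum(-math.log(p) for p in disc_probs)


def perceptual_loss(sr_features, hr_features, disc_probs):
    """Content loss plus the down-weighted adversarial loss."""
    return (content_loss(sr_features, hr_features)
            + ADVERSARIAL_WEIGHT * adversarial_loss(disc_probs))
```

Because of the small weight, the content term dominates the total unless the discriminator assigns very low probability to the super-resolved images.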

SRGAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

ALQ and AMQ

Adaptive Gradient Quantization for Data-Parallel SGD

**CPM-2** is an 11-billion-parameter pre-trained language model based on a standard Transformer architecture consisting of a bidirectional encoder and a unidirectional decoder. The model is pre-trained on WuDaoCorpus, which contains 2.3TB of cleaned Chinese data as well as 300GB of cleaned English data. The pre-training process of CPM-2 can be divided into three stages: Chinese pre-training, bilingual pre-training, and MoE pre-training. Multi-stage training with knowledge inheritance can significantly reduce the computation cost.

Source	Adaptive Gradient Quantization for Data-Parallel SGD
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com