What is: CPM-2?
Source | CPM-2: Large-scale Cost-effective Pre-trained Language Models |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
CPM-2 is an 11-billion-parameter pre-trained language model based on a standard Transformer architecture, consisting of a bidirectional encoder and a unidirectional decoder. The model is pre-trained on WuDaoCorpus, which contains 2.3 TB of cleaned Chinese data and 300 GB of cleaned English data. The pre-training process of CPM-2 is divided into three stages: Chinese pre-training, bilingual pre-training, and MoE (mixture-of-experts) pre-training. Each stage inherits the weights of the previous one rather than starting from random initialization, so this multi-stage training with knowledge inheritance significantly reduces the computation cost of pre-training. A minimal sketch of the encoder-decoder architecture follows below.
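To make the architecture description concrete, here is a minimal PyTorch sketch of an encoder-decoder Transformer with a bidirectional encoder and a causally masked (unidirectional) decoder. Every name and hyperparameter in it (`EncoderDecoderLM`, `vocab_size=32000`, `d_model=512`, and so on) is an illustrative placeholder, not CPM-2's actual configuration or code.

```python
import torch
import torch.nn as nn


class EncoderDecoderLM(nn.Module):
    """Toy encoder-decoder Transformer in the spirit of CPM-2:
    a bidirectional encoder feeding a unidirectional (causal) decoder.
    All sizes are illustrative placeholders, far below 11B-parameter scale."""

    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # The encoder attends bidirectionally over the whole source sequence;
        # the causal mask restricts the decoder to left-to-right attention.
        causal_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)
        ).to(tgt_ids.device)
        hidden = self.transformer(
            self.embed(src_ids), self.embed(tgt_ids), tgt_mask=causal_mask
        )
        return self.lm_head(hidden)  # next-token logits over the vocabulary


# Usage: (batch, seq_len) token-id tensors in, per-position logits out.
model = EncoderDecoderLM()
src = torch.randint(0, 32000, (2, 16))
tgt = torch.randint(0, 32000, (2, 16))
logits = model(src, tgt)  # shape: (2, 16, 32000)
```

In the multi-stage setup described above, the same model object would be trained on a Chinese corpus first, then continue from those inherited weights for the bilingual stage, rather than re-initializing between stages.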