As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel [SGD](https://paperswithcode.com/method/sgd) is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees. For practical purposes, however, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has stronger theoretical guarantees than QSGD while matching or exceeding the empirical performance of the QSGDinf heuristic and of other compression methods.
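To make "quantizes gradients" concrete, the following is a minimal sketch of unbiased stochastic quantization onto a grid of levels in $[0, 1]$, applied to gradient magnitudes normalized by the Euclidean norm. The function names, the choice of norm, and the exponentially spaced grid are illustrative assumptions, not details taken from the text above; the encoding step is omitted.

```python
import math
import random


def uniform_levels(s):
    """s + 1 evenly spaced levels in [0, 1] (a uniform grid)."""
    return [i / s for i in range(s + 1)]


def exponential_levels(s):
    """Levels 0, 2^-s, ..., 1/2, 1 (an assumed example of a nonuniform grid)."""
    return [0.0] + [2.0 ** (-j) for j in range(s, -1, -1)]


def stochastic_quantize(grad, levels, rng):
    """Randomly round each |g_i| / ||g|| to a neighbouring level.

    The rounding probabilities are chosen so that the expected value of the
    output equals the input (the quantizer is unbiased).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    out = []
    for g in grad:
        r = min(abs(g) / norm, 1.0)  # normalised magnitude in [0, 1]
        hi_idx = next(i for i, lv in enumerate(levels) if lv >= r)
        if levels[hi_idx] == r:
            q = r  # already on the grid
        else:
            lo, hi = levels[hi_idx - 1], levels[hi_idx]
            p_up = (r - lo) / (hi - lo)  # probability of rounding up
            q = hi if rng.random() < p_up else lo
        out.append(math.copysign(norm * q, g))
    return out


rng = random.Random(0)
grad = [0.3, -0.4, 0.0, 1.2]
print(stochastic_quantize(grad, uniform_levels(4), rng))
print(stochastic_quantize(grad, exponential_levels(3), rng))
```

Only the norm, the signs, and the level indices would need to be transmitted, which is where the communication saving comes from in this kind of scheme.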

**VoTr** is a [Transformer](https://paperswithcode.com/method/transformer)-based 3D backbone for 3D object detection from point clouds. It contains a series of sparse and submanifold voxel modules. Submanifold voxel modules perform multi-head self-attention strictly on the non-empty voxels, while sparse voxel modules can extract voxel features at empty locations. Long-range relationships between voxels are captured via self-attention.

Because non-empty voxels are naturally sparse but numerous, directly applying a standard Transformer to voxels is non-trivial. To this end, VoTr uses a sparse voxel module and a submanifold voxel module, which operate effectively on the empty and non-empty voxel positions, respectively. To further enlarge the attention range while keeping the computational overhead comparable to that of convolutional counterparts, two attention mechanisms are used for [multi-head attention](https://paperswithcode.com/method/multi-head-attention) in those two modules: Local Attention and Dilated Attention. Furthermore, [Fast Voxel Query](https://paperswithcode.com/method/fast-voxel-query) is used to accelerate the querying process in multi-head attention.
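As a rough illustration of why querying matters, the sketch below gathers the non-empty voxels inside a small window around a query position, which is the set a local attention step would attend to. A Python `dict` stands in for whatever index structure the real voxel query uses, and the names `build_voxel_index` and `local_attending_voxels`, as well as the cubic window, are hypothetical.

```python
from itertools import product


def build_voxel_index(coords):
    """Map each non-empty voxel coordinate (x, y, z) to its row in the feature array."""
    return {tuple(c): i for i, c in enumerate(coords)}


def local_attending_voxels(index, center, radius=1):
    """Rows of the non-empty voxels within `radius` cells of `center` along every axis."""
    cx, cy, cz = center
    rows = []
    for dx, dy, dz in product(range(-radius, radius + 1), repeat=3):
        row = index.get((cx + dx, cy + dy, cz + dz))
        if row is not None:  # empty positions are simply skipped
            rows.append(row)
    return rows


coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (0, 1, 1)]
index = build_voxel_index(coords)

# Query centred on a non-empty voxel (the submanifold case).
print(sorted(local_attending_voxels(index, (0, 0, 0))))
# Query centred on an empty position (the sparse case).
print(sorted(local_attending_voxels(index, (1, 1, 1))))
```

Each lookup costs constant expected time, so the work per query scales with the window size rather than with the total number of non-empty voxels.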

VoTr

Voxel Transformer for 3D Object Detection

NUQSGD

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

**SRU++** is a self-attentive recurrent unit that combines fast recurrence and attention for sequence modeling, extending the [SRU](https://www.paperswithcode.com/method/sru) unit. The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Specifically, given the input sequence represented as a matrix $\mathbf{X} \in \mathbb{R}^{L \times d}$, the attention component computes the query, key and value representations using the following multiplications,

$$
\mathbf{Q} =\mathbf{W}^{q} \mathbf{X}^{\top} 
$$

$$
\mathbf{K} =\mathbf{W}^{k} \mathbf{Q}
$$

$$
\mathbf{V} =\mathbf{W}^{v} \mathbf{Q}
$$

where $\mathbf{W}^{q} \in \mathbb{R}^{d^{\prime} \times d}$ and $\mathbf{W}^{k}, \mathbf{W}^{v} \in \mathbb{R}^{d^{\prime} \times d^{\prime}}$ are model parameters, and $d^{\prime}$ is the attention dimension, which is typically much smaller than $d$. Note that the keys $\mathbf{K}$ and values $\mathbf{V}$ are computed from $\mathbf{Q}$ instead of $\mathbf{X}$, so that the weight matrices $\mathbf{W}^{k}$ and $\mathbf{W}^{v}$ are significantly smaller.
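The saving from computing keys and values from $\mathbf{Q}$ can be checked with simple arithmetic. The sizes below ($d = 2048$, $d^{\prime} = 512$) are hypothetical and chosen only for illustration; biases are ignored.

```python
def projection_params(d, d_attn, keys_from_q=True):
    """Count the weights in W^q, W^k and W^v (biases ignored)."""
    w_q = d_attn * d
    kv_in = d_attn if keys_from_q else d  # input width of W^k and W^v
    return w_q + 2 * d_attn * kv_in


d, d_attn = 2048, 512  # hypothetical sizes
print(projection_params(d, d_attn, keys_from_q=True))   # 1572864
print(projection_params(d, d_attn, keys_from_q=False))  # 3145728
```

With these sizes, projecting keys and values from $\mathbf{Q}$ halves the number of projection weights compared with projecting all three from $\mathbf{X}$.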

Next, we compute a weighted average output $\mathbf{A} \in \mathbb{R}^{d^{\prime} \times L}$ using [scaled dot-product attention](https://paperswithcode.com/method/scaled):

$$
\mathbf{A}^{\top}=\operatorname{softmax}\left(\frac{\mathbf{Q}^{\top} \mathbf{K}}{\sqrt{d^{\prime}}}\right) \mathbf{V}^{\top}
$$
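The three projections and the attention step can be sketched in NumPy as follows. This is a single-head sketch that follows the formulas above literally (no masking, no biases); the function names are illustrative rather than taken from any reference implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def srupp_attention(X, W_q, W_k, W_v):
    """X has shape (L, d). Returns Q and A, both of shape (d', L)."""
    Q = W_q @ X.T  # (d', L)
    K = W_k @ Q    # (d', L), computed from Q rather than X
    V = W_v @ Q    # (d', L)
    d_attn = Q.shape[0]
    weights = softmax(Q.T @ K / np.sqrt(d_attn), axis=-1)  # (L, L), rows sum to 1
    A = (weights @ V.T).T  # (d', L)
    return Q, A


rng = np.random.default_rng(0)
L, d, d_attn = 5, 8, 3  # hypothetical sizes
X = rng.standard_normal((L, d))
W_q = rng.standard_normal((d_attn, d))
W_k = rng.standard_normal((d_attn, d_attn))
W_v = rng.standard_normal((d_attn, d_attn))
Q, A = srupp_attention(X, W_q, W_k, W_v)
print(Q.shape, A.shape)
```

Each column of $\mathbf{A}$ is a convex combination of the columns of $\mathbf{V}$, which is what "weighted average output" refers to.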

The final output $\mathbf{U}$ required by the elementwise recurrence is obtained by another linear projection,

$$
\mathbf{U}^{\top}=\mathbf{W}^{o}(\mathbf{Q}+\alpha \cdot \mathbf{A})
$$

where $\alpha \in \mathbb{R}$ is a learned scalar and $\mathbf{W}^{o} \in \mathbb{R}^{3 d \times d^{\prime}}$ is a parameter matrix. $\mathbf{Q}+\alpha \cdot \mathbf{A}$ is a [residual connection](https://paperswithcode.com/method/residual-connection) which improves gradient propagation and stabilizes training. We initialize $\alpha$ to zero and as a result,

$$
\mathbf{U}^{\top}=\mathbf{W}^{o} \mathbf{Q}=\left(\mathbf{W}^{o} \mathbf{W}^{q}\right) \mathbf{X}^{\top}
$$

initially falls back to a linear transformation of the input $\mathbf{X}$, skipping the attention transformation. Intuitively, skipping attention encourages the model to leverage recurrence to capture sequential patterns during the early stages of training. As $|\alpha|$ grows, the attention mechanism can learn long-range dependencies for the model. In addition, $\mathbf{W}^{o} \mathbf{W}^{q}$ can be interpreted as applying a matrix factorization trick with a small inner dimension $d^{\prime}<d$, reducing the total number of parameters. The figure compares SRU, SRU with this factorization trick (but without attention), and SRU++.
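Both claims, the fallback at $\alpha = 0$ and the parameter saving from the factorization, can be checked numerically. In the sketch below the sizes are hypothetical, and `A` is a random stand-in for the attention output, since its value does not affect the result when $\alpha = 0$.

```python
import numpy as np


def residual_projection(Q, A, W_o, alpha):
    """U^T = W^o (Q + alpha * A), with Q and A of shape (d', L) and W_o of shape (3d, d')."""
    return W_o @ (Q + alpha * A)


rng = np.random.default_rng(0)
L, d, d_attn = 6, 16, 4  # hypothetical sizes
X = rng.standard_normal((L, d))
W_q = rng.standard_normal((d_attn, d))
W_o = rng.standard_normal((3 * d, d_attn))
Q = W_q @ X.T
A = rng.standard_normal((d_attn, L))  # stand-in for the attention output

# With alpha = 0 the output is a purely linear function of X.
U_T = residual_projection(Q, A, W_o, alpha=0.0)
assert np.allclose(U_T, (W_o @ W_q) @ X.T)

# Factorized projection (3d x d' plus d' x d) versus a dense 3d x d matrix.
factorized = W_o.size + W_q.size
dense = 3 * d * d
print(factorized, dense)  # 256 768
```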

The last modification is adding [layer normalization](https://paperswithcode.com/method/layer-normalization) to each SRU++ layer. We apply normalization after the attention operation and before the matrix multiplication with $\mathbf{W}^{o}$:

$$
\mathbf{U}^{\top}=\mathbf{W}^{o} \operatorname{layernorm}(\mathbf{Q}+\alpha \cdot \mathbf{A})
$$

This implementation uses post-layer normalization, in which the normalization is applied after the residual connection.
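A minimal sketch of the post-layer-normalization variant is given below. It assumes that normalization runs over the $d^{\prime}$ features of each sequence position and omits the learnable gain and bias that layer normalization usually carries; both choices are assumptions made for brevity.

```python
import numpy as np


def layernorm(H, eps=1e-5):
    """Normalise each column of H (shape (d', L)) to zero mean and unit variance."""
    mean = H.mean(axis=0, keepdims=True)
    var = H.var(axis=0, keepdims=True)
    return (H - mean) / np.sqrt(var + eps)


def post_ln_projection(Q, A, W_o, alpha):
    """U^T = W^o layernorm(Q + alpha * A): normalise after the residual sum."""
    return W_o @ layernorm(Q + alpha * A)


rng = np.random.default_rng(0)
d, d_attn, L = 16, 8, 5  # hypothetical sizes
Q = rng.standard_normal((d_attn, L))
A = rng.standard_normal((d_attn, L))
W_o = rng.standard_normal((3 * d, d_attn))
U_T = post_ln_projection(Q, A, W_o, alpha=0.5)
print(U_T.shape)  # (48, 5)
```

Note that with normalization in place, the output at $\alpha = 0$ is $\mathbf{W}^{o} \operatorname{layernorm}(\mathbf{Q})$, which is no longer an exactly linear function of the input.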

Source	NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com