A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks helping a network learn a varied set of low level and high level features. The original motivation of using Grouped Convolutions in [AlexNet](https://paperswithcode.com/method/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as [ResNeXt](https://paperswithcode.com/method/resnext), it was shown this module could be used to improve classification accuracy. Specifically by exposing a new dimension through grouped convolutions, *cardinality* (the size of set of transformations), we can increase accuracy by increasing it.

**PrIme Sample Attention (PISA)** directs the training of object detection frameworks towards prime samples. These are samples that play a key role in driving the detection performance. The authors define Hierarchical Local Rank (HLR) as a metric of importance. Specifically, they use IoU-HLR to rank positive samples and ScoreHLR to rank negative samples in each mini-batch. This ranking strategy places the positive samples with highest IoUs around each object and the negative samples with highest scores in each cluster to the top of the ranked list and directs the focus of the training process to them via a simple re-weighting scheme. The authors also devise a classification-aware regression loss to jointly optimize the classification and regression branches. Particularly, this loss would suppress those samples with large regression loss, thus reinforcing the attention to prime samples.

PISA

Prime Sample Attention in Object Detection

Grouped Convolution

ImageNet Classification with Deep Convolutional Neural Networks

**Deep Voice 3 (DV3)** is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:

- Encoder: A fully-convolutional encoder, which converts textual features to an internal
learned representation.

- Decoder: A fully-convolutional causal decoder, which decodes the learned representation
with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.

- Converter: A fully-convolutional post-processing network, which predicts final vocoder
parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the
decoder, the converter is non-causal and can thus depend on future context information.

The overall objective function to be optimized is a linear combination of the losses from the decoder and the converter. The authors separate decoder and converter and apply multi-task training, because it makes attention learning easier in practice. To be specific, the loss for mel-spectrogram prediction guides training of the attention mechanism, because the attention is trained with the gradients from mel-spectrogram prediction besides vocoder parameter prediction.

Source	ImageNet Classification with Deep Convolutional Neural Networks
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Grouped Convolution?

Viet-Anh on Software