What is: Multi-Query Attention?
Source | Fast Transformer Decoding: One Write-Head is All You Need |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Multi-head attention consists of multiple attention layers (heads) run in parallel, each with its own linear transformations of the queries, keys, values, and outputs. Multi-query attention is identical except that all heads share a single set of keys and values, keeping only the queries (and output projections) per-head. This shrinks the key/value tensors that must be loaded at each decoding step, which speeds up incremental transformer decoding.
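Below is a minimal PyTorch sketch of the idea, written in the einsum style used by the paper's pseudocode. The function name, weight names (`wq`, `wk`, `wv`, `wo`), and tensor shapes are illustrative assumptions, not the paper's exact code: queries and output projections are per-head, while a single key projection and a single value projection are shared across all heads.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, wq, wk, wv, wo):
    """Illustrative multi-query attention (names and shapes are assumptions).

    x:  (batch, seq, d_model)      input sequence
    wq: (heads, d_model, d_head)   per-head query projections
    wk: (d_model, d_head)          single key projection shared by all heads
    wv: (d_model, d_head)          single value projection shared by all heads
    wo: (heads, d_head, d_model)   per-head output projections
    """
    q = torch.einsum("bnd,hdk->bhnk", x, wq)   # per-head queries
    k = torch.einsum("bnd,dk->bnk", x, wk)     # one shared set of keys
    v = torch.einsum("bnd,dk->bnk", x, wv)     # one shared set of values

    # Every head attends against the same keys/values.
    scores = torch.einsum("bhnk,bmk->bhnm", q, k) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    ctx = torch.einsum("bhnm,bmk->bhnk", weights, v)   # per-head contexts
    return torch.einsum("bhnk,hkd->bnd", ctx, wo)      # sum of head outputs

# Example: batch of 2 sequences of length 5, model width 32, 4 heads of size 8.
b, n, d_model, h, d_head = 2, 5, 32, 4, 8
x = torch.randn(b, n, d_model)
wq = torch.randn(h, d_model, d_head)
wk = torch.randn(d_model, d_head)
wv = torch.randn(d_model, d_head)
wo = torch.randn(h, d_head, d_model)
print(multi_query_attention(x, wq, wk, wv, wo).shape)  # torch.Size([2, 5, 32])
```

Compared with standard multi-head attention, the only change is that `wk` and `wv` drop their head dimension, so during autoregressive decoding the cached keys and values are a factor of `heads` smaller.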