The **Synthesizer** is a model that learns synthetic attention weights without token-token interactions. Unlike [Transformers](https://paperswithcode.com/method/transformer), the model eschews not only dot product self-attention but content-based self-attention altogether. Synthesizer learns to synthesize the self-alignment matrix instead of manually computing pairwise dot products. It is transformation-based, relies only on simple feed-forward layers, and completely dispenses with dot products and explicit token-token interactions.

This new module employed by the Synthesizer is called "Synthetic Attention": a new way of learning to attend without explicitly attending (i.e., without dot product attention or [content-based attention](https://paperswithcode.com/method/content-based-attention)). Instead, Synthesizer generates the alignment matrix independently of token-token dependencies.
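The idea of synthesizing the alignment matrix with feed-forward layers can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, weight shapes, the ReLU hidden layer, and the single-head layout are all assumptions made for the example. The "dense" variant maps each token on its own to a row of alignment logits, and the "random" variant uses a learned matrix that does not depend on the input at all.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, Wv):
    # X has shape (L, d). Each token is passed independently through a
    # two-layer feed-forward network that outputs L alignment logits,
    # so no query-key dot product between tokens is ever computed.
    hidden = np.maximum(X @ W1 + b1, 0.0)   # (L, d_hidden)
    logits = hidden @ W2 + b2               # (L, L)
    weights = softmax(logits, axis=-1)      # each row sums to 1
    return weights @ (X @ Wv), weights

def random_synthesizer(X, R, Wv):
    # R is a trainable (L, L) matrix; the weights ignore X entirely.
    weights = softmax(R, axis=-1)
    return weights @ (X @ Wv), weights

# Hypothetical sizes and randomly initialized parameters for illustration.
rng = np.random.default_rng(0)
L, d, d_hidden = 6, 8, 16
X = rng.normal(size=(L, d))
W1 = rng.normal(size=(d, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, L)) * 0.1
b2 = np.zeros(L)
Wv = rng.normal(size=(d, d)) * 0.1
R = rng.normal(size=(L, L))

Y_dense, A_dense = dense_synthesizer(X, W1, b1, W2, b2, Wv)
Y_rand, A_rand = random_synthesizer(X, R, Wv)
```

In the dense sketch, row `i` of the weight matrix depends only on token `i`, which is one way to read "independent of token-token dependencies". One consequence of this layout is that `W2` fixes the sequence length `L`.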

**Colorization Transformer** is a probabilistic [colorization](https://paperswithcode.com/method/colorization) model composed only of [axial self-attention blocks](https://paperswithcode.com/method/axial). The main advantages of these blocks are the ability to capture a global receptive field with only two layers and a complexity of $\mathcal{O}(D\sqrt{D})$ instead of $\mathcal{O}(D^{2})$. To enable colorization of high-resolution grayscale images, the task is decomposed into three simpler sequential subtasks: coarse low-resolution autoregressive colorization, parallel color super-resolution, and parallel spatial super-resolution.

For coarse low resolution colorization, a conditional variant of [Axial Transformer](https://paperswithcode.com/method/axial) is applied. The authors leverage the semi-parallel sampling mechanism of Axial Transformers. Finally, fast parallel deterministic upsampling models are employed to super-resolve the coarsely colorized image into the final high resolution output.
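The complexity figures above can be checked with a small counting sketch. It assumes an image of `height * width = D` pixels, with one axial layer attending along rows and one along columns. The function names are hypothetical and the count covers attention score entries only, not the full cost of a layer.

```python
def full_attention_pairs(height, width):
    # Every pixel attends to every pixel: D * D score entries.
    d = height * width
    return d * d

def axial_attention_pairs(height, width):
    # Row layer: each pixel attends to the `width` pixels in its row.
    # Column layer: each pixel attends to the `height` pixels in its column.
    d = height * width
    return d * width + d * height

h = w = 64  # D = 4096, sqrt(D) = 64
print(full_attention_pairs(h, w))   # 16777216  (D^2)
print(axial_attention_pairs(h, w))  # 524288    (2 * D * sqrt(D))
```

For a square image the axial count is $2D\sqrt{D}$, which matches the $\mathcal{O}(D\sqrt{D})$ bound. Stacking a row layer and a column layer also lets information from any pixel reach any other pixel, which is the two-layer global receptive field mentioned above.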

**Cascade Mask R-CNN** extends [Cascade R-CNN](https://paperswithcode.com/method/cascade-r-cnn) to instance segmentation by adding a mask head to the cascade.

In [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), the segmentation branch is inserted in parallel to the detection branch. However, the Cascade [R-CNN](https://paperswithcode.com/method/r-cnn) has multiple detection branches. This raises two questions: 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question by adding a single mask prediction head at either the first or the last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number differs between these two strategies. Placing the segmentation head later in the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as it is for object detection, which is a patch-based operation. The third strategy addresses the second question by adding a segmentation branch to each cascade stage. This maximizes the diversity of samples used to learn the mask prediction task.

At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.
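The three placement strategies can be summarized in a short sketch. The strategy labels (`"first"`, `"last"`, `"all"`), the function names, and the default of three cascade stages are assumptions made for illustration, not identifiers from the paper or from any detection library.

```python
def mask_head_stages(strategy, num_stages=3):
    # Return the 0-based cascade stages that get a mask head during training.
    if strategy == "first":
        return [0]
    if strategy == "last":
        return [num_stages - 1]
    if strategy == "all":
        return list(range(num_stages))
    raise ValueError(f"unknown strategy: {strategy!r}")

def inference_box_stage(num_stages=3):
    # At inference, masks are always predicted on the boxes produced by
    # the final detection stage, whatever the training-time placement.
    return num_stages - 1

for name in ("first", "last", "all"):
    print(name, mask_head_stages(name), inference_box_stage())
```

The sketch only records where mask heads sit; it does not model how the number of positive training instances changes from stage to stage.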

Source	Synthesizer: Rethinking Self-Attention in Transformer Models
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Synthesizer?
