What is: Neighborhood Attention?
Source | Neighborhood Attention Transformer |
Year | 2022 |
Data Source | CC BY-SA - https://paperswithcode.com |
Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. It was proposed in the Neighborhood Attention Transformer paper as an alternative to other local attention mechanisms used in hierarchical vision transformers.
NA is similar in concept to Stand-Alone Self-Attention (SASA), in that both can be implemented as a raster-scan sliding-window operation over key-value pairs. However, NA requires a modification to handle corner pixels: instead of zero-padding the window at image boundaries, the window is shifted inward so that every query still attends to the same number of neighbors. This keeps the receptive field size fixed for every query and increases the number of possible relative positions between a query and its keys.
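To make the corner handling concrete, below is a minimal, single-head PyTorch sketch of 2D neighborhood attention, written for clarity rather than speed. The function name, the (B, H, W, C) layout, and the omission of relative positional bias and multiple heads are my own simplifications, not NATTEN's API or the paper's exact formulation.

```python
import torch

def neighborhood_attention_2d(q, k, v, kernel_size=7):
    """q, k, v: (B, H, W, C), with H, W >= kernel_size.

    Each query attends to a kernel_size x kernel_size neighborhood. Near the
    borders the window is shifted inward (clamped) rather than zero-padded, so
    every query sees exactly kernel_size**2 keys -- the fixed receptive field
    that distinguishes NA's corner handling from SASA's.
    """
    B, H, W, C = q.shape
    ks, half = kernel_size, kernel_size // 2

    # Top-left corner of each query's window, clamped to stay in bounds.
    y0 = (torch.arange(H) - half).clamp(0, H - ks)   # (H,)
    x0 = (torch.arange(W) - half).clamp(0, W - ks)   # (W,)

    # Absolute row/column indices of every neighbor for every query position.
    ny = y0[:, None] + torch.arange(ks)              # (H, ks)
    nx = x0[:, None] + torch.arange(ks)              # (W, ks)

    def gather(t):
        # (B, H, W, C) -> (B, H, W, ks*ks, C): the neighborhood of each query.
        t = t[:, ny]                  # (B, H, ks, W, C)
        t = t[:, :, :, nx]            # (B, H, ks, W, ks, C)
        return t.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, ks * ks, C)

    k_nb, v_nb = gather(k), gather(v)

    # Scaled dot-product attention restricted to each query's neighborhood.
    scores = torch.einsum('bhwc,bhwnc->bhwn', q, k_nb) / C ** 0.5
    attn = scores.softmax(dim=-1)
    return torch.einsum('bhwn,bhwnc->bhwc', attn, v_nb)


# Example: a 14x14 feature map with a 7x7 neighborhood per query.
x = torch.randn(2, 14, 14, 64)
out = neighborhood_attention_2d(x, x, x, kernel_size=7)
print(out.shape)  # torch.Size([2, 14, 14, 64])
```

Note that this sketch materializes every query's neighborhood explicitly, which is exactly the memory-hungry approach the next paragraph describes as intractable at scale.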
The primary challenge in experimenting with both NA and SASA has been computation. Naively extracting the key-value pairs for each query is slow, consumes a large amount of memory, and eventually becomes intractable at scale. NA was therefore implemented through a new CUDA extension to PyTorch, NATTEN.
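For reference, a usage sketch of NATTEN's module-based interface is shown below. The class and argument names here reflect one version of the library and may differ across NATTEN releases, so treat this as an assumption and consult the NATTEN documentation for the installed version.

```python
import torch
from natten import NeighborhoodAttention2D  # pip install natten

# Multi-head neighborhood attention over a channels-last feature map.
# Argument names (dim, num_heads, kernel_size) are assumed from one NATTEN release.
na = NeighborhoodAttention2D(dim=128, num_heads=4, kernel_size=7)

x = torch.randn(2, 28, 28, 128)   # (batch, height, width, channels)
y = na(x)                         # output has the same shape as the input
print(y.shape)                    # torch.Size([2, 28, 28, 128])
```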