
What is: Sliding Window Attention?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Sliding Window Attention is an attention pattern for attention-based models. It was proposed as part of the Longformer architecture. It is motivated by the fact that the non-sparse self-attention in the original Transformer formulation has $O(n^2)$ time and memory complexity, where $n$ is the input sequence length, and thus does not scale efficiently to long inputs. Given the importance of local context, the sliding window attention pattern employs a fixed-size attention window surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input.
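To make the pattern concrete, here is a minimal PyTorch sketch (function names are my own, not from the Longformer codebase) that builds a band-shaped mask and applies it inside standard softmax attention. Note that it still materializes the full $n \times n$ score matrix and only masks it, so it illustrates the attention pattern rather than the memory savings, which require a banded implementation such as Longformer's custom kernels.

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

def windowed_attention(q, k, v, window: int):
    """Naive masked attention: O(n^2) compute, but only in-window scores survive."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (n, n) scores
    mask = sliding_window_mask(q.shape[-2], window)
    scores = scores.masked_fill(~mask, float("-inf"))        # block out-of-window pairs
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 16 tokens, dimension 8, window of 4 (2 tokens on each side).
n, d, w = 16, 8, 4
q = k = v = torch.randn(n, d)
out = windowed_attention(q, k, v, w)
print(out.shape)  # torch.Size([16, 8])
```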

More formally, in this attention pattern, given a fixed window size $w$, each token attends to $\frac{1}{2}w$ tokens on each side. The computational complexity of this pattern is $O(n \times w)$, which scales linearly with the input sequence length $n$. To make this attention pattern efficient, $w$ should be small compared with $n$. However, a model with multiple stacked layers of such windowed attention still has a large receptive field. This is analogous to CNNs, where stacking layers of small kernels yields high-level features built from a large portion of the input (the receptive field).
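The linear scaling can be made concrete with a sketch that scores only the roughly $w$ keys around each query, so the score matrix has shape $n \times (w + 1)$ rather than $n \times n$. This is an illustrative implementation under my own naming, not the optimized banded kernels used in the actual Longformer code; for in-window positions it produces the same output as the masked formulation above.

```python
import torch
import torch.nn.functional as F

def banded_attention(q, k, v, window: int):
    """Sliding-window attention that scores only the local keys around each
    query, giving O(n * w) time/memory instead of O(n^2)."""
    n, d = q.shape
    half = window // 2
    # Pad keys/values so every query sees exactly (2*half + 1) candidate slots.
    k_pad = F.pad(k, (0, 0, half, half))               # (n + 2*half, d)
    v_pad = F.pad(v, (0, 0, half, half))
    # unfold -> (n, d, 2*half + 1): the local key/value window for each query.
    k_win = k_pad.unfold(0, 2 * half + 1, 1)
    v_win = v_pad.unfold(0, 2 * half + 1, 1)
    scores = torch.einsum("nd,ndw->nw", q, k_win) / d ** 0.5
    # Mask slots that fall outside the sequence (they came from the padding).
    pos = torch.arange(n)[:, None] + torch.arange(2 * half + 1)[None, :] - half
    scores = scores.masked_fill((pos < 0) | (pos >= n), float("-inf"))
    probs = F.softmax(scores, dim=-1)                   # (n, 2*half + 1)
    return torch.einsum("nw,ndw->nd", probs, v_win)     # (n, d)
```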

In this case, with a transformer of $l$ layers, the receptive field size is $l \times w$ (assuming $w$ is fixed for all layers). Depending on the application, it might be helpful to use different values of $w$ for each layer to balance efficiency against model representation capacity.
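As a back-of-the-envelope check (the numbers below are illustrative, not configurations from the paper), the receptive field of a stack of sliding-window layers is the sum of the per-layer window sizes, which reduces to $l \times w$ when every layer uses the same window:

```python
def receptive_field(window_sizes):
    """Receptive field of stacked sliding-window layers: the sum of the
    per-layer window sizes (equal to l * w when all layers share window w)."""
    return sum(window_sizes)

# Same window at every layer: 12 layers with w = 512 -> 12 * 512 = 6144 tokens.
print(receptive_field([512] * 12))                      # 6144
# Smaller windows in lower layers, larger ones higher up (efficiency trade-off).
print(receptive_field([64, 64, 128, 128, 256, 512]))    # 1152
```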