
What is: Sliding Window Attention?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Sliding Window Attention is an attention pattern for attention-based models. It was proposed as part of the Longformer architecture. It is motivated by the fact that the non-sparse self-attention in the original Transformer formulation has $O(n^2)$ time and memory complexity, where $n$ is the input sequence length, and thus does not scale efficiently to long inputs. Given the importance of local context, the sliding window attention pattern employs a fixed-size attention window surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input.
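To make the pattern concrete, here is a minimal PyTorch sketch (function names are my own, not from the Longformer codebase) that builds a band-shaped mask and applies it inside standard softmax attention. Note that it still materializes the full $n \times n$ score matrix and only masks it, so it illustrates the attention pattern rather than the memory savings, which require a banded implementation such as Longformer's custom kernels.

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

def windowed_attention(q, k, v, window: int):
    """Naive masked attention: O(n^2) compute, but only in-window scores survive."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (n, n) scores
    mask = sliding_window_mask(q.shape[-2], window)
    scores = scores.masked_fill(~mask, float("-inf"))        # block out-of-window pairs
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 16 tokens, dimension 8, window of 4 (2 tokens on each side).
n, d, w = 16, 8, 4
q = k = v = torch.randn(n, d)
out = windowed_attention(q, k, v, w)
print(out.shape)  # torch.Size([16, 8])
```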

More formally, in this attention pattern, given a fixed window size $w$, each token attends to $\frac{1}{2}w$ tokens on each side. The computational complexity of this pattern is $O(n \times w)$, which scales linearly with the input sequence length $n$. To make this attention pattern efficient, $w$ should be small compared with $n$. However, a model with multiple stacked layers of such windowed attention still has a large receptive field. This is analogous to CNNs, where stacking layers of small kernels yields high-level features built from a large portion of the input (the receptive field).
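The linear scaling can be made concrete with a sketch that scores only the roughly $w$ keys around each query, so the score matrix has shape $n \times (w + 1)$ rather than $n \times n$. This is an illustrative implementation under my own naming, not the optimized banded kernels used in the actual Longformer code; for in-window positions it produces the same output as the masked formulation above.

```python
import torch
import torch.nn.functional as F

def banded_attention(q, k, v, window: int):
    """Sliding-window attention that scores only the local keys around each
    query, giving O(n * w) time/memory instead of O(n^2)."""
    n, d = q.shape
    half = window // 2
    # Pad keys/values so every query sees exactly (2*half + 1) candidate slots.
    k_pad = F.pad(k, (0, 0, half, half))               # (n + 2*half, d)
    v_pad = F.pad(v, (0, 0, half, half))
    # unfold -> (n, d, 2*half + 1): the local key/value window for each query.
    k_win = k_pad.unfold(0, 2 * half + 1, 1)
    v_win = v_pad.unfold(0, 2 * half + 1, 1)
    scores = torch.einsum("nd,ndw->nw", q, k_win) / d ** 0.5
    # Mask slots that fall outside the sequence (they came from the padding).
    pos = torch.arange(n)[:, None] + torch.arange(2 * half + 1)[None, :] - half
    scores = scores.masked_fill((pos < 0) | (pos >= n), float("-inf"))
    probs = F.softmax(scores, dim=-1)                   # (n, 2*half + 1)
    return torch.einsum("nw,ndw->nd", probs, v_win)     # (n, d)
```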

In this case, with a transformer of $l$ layers, the receptive field size is $l \times w$ (assuming $w$ is fixed for all layers). Depending on the application, it might be helpful to use different values of $w$ for each layer to balance efficiency against model representation capacity.
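As a back-of-the-envelope check (the numbers below are illustrative, not configurations from the paper), the receptive field of a stack of sliding-window layers is the sum of the per-layer window sizes, which reduces to $l \times w$ when every layer uses the same window:

```python
def receptive_field(window_sizes):
    """Receptive field of stacked sliding-window layers: the sum of the
    per-layer window sizes (equal to l * w when all layers share window w)."""
    return sum(window_sizes)

# Same window at every layer: 12 layers with w = 512 -> 12 * 512 = 6144 tokens.
print(receptive_field([512] * 12))                      # 6144
# Smaller windows in lower layers, larger ones higher up (efficiency trade-off).
print(receptive_field([64, 64, 128, 128, 256, 512]))    # 1152
```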