
What is: Dilated Sliding Window Attention?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Dilated Sliding Window Attention is an attention pattern for attention-based models. It was proposed as part of the Longformer architecture. It is motivated by the fact that the non-sparse attention of the original Transformer formulation has a self-attention component with $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs.

Compared to a (non-dilated) Sliding Window Attention pattern, we can further increase the receptive field without increasing computation by making the sliding window "dilated". This is analogous to dilated CNNs, where the window has gaps of size dilation $d$. Assuming a fixed dilation $d$ and window size $w$ across all $l$ layers, the receptive field is $l \times d \times w$, which can reach tens of thousands of tokens even for small values of $d$; for example, $l = 12$ layers with $w = 512$ and $d = 2$ already give a receptive field of $12 \times 2 \times 512 = 12{,}288$ tokens.
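
As a concrete illustration, here is a minimal NumPy sketch that builds a dilated sliding-window attention mask and applies it with standard masked scaled dot-product attention. The function names (`dilated_sliding_window_mask`, `masked_attention`) and the indexing convention (each query attends to the positions at offsets $k \times d$ within its window) are illustrative assumptions, not the Longformer implementation, which uses custom banded-matrix CUDA kernels so that the full $n \times n$ score matrix is never materialized.

```python
import numpy as np


def dilated_sliding_window_mask(seq_len, window, dilation):
    """Boolean mask: query i may attend to keys i + k*dilation for
    k in [-window//2, ..., window//2] (a hypothetical indexing convention)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < seq_len:
                mask[i, j] = True
    return mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the allowed positions.
    Note: this dense version is still O(n^2); the efficiency gain requires
    a banded/sparse implementation of the same pattern."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)          # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v


# Tiny usage example with random projections.
rng = np.random.default_rng(0)
n, d_model = 16, 8
q = rng.standard_normal((n, d_model))
k = rng.standard_normal((n, d_model))
v = rng.standard_normal((n, d_model))
mask = dilated_sliding_window_mask(n, window=4, dilation=2)
out = masked_attention(q, k, v, mask)
print(out.shape)  # (16, 8)
```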