What is: Global and Sliding Window Attention?
Source | Longformer: The Long-Document Transformer |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
Global and Sliding Window Attention is an attention pattern for attention-based models. It is motivated by the fact that the non-sparse self-attention in the original Transformer formulation has $O(n^2)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs.
Since windowed and dilated attention patterns are not flexible enough to learn task-specific representations, the authors of the Longformer add "global attention" at a few pre-selected input locations. This attention operation is symmetric: a token with global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The figure to the right shows an example of sliding window attention with global attention at a few tokens at custom locations. For classification, for example, global attention is used on the [CLS] token, while for question answering, global attention is placed on all question tokens.
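To make the pattern concrete, here is a minimal NumPy sketch that builds a combined sliding-window plus global attention mask over a dense $n \times n$ score matrix. Note this is only an illustration of the attention pattern, not the Longformer implementation, which uses separate projections for global attention and a banded computation to keep complexity linear in sequence length; the function names, window size, and global index below are illustrative choices.

```python
import numpy as np

def sliding_window_global_mask(seq_len, window, global_idx):
    """Boolean mask (True = attention allowed): a symmetric sliding window
    of `window` tokens combined with globally attending positions that
    attend to, and are attended by, every token in the sequence."""
    half = window // 2
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = np.abs(i - j) <= half      # local sliding-window band
    mask[global_idx, :] = True        # global tokens attend to all tokens
    mask[:, global_idx] = True        # all tokens attend to global tokens
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed positions set to -inf."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: 12 tokens, window of 4, global attention on a [CLS]-style token at index 0.
n, d, w = 12, 8, 4
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
mask = sliding_window_global_mask(n, w, global_idx=[0])
out = masked_attention(q, k, v, mask)
print(mask.astype(int))   # band of local attention plus a dense first row and column
print(out.shape)          # (12, 8)
```

For a question-answering setup, the same sketch would pass the indices of all question tokens as `global_idx` instead of just the [CLS] position.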