What is: Adaptive Span Transformer?
| Source | Adaptive Attention Span in Transformers |
| Year | 2019 |
| Data Source | CC BY-SA - https://paperswithcode.com |
The Adaptive Attention Span Transformer is a Transformer that uses an improvement to the self-attention layer, called adaptive masking, which allows the model to learn its own context size. The result is a network in which each attention head gathers information over its own context, allowing the model to scale to input sequences of more than 8k tokens.
The proposal is based on the observation that, with the dense attention of a standard Transformer, every attention head has the same attention span and attends over the full context. In practice, however, many heads specialize in local context, while only a few attend to the longer sequence. This motivates a variant of self-attention that lets each head choose its own context size (adaptive masking - see components).
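Below is a minimal sketch of the soft masking idea behind adaptive masking, assuming PyTorch and attention weights already computed over a fixed maximum span; the class name `AdaptiveSpanMask` and parameters such as `ramp_size` are illustrative rather than the paper's exact API. Each head learns a span z, and weights for keys farther than z tokens behind the query are ramped down to zero over a window of width R before renormalising.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Illustrative soft mask: m_z(x) = clamp((R + z - x) / R, 0, 1),
    where x is the distance from the query to a key, z is a learnable
    span per head, and R controls how softly the mask ramps to zero."""

    def __init__(self, num_heads: int, max_span: int,
                 ramp_size: int = 32, init_ratio: float = 0.5):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # one learnable span per head, stored as a fraction of max_span
        self.span_ratio = nn.Parameter(torch.full((num_heads, 1, 1), init_ratio))

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (batch, num_heads, query_len, span); the last axis
        # indexes past positions, with index span-1 being the current token
        span = attn_weights.size(-1)
        distance = torch.arange(span - 1, -1, -1,
                                device=attn_weights.device,
                                dtype=attn_weights.dtype)
        z = self.span_ratio.clamp(0, 1) * self.max_span
        mask = ((self.ramp_size + z - distance) / self.ramp_size).clamp(0, 1)
        masked = attn_weights * mask
        # renormalise so each query's attention weights still sum to 1
        return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)
```

In the paper, an L1 penalty on the learned spans is added to the training loss, so each head keeps its span as short as the task allows and the overall compute and memory cost stays low.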