
What is: Adaptive Span Transformer?

Source: Adaptive Attention Span in Transformers
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

The Adaptive Attention Span Transformer is a Transformer that utilises an improvement to the self-attention layer, called adaptive masking, which lets the model learn its own context size. The result is a network in which each attention layer gathers information over its own context, allowing the model to scale to input sequences of more than 8k tokens.

Their proposal is based on the observation that, with the dense attention of a traditional Transformer, every attention head shares the same attention span S, attending over the full context. In practice, however, many attention heads specialize to a more local context, while others attend over the longer sequence. This motivates a variant of self-attention that allows each head to learn its own context size (adaptive masking, sketched below).
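Concretely, the paper defines a soft masking function m_z(x) = min(max((R + z - x) / R, 0), 1), where x is the distance to the attended position, z is a learnable span parameter, and R is a hyperparameter controlling the softness of the ramp. The mask multiplies the attention weights, so positions beyond a head's learned span contribute nothing. The PyTorch sketch below is a minimal illustration of this idea under stated assumptions: the module name AdaptiveSpanMask, the per-head parameterisation of z as a fraction of the maximum span, and the tensor shapes are choices made for the example, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask m_z(x) = clamp((R + z - x) / R, 0, 1).

    x is the distance to the attended token, z a learnable span
    (one per head here), and R a ramp hyperparameter. Names and
    shapes are illustrative assumptions, not the reference code.
    """

    def __init__(self, n_heads: int, max_span: int, ramp: float = 32.0):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # Learnable span per head, stored as a fraction of max_span.
        self.span_frac = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (n_heads, query_len, max_span) attention weights over the
        # past context, oldest position first, most recent position last.
        z = self.span_frac.clamp(0, 1) * self.max_span
        # Distance x from the query: max_span for the oldest slot, 1 for the newest.
        x = torch.arange(self.max_span, 0, -1, device=attn.device, dtype=attn.dtype)
        mask = ((self.ramp + z - x) / self.ramp).clamp(0, 1)  # (n_heads, 1, max_span)
        masked = attn * mask
        # Renormalise so each head's weights still sum to 1.
        return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)

# Usage sketch: mask the softmaxed attention weights of 8 heads.
span_mask = AdaptiveSpanMask(n_heads=8, max_span=1024)
attn = torch.softmax(torch.randn(8, 16, 1024), dim=-1)
out = span_mask(attn)  # weights beyond each head's learned span are zeroed
```

During training the paper adds an L1 penalty on the learned spans to the loss, so each head only pays for the context it actually uses; heads with small z can then skip attending over distant tokens entirely, which is where the memory and compute savings that enable 8k+ token contexts come from.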