What is: Adaptive Masking?
Source | Adaptive Attention Span in Transformers |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Adaptive Masking is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in Multi-Head Attention, a masking function is added to control the span of the attention. A masking function is a non-increasing function that maps a distance to a value in $[0, 1]$. Adaptive masking takes the following soft masking function $m_z$, parametrized by a real value $z$ in $[0, S]$:

$$ m_z(x) = \min\left[\max\left[\frac{1}{R}\left(R + z - x\right), 0\right], 1\right] $$
where $R$ is a hyper-parameter that controls its softness. This piecewise-linear function of the distance equals 1 up to distance $z$, ramps down linearly over the next $R$ positions, and is 0 beyond $z + R$. This soft masking function is inspired by Jernite et al. (2017). The attention weights $a_{tr}$ are then computed on the masked span:

$$ a_{tr} = \frac{m_z(t - r)\exp(s_{tr})}{\sum_{q = t - S}^{t - 1} m_z(t - q)\exp(s_{tq})} $$

where $s_{tr}$ is the similarity score between the current position $t$ and a past position $r$.
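The two formulas above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names and the convention that `scores` is ordered from the oldest past position (distance $S$) to the most recent (distance 1) are assumptions for the example.

```python
import numpy as np

def soft_mask(x, z, R):
    """Soft masking function m_z(x) = min(max((R + z - x) / R, 0), 1).

    x -- distance(s) between the current and an attended position
    z -- learned span parameter, a real value in [0, S]
    R -- hyper-parameter controlling the softness of the ramp
    """
    return np.clip((R + z - x) / R, 0.0, 1.0)

def masked_attention_weights(scores, z, R):
    """Attention weights over the masked span: each similarity score
    s_tr is re-weighted by m_z(t - r) before normalization.

    scores -- pre-softmax similarities s_tr for r = t-S .. t-1,
              ordered from oldest (distance S) to newest (distance 1)
    """
    S = len(scores)
    distances = np.arange(S, 0, -1)   # t - r for each past position r
    m = soft_mask(distances, z, R)
    w = m * np.exp(scores)
    return w / w.sum()
```

For example, with `z = 3` and `R = 2`, positions at distance 1-3 are fully attended, distance 4 is half-masked, and everything beyond distance 5 receives zero weight regardless of its score.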
A penalization on the parameters $z_i$ for each attention head $i$ of the model is added to the loss function:

$$ L = -\log P(w_1, \ldots, w_T) + \frac{\lambda}{M} \sum_i z_i $$
where $\lambda$ is the regularization hyperparameter, and $M$ is the number of heads in each layer. This formulation is differentiable in the parameters $z_i$, which are learnt jointly with the rest of the model.
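The penalty term can be sketched as a small helper that would be added to the task loss; this is an illustrative assumption of how the term is computed, not the authors' code.

```python
import numpy as np

def span_penalty(spans, lam):
    """Regularization term (lambda / M) * sum_i z_i.

    It is linear (hence differentiable) in the span parameters z_i,
    so gradient descent can shrink each head's span jointly with the
    rest of the model.

    spans -- per-head span parameters z_i for the M heads of a layer
    lam   -- regularization hyper-parameter lambda
    """
    spans = np.asarray(spans, dtype=float)
    M = spans.size
    return lam / M * spans.sum()
```

In training, this value is simply added to the negative log-likelihood, so heads pay a cost proportional to the average span they use.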