
What is: Multi-Head Linear Attention?

Source: Linformer: Self-Attention with Linear Complexity
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Multi-Head Linear Attention is a type of linear multi-head self-attention module, proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_i, F_i \in \mathbb{R}^{n \times k}$ when computing the key and value. We first project the original $(n \times d)$-dimensional key and value layers $KW_i^K$ and $VW_i^V$ into $(k \times d)$-dimensional projected key and value layers. We then compute an $(n \times k)$-dimensional context mapping $\bar{P}$ using scaled dot-product attention:

$$\bar{\text{head}_i} = \text{Attention}\left(QW_i^Q,\; E_i K W_i^K,\; F_i V W_i^V\right)$$

$$\bar{\text{head}_i} = \text{softmax}\left(\frac{QW_i^Q \left(E_i K W_i^K\right)^T}{\sqrt{d_k}}\right) \cdot F_i V W_i^V$$
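As a concrete illustration, here is a minimal single-head sketch in PyTorch. The tensor names, shapes, and random initialisation are assumptions made for this example (not taken from the Linformer reference implementation); `E` and `F_p` are stored as $(k \times n)$ matrices so they can left-multiply the $n$-row key and value layers.

```python
import torch
import torch.nn.functional as F

n, d_model, d_k, k = 128, 64, 64, 32     # sequence length, model dim, head dim, projected length

X   = torch.randn(n, d_model)            # input sequence
W_Q = torch.randn(d_model, d_k)          # per-head query weights W_i^Q
W_K = torch.randn(d_model, d_k)          # per-head key weights   W_i^K
W_V = torch.randn(d_model, d_k)          # per-head value weights W_i^V
E   = torch.randn(k, n)                  # plays the role of E_i: maps n key rows down to k
F_p = torch.randn(k, n)                  # plays the role of F_i: maps n value rows down to k

Q = X @ W_Q                              # (n, d_k)  query layer Q W_i^Q
K = E @ (X @ W_K)                        # (k, d_k)  projected key layer E_i K W_i^K
V = F_p @ (X @ W_V)                      # (k, d_k)  projected value layer F_i V W_i^V

P_bar = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # (n, k) context mapping
head  = P_bar @ V                                 # (n, d_k) context embeddings for this head
```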

Finally, we compute context embeddings for each head using $\bar{P} \cdot \left(F_i V W_i^V\right)$.
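Putting the steps together, a hedged sketch of a full multi-head module might look as follows. The module and parameter names (`LinearMultiHeadAttention`, `seq_len`, the shared-across-heads `E`/`F` projections) are assumptions for this sketch, not the paper's reference code.

```python
import torch
import torch.nn as nn

class LinearMultiHeadAttention(nn.Module):
    """Sketch of Linformer-style multi-head linear self-attention."""

    def __init__(self, d_model: int, num_heads: int, seq_len: int, k: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        # E and F map the n sequence positions down to k; shared across heads here.
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)

    def forward(self, x):                                  # x: (batch, n, d_model), n == seq_len
        b, n, _ = x.shape
        def split(t):                                      # -> (batch, heads, n, d_k)
            return t.view(b, n, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        K = torch.einsum('kn,bhnd->bhkd', self.E, K)       # projected keys:   (b, h, k, d_k)
        V = torch.einsum('kn,bhnd->bhkd', self.F, V)       # projected values: (b, h, k, d_k)
        P_bar = torch.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # (b, h, n, k)
        heads = P_bar @ V                                  # (b, h, n, d_k) context embeddings
        out = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.W_O(out)

# Usage example (shapes are illustrative):
attn = LinearMultiHeadAttention(d_model=64, num_heads=4, seq_len=128, k=32)
y = attn(torch.randn(2, 128, 64))                          # -> (2, 128, 64)
```

Because $\bar{P}$ has shape $(n \times k)$ with a fixed $k \ll n$, the attention step costs $O(nk)$ in time and memory rather than the $O(n^2)$ of standard multi-head self-attention.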