
What is: Multiplicative Attention?

Source: Deep Learning for NLP Best Practices by Sebastian Ruder
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Multiplicative Attention is an attention mechanism where the alignment score function is calculated as:

$$f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right) = \mathbf{h}_{i}^{T}\mathbf{W}_{a}\mathbf{s}_{j}$$

Here $\mathbf{h}$ refers to the hidden states of the encoder (source), and $\mathbf{s}$ to the hidden states of the decoder (target). The function above is thus a type of alignment score function. A matrix of alignment scores can be used to visualize the correlation between source and target words. Within a neural network, once we have the alignment scores, we compute the final attention weights by applying a softmax over them (ensuring they sum to 1).
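To make this concrete, here is a minimal NumPy sketch of multiplicative attention for a single decoder state. The function name `multiplicative_attention` and the toy shapes are illustrative assumptions, not from the source:

```python
import numpy as np

def multiplicative_attention(H, s_j, W_a):
    """Multiplicative attention for one decoder state.

    H   : (n_src, d_h) encoder hidden states h_i
    s_j : (d_s,)       a single decoder hidden state
    W_a : (d_h, d_s)   learned weight matrix
    """
    # Alignment scores f_att(h_i, s_j) = h_i^T W_a s_j, for all i at once.
    scores = H @ W_a @ s_j                  # (n_src,)
    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder states.
    context = weights @ H                   # (d_h,)
    return weights, context

# Toy usage: 5 source positions, encoder dim 8, decoder dim 6.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
s_j = rng.standard_normal(6)
W_a = 0.1 * rng.standard_normal((8, 6))
weights, context = multiplicative_attention(H, s_j, W_a)
print(weights.sum())  # 1.0
```

Note that the single matrix product `H @ W_a @ s_j` scores every source position at once, which is the efficiency advantage discussed below.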

Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice, as it can be implemented more efficiently using matrix multiplication. Both variants perform similarly for small dimensionality $d_{h}$ of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale $f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right)$ by $1/\sqrt{d_{h}}$, as with scaled dot-product attention.
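A short sketch of why this scaling helps, under the simplifying assumption of unit-variance hidden-state components (not from the source): the standard deviation of a raw dot product grows roughly as $\sqrt{d_{h}}$, pushing the softmax into saturated, small-gradient regions, while dividing by $\sqrt{d_{h}}$ keeps the score scale roughly constant across dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_h in (16, 256, 4096):
    # Unit-variance random vectors standing in for h_i and s_j.
    h = rng.standard_normal((10_000, d_h))
    s = rng.standard_normal((10_000, d_h))
    raw = np.einsum("nd,nd->n", h, s)   # unscaled scores: std grows ~ sqrt(d_h)
    scaled = raw / np.sqrt(d_h)         # scaled scores:   std stays ~ 1
    print(f"d_h={d_h:5d}  raw std={raw.std():7.1f}  scaled std={scaled.std():.2f}")
```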