What is: Multiplicative Attention?
Source | Deep Learning for NLP Best Practices by Sebastian Ruder |
Year | 2017 |
Data Source | CC BY-SA - https://paperswithcode.com |
Multiplicative Attention is an attention mechanism where the alignment score function is calculated as:

$$f_{att}(h_i, s_j) = h_i^{\top}\mathbf{W}_a s_j$$
Here $h_i$ refers to the hidden states of the encoder/source, $s_j$ to the hidden states of the decoder/target, and $\mathbf{W}_a$ is a learned weight matrix. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows. Within a neural network, once we have the alignment scores, we compute the final attention weights by applying a softmax to them (ensuring they sum to 1).
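The two steps above (bilinear alignment scores followed by a softmax) can be sketched in NumPy as follows; the function name, dimensions, and random inputs are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def multiplicative_attention(H, s, W):
    """Multiplicative (bilinear) attention weights for one decoder state.

    H: (T, d_h) encoder hidden states h_i
    s: (d_s,)   a single decoder hidden state s_j
    W: (d_h, d_s) learned weight matrix W_a
    Returns: (T,) attention weights that sum to 1.
    """
    scores = H @ W @ s                 # alignment scores h_i^T W_a s_j
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

# Toy example with illustrative sizes.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))  # 5 source positions, encoder size 8
s = rng.normal(size=4)       # decoder size 4
W = rng.normal(size=(8, 4))
weights = multiplicative_attention(H, s, W)
print(weights.sum())  # softmax output sums to 1
```

Because the scores for all source positions reduce to a single matrix-vector product, this is the efficiency the next paragraph refers to.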
Additive and multiplicative attention are similar in theoretical complexity, although multiplicative attention is faster and more space-efficient in practice, as it can be implemented using highly optimized matrix multiplication. Both variants perform similarly for small dimensionality of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale the dot product by $\frac{1}{\sqrt{d_k}}$, as in scaled dot-product attention.
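A minimal sketch of the scaled variant, assuming query/key/value matrices as in the Transformer formulation (names and shapes here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v). Returns (n, d_v)."""
    d_k = K.shape[-1]
    # Dividing by sqrt(d_k) keeps score magnitudes stable as d_k grows,
    # preventing the softmax from saturating at large dimensions.
    scores = Q @ K.T / np.sqrt(d_k)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 6))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 6): one d_v-sized output per query
```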