
What is: Scaled Dot-Product Attention?

Source: Attention Is All You Need
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally, we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
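
The sketch below is a minimal NumPy rendering of this formula; the function name, argument shapes, and the example usage are illustrative assumptions rather than any particular library's API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns an (n_q, d_v) matrix of attended values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled similarities, shape (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax attention weights
    return weights @ V                                # weighted sum of values

# Example with arbitrary sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 8))
out = scaled_dot_product_attention(Q, K, V)           # shape (2, 8)
```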

If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.
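
As a quick numerical check of this argument, the sketch below (assuming NumPy and an arbitrary choice of $d_k = 64$) samples many dot products of independent standard-normal vectors; their empirical variance comes out close to $d_k$, and close to $1$ after dividing by $\sqrt{d_k}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64  # illustrative dimension

# Dot products of independent vectors with zero-mean, unit-variance components
dots = np.array([rng.normal(size=d_k) @ rng.normal(size=d_k) for _ in range(10_000)])

print(dots.var())                    # roughly d_k (about 64)
print((dots / np.sqrt(d_k)).var())   # roughly 1 after scaling
```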