What is: Multi-Query Attention?
Source | Fast Transformer Decoding: One Write-Head is All You Need |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Multi-head attention consists of multiple attention layers (heads) run in parallel, each with its own linear transformations of the queries, keys, values, and outputs. Multi-query attention is identical except that all heads share a single set of keys and values, keeping only the queries (and output projections) per-head. This shrinks the key/value tensors that must be loaded at each decoding step, which speeds up incremental transformer decoding.
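Below is a minimal PyTorch sketch of the idea, written in the einsum style used by the paper's pseudocode. The function name, weight names (`wq`, `wk`, `wv`, `wo`), and tensor shapes are illustrative assumptions, not the paper's exact code: queries and output projections are per-head, while a single key projection and a single value projection are shared across all heads.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, wq, wk, wv, wo):
    """Illustrative multi-query attention (names and shapes are assumptions).

    x:  (batch, seq, d_model)      input sequence
    wq: (heads, d_model, d_head)   per-head query projections
    wk: (d_model, d_head)          single key projection shared by all heads
    wv: (d_model, d_head)          single value projection shared by all heads
    wo: (heads, d_head, d_model)   per-head output projections
    """
    q = torch.einsum("bnd,hdk->bhnk", x, wq)   # per-head queries
    k = torch.einsum("bnd,dk->bnk", x, wk)     # one shared set of keys
    v = torch.einsum("bnd,dk->bnk", x, wv)     # one shared set of values

    # Every head attends against the same keys/values.
    scores = torch.einsum("bhnk,bmk->bhnm", q, k) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    ctx = torch.einsum("bhnm,bmk->bhnk", weights, v)   # per-head contexts
    return torch.einsum("bhnk,hkd->bnd", ctx, wo)      # sum of head outputs

# Example: batch of 2 sequences of length 5, model width 32, 4 heads of size 8.
b, n, d_model, h, d_head = 2, 5, 32, 4, 8
x = torch.randn(b, n, d_model)
wq = torch.randn(h, d_model, d_head)
wk = torch.randn(d_model, d_head)
wv = torch.randn(d_model, d_head)
wo = torch.randn(h, d_head, d_model)
print(multi_query_attention(x, wq, wk, wv, wo).shape)  # torch.Size([2, 5, 32])
```

Compared with standard multi-head attention, the only change is that `wk` and `wv` drop their head dimension, so during autoregressive decoding the cached keys and values are a factor of `heads` smaller.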