What is: Talking-Heads Attention?
| Source | Talking-Heads Attention |
| Year | 2020 |
| Data Source | CC BY-SA - https://paperswithcode.com |
Talking-Heads Attention is a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation: two additional learned linear projections, $P_l$ and $P_w$, are inserted, which transform the attention logits and the attention weights respectively, moving information across attention heads. Instead of one "heads" dimension $h$ across the whole computation, there are now three separate heads dimensions, $h_k$, $h$, and $h_v$, which can optionally differ in size (number of "heads"): $h_k$ is the number of attention heads for the keys and the queries, $h$ is the number of attention heads for the logits and the weights, and $h_v$ is the number of attention heads for the values.
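To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the core operation for a single example (no batch dimension; the input/output projections that produce $Q$, $K$, $V$ and combine the heads are omitted). The function name and argument layout are illustrative assumptions, not the paper's reference pseudocode; only the two cross-head projections $P_l$ and $P_w$ and the three head dimensions $h_k$, $h$, $h_v$ follow the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(q, k, v, P_l, P_w):
    """Sketch of talking-heads attention (single example, no batch dim).

    q:   [n, h_k, d_k]  queries, h_k heads
    k:   [m, h_k, d_k]  keys,    h_k heads
    v:   [m, h_v, d_v]  values,  h_v heads
    P_l: [h_k, h]       learned projection across heads, applied to the logits
    P_w: [h, h_v]       learned projection across heads, applied to the weights
    """
    # Scaled dot-product logits per key/query head: [h_k, n, m]
    logits = np.einsum('nhd,mhd->hnm', q, k) / np.sqrt(q.shape[-1])
    # Mix information across heads immediately BEFORE the softmax: [h, n, m]
    logits = np.einsum('hnm,hj->jnm', logits, P_l)
    # Softmax over the key positions
    weights = softmax(logits, axis=-1)
    # Mix information across heads immediately AFTER the softmax: [h_v, n, m]
    weights = np.einsum('jnm,jv->vnm', weights, P_w)
    # Weighted sum of values: [n, h_v, d_v]
    return np.einsum('vnm,mvd->nvd', weights, v)
```

A quick usage example with heads of different sizes ($h_k = 2$, $h = 3$, $h_v = 2$):

```python
n, m, h_k, h, h_v, d_k, d_v = 4, 6, 2, 3, 2, 8, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(n, h_k, d_k))
k = rng.normal(size=(m, h_k, d_k))
v = rng.normal(size=(m, h_v, d_v))
P_l = rng.normal(size=(h_k, h))
P_w = rng.normal(size=(h, h_v))
out = talking_heads_attention(q, k, v, P_l, P_w)  # shape [n, h_v, d_v]
```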