
What is: Talking-Heads Attention?

Source: Talking-Heads Attention
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Talking-Heads Attention is a variation on multi-head attention that includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation: two additional learned linear projections, $P_l$ and $P_w$, transform the attention logits and the attention weights respectively, moving information across attention heads.

Instead of one "heads" dimension $h$ across the whole computation, there are now three separate heads dimensions, $h_k$, $h$, and $h_v$, which can optionally differ in size (number of "heads"). $h_k$ is the number of attention heads for the keys and the queries, $h$ is the number of attention heads for the logits and the weights, and $h_v$ is the number of attention heads for the values.
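To make the head-mixing step concrete, here is a minimal NumPy sketch of the computation for a single example, assuming shapes Q: [n, h_k, d_k], K: [m, h_k, d_k], V: [m, h_v, d_v] and the usual $1/\sqrt{d_k}$ logit scaling; the function and variable names are illustrative, not taken from a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(Q, K, V, P_l, P_w):
    """Talking-heads attention for one example (no batch dimension).

    Q:   [n, h_k, d_k]  queries, h_k heads
    K:   [m, h_k, d_k]  keys, h_k heads
    V:   [m, h_v, d_v]  values, h_v heads
    P_l: [h_k, h]       projection of the logits across the heads dimension
    P_w: [h, h_v]       projection of the weights across the heads dimension
    Returns: [n, h_v, d_v]
    """
    d_k = Q.shape[-1]
    # Attention logits, one [n, m] map per key/query head: [h_k, n, m]
    J = np.einsum('nad,mad->anm', Q, K) / np.sqrt(d_k)
    # P_l mixes the h_k logit heads into h heads: [h, n, m]
    L = np.einsum('anm,ah->hnm', J, P_l)
    # Softmax over the keys dimension, as in standard attention
    W = softmax(L, axis=-1)
    # P_w mixes the h weight heads into h_v heads: [h_v, n, m]
    U = np.einsum('hnm,hv->vnm', W, P_w)
    # Weighted sum of values per value head: [n, h_v, d_v]
    return np.einsum('vnm,mvd->nvd', U, V)

# Example with h_k, h, h_v deliberately chosen to differ
n, m, d_k, d_v = 4, 6, 8, 8
h_k, h, h_v = 2, 3, 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, h_k, d_k))
K = rng.normal(size=(m, h_k, d_k))
V = rng.normal(size=(m, h_v, d_v))
P_l = rng.normal(size=(h_k, h))
P_w = rng.normal(size=(h, h_v))
out = talking_heads_attention(Q, K, V, P_l, P_w)
print(out.shape)  # (4, 2, 8)
```

Note that setting $h_k = h = h_v$ and taking $P_l$ and $P_w$ to be identity matrices recovers standard multi-head attention, since the heads then no longer exchange information.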