What is: Multi-DConv-Head Attention?

Source: Primer: Searching for Efficient Transformers for Language Modeling
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Multi-DConv-Head Attention, or MDHA, is a type of Multi-Head Attention that utilizes depthwise convolutions after the multi-head projections. It is used in the Primer Transformer architecture.

Specifically, 3x1 depthwise convolutions are added after each of the multi-head projections for query Q, key K and value V in self-attention. These depthwise convolutions are performed over the spatial (sequence) dimension of each dense projection's output. Interestingly, this ordering of a pointwise projection followed by a depthwise convolution is the reverse of typical separable convolution, and the authors find the typical ordering to be less effective. They also find that wider depthwise convolutions and standard convolutions not only fail to improve performance, but in several cases hurt it.
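The sketch below illustrates the idea in PyTorch: each of the Q, K and V dense projections is followed by a 3x1 depthwise convolution over the sequence dimension before the usual scaled dot-product attention. This is a minimal, assumption-laden sketch, not the Primer reference implementation: the class and parameter names are invented for illustration, and causal (left) padding is assumed, as in decoder-only language modeling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiDConvHeadAttention(nn.Module):
    """Minimal sketch of Multi-DConv-Head Attention (MDHA).

    A 3x1 depthwise convolution over the sequence dimension is applied to
    the query, key and value projections before scaled dot-product attention.
    """

    def __init__(self, d_model: int, num_heads: int, kernel_size: int = 3):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.kernel_size = kernel_size
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # groups == channels makes these depthwise: every per-head feature
        # channel is convolved independently over the sequence dimension.
        self.q_dconv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.k_dconv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.v_dconv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def _dconv(self, x: torch.Tensor, conv: nn.Conv1d) -> torch.Tensor:
        # x: (batch, seq, d_model) -> convolve over the sequence dimension.
        x = x.transpose(1, 2)                    # (batch, d_model, seq)
        x = F.pad(x, (self.kernel_size - 1, 0))  # causal left padding (assumed)
        return conv(x).transpose(1, 2)           # back to (batch, seq, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        b, t, _ = x.shape
        # Pointwise (dense) projection first, then depthwise convolution.
        q = self._dconv(self.q_proj(x), self.q_dconv)
        k = self._dconv(self.k_proj(x), self.k_dconv)
        v = self._dconv(self.v_proj(x), self.v_dconv)

        def split_heads(z: torch.Tensor) -> torch.Tensor:
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        if mask is not None:
            attn = attn.masked_fill(mask, float("-inf"))
        out = attn.softmax(dim=-1) @ v                       # (b, heads, t, head_dim)
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```

Note that because the convolution uses one group per channel, it acts independently on each head's features, which matches the per-head convolution described above.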

MDHA is similar to Convolutional Attention, which uses separable convolution instead of depthwise convolution and, unlike MDHA, does not apply the convolution operations per attention head.