
What is: gMLP?

Source: Pay Attention to MLPs
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:

Z=σ(XU),Z~=s(Z),Y=Z~VZ=\sigma(X U), \quad \tilde{Z}=s(Z), \quad Y=\tilde{Z} V

where $\sigma$ is an activation function such as GeLU. $U$ and $V$ define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are $768 \times 3072$ and $3072 \times 768$ for $\text{BERT}_{\text{base}}$).
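The block equations translate almost directly into code. Below is a minimal PyTorch sketch (PyTorch itself and the names `gMLPBlock`, `d_model`, `d_ffn`, `spatial_layer` are assumptions for illustration, not from the source); the spatial layer $s(\cdot)$ is left as a pluggable callable, and the BERT-base shapes quoted above are used as defaults.

```python
import torch
import torch.nn as nn


class gMLPBlock(nn.Module):
    """Sketch of one gMLP block: Z = sigma(X U), Z~ = s(Z), Y = Z~ V."""

    def __init__(self, spatial_layer, d_model=768, d_ffn=3072):
        super().__init__()
        self.proj_in = nn.Linear(d_model, d_ffn)   # U: channel projection (768 x 3072)
        self.act = nn.GELU()                        # sigma: GeLU activation
        self.spatial = spatial_layer                # s(.): cross-token interaction layer
        self.proj_out = nn.Linear(d_ffn, d_model)   # V: channel projection (3072 x 768)

    def forward(self, x):                           # x: (batch, n, d_model)
        z = self.act(self.proj_in(x))               # Z = sigma(X U)
        z = self.spatial(z)                         # Z~ = s(Z)
        return self.proj_out(z)                     # Y = Z~ V
```

With `spatial_layer=nn.Identity()` this reduces to the plain FFN case discussed next.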

A key ingredient is $s(\cdot)$, a layer which captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the use of a Spatial Gating Unit, which involves a modified linear gating.
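Continuing the sketch above, one simple way to realize such a gating layer is to gate $Z$ elementwise with a linear projection taken across the token (sequence) dimension. This is a dimension-preserving variant chosen here to keep shapes compatible with `gMLPBlock`; the paper's preferred Spatial Gating Unit additionally splits $Z$ along channels, which this sketch omits (an assumption), and the class/parameter names are illustrative.

```python
class SpatialGatingUnit(nn.Module):
    """Sketch of s(.): gate each position's features with a projection across tokens."""

    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn)
        # Linear projection over the token dimension: this is where positions interact.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Near-identity initialization: gate starts as all-ones, so s(Z) ~ Z and the
        # block initially behaves like a regular FFN.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z):                                        # z: (batch, seq_len, d_ffn)
        gate = self.norm(z)
        gate = self.spatial_proj(gate.transpose(1, 2)).transpose(1, 2)
        return z * gate                                          # elementwise gating
```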

The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP does not require position embeddings because such information is captured in $s(\cdot)$.
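A stack of $L$ identical blocks therefore needs only token embeddings at the input. The sketch below wires the pieces together; the model name, vocabulary size, and the pre-normalization plus residual shortcut around each block are assumptions layered on top of the equations above.

```python
class gMLPModel(nn.Module):
    """Sketch: token embedding + L identical gMLP blocks, no position embeddings."""

    def __init__(self, vocab_size, num_layers=12, d_model=768, d_ffn=3072, seq_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # token embeddings only
        self.blocks = nn.ModuleList([
            gMLPBlock(SpatialGatingUnit(d_ffn, seq_len), d_model, d_ffn)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])

    def forward(self, token_ids):                                # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))                               # pre-norm + residual (assumption)
        return x


# Usage sketch: a batch of 2 sequences of length 128.
model = gMLPModel(vocab_size=30000)
out = model(torch.randint(0, 30000, (2, 128)))
print(out.shape)  # torch.Size([2, 128, 768])
```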