What is gMLP?
Source | Pay Attention to MLPs
Year | 2021
Data Source | CC BY-SA - https://paperswithcode.com
gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:

$$Z = \sigma(XU), \quad \tilde{Z} = s(Z), \quad Y = \tilde{Z}V$$

where $\sigma$ is an activation function such as GeLU. $U$ and $V$ define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are $768 \times 3072$ and $3072 \times 768$ for $\text{BERT}_{\text{base}}$).
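A minimal PyTorch sketch of one block under these equations. The class name `GMLPBlock`, the pre-normalization, the residual connection, and the example dimensions are illustrative assumptions rather than details from the source; the spatial layer $s(\cdot)$ is left pluggable and defaults to the identity.

```python
import torch
import torch.nn as nn

class GMLPBlock(nn.Module):
    """Hypothetical sketch of one gMLP block: Z = sigma(X U), Z~ = s(Z), Y = Z~ V."""

    def __init__(self, d_model, d_ffn, spatial_layer=None, spatial_out=None):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)              # pre-normalization of the block input
        self.proj_in = nn.Linear(d_model, d_ffn)       # U: channel projection, d -> d_ffn
        self.act = nn.GELU()                           # sigma
        self.spatial = spatial_layer or nn.Identity()  # s(.): identity reduces the block to a plain FFN
        # If s(.) changes the channel width (e.g. a gating unit that halves it), pass spatial_out.
        self.proj_out = nn.Linear(spatial_out or d_ffn, d_model)  # V: channel projection back to d

    def forward(self, x):                              # x: (batch, n, d)
        shortcut = x
        z = self.act(self.proj_in(self.norm(x)))       # Z = sigma(X U)
        z = self.spatial(z)                            # Z~ = s(Z)
        return shortcut + self.proj_out(z)             # Y = Z~ V, plus a residual connection


x = torch.randn(2, 128, 256)                           # (batch, seq_len, d_model)
y = GMLPBlock(d_model=256, d_ffn=1024)(x)
print(y.shape)                                         # torch.Size([2, 128, 256])
```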
A key ingredient is $s(\cdot)$, a layer which captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the Spatial Gating Unit, which gates the input elementwise with the output of a linear projection taken across the spatial (token) dimension.
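In the paper, the Spatial Gating Unit splits $Z$ in half along the channel dimension and gates one half with a spatial projection of the other, $s(Z) = Z_1 \odot (WZ_2 + b)$, where $W \in \mathbb{R}^{n \times n}$ acts across tokens and is initialized near zero with a bias near one so the unit starts out close to an identity. A sketch of that behaviour, with illustrative names and a fixed sequence length:

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Sketch of s(Z) = Z1 * (W Z2 + b), with W mixing information across tokens."""

    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)             # normalize the gate path for stability
        self.spatial_proj = nn.Linear(seq_len, seq_len)  # W, b: an n x n map along the sequence axis
        nn.init.zeros_(self.spatial_proj.weight)         # W ~ 0 at initialization ...
        nn.init.ones_(self.spatial_proj.bias)            # ... and b = 1, so s(Z) ~ Z1 early in training

    def forward(self, z):                                # z: (batch, n, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)                      # split along the channel dimension
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # project across tokens
        return z1 * z2                                   # elementwise gating; output has d_ffn // 2 channels
```

Because of the split, the gated output has $d_{ffn}/2$ channels, so the final channel projection $V$ maps $d_{ffn}/2$ back to $d$ (e.g. `spatial_out=d_ffn // 2` in the block sketch above).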
The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP does not require position embeddings because such information will be captured in $s(\cdot)$.
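For comparison, a brief sketch of the inverted-bottleneck choice mentioned above, where the spatial step is a depthwise convolution applied along the sequence axis (one filter per channel); the kernel size and shapes here are arbitrary examples.

```python
import torch
import torch.nn as nn

# Depthwise convolution over the sequence dimension: each channel is mixed
# across neighbouring tokens by its own filter (groups = number of channels).
channels, kernel_size = 512, 3
depthwise = nn.Conv1d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels)

x = torch.randn(8, 128, channels)                # (batch, seq_len, channels)
y = depthwise(x.transpose(1, 2)).transpose(1, 2) # mix along the sequence axis, per channel
print(y.shape)                                   # torch.Size([8, 128, 512])
```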