What is: Spatial Gating Unit?
Source | Pay Attention to MLPs |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer must contain a contraction operation over the spatial dimension. The layer is formulated as the output of linear gating:

$$s(Z) = Z \odot f_{W,b}(Z), \qquad f_{W,b}(Z) = WZ + b$$
where $\odot$ denotes element-wise multiplication, $Z$ is the layer input, and $f_{W,b}$ is a linear projection along the spatial (token) dimension with weights $W$ and bias $b$. For training stability, the authors find it critical to initialize $W$ as near-zero values and $b$ as ones, meaning that $f_{W,b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning.
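The gating and its identity-like initialization can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' implementation; the function name and shapes are assumptions):

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """Minimal SGU sketch: s(Z) = Z * f_{W,b}(Z), with f a spatial projection.

    Z : (n, d) array of n token representations with d channels.
    W : (n, n) spatial projection weights (near-zero at init).
    b : (n,)   spatial projection bias (ones at init).
    """
    f = W @ Z + b[:, None]   # linear projection over the spatial (token) axis
    return Z * f             # element-wise gating

n, d = 4, 8
rng = np.random.default_rng(0)
Z = rng.standard_normal((n, d))

# Near-zero W and all-ones b => f_{W,b}(Z) ~ 1, so s(Z) ~ Z at initialization.
W = 1e-6 * rng.standard_normal((n, n))
b = np.ones(n)
out = spatial_gating_unit(Z, W, b)
print(np.allclose(out, Z, atol=1e-4))  # True: the gate starts as an identity
```

Note that $W$ mixes information across tokens (rows of $Z$), not across channels; this is the contraction over the spatial dimension that the per-token FFN lacks.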
The authors find it further effective to split $Z$ into two independent parts $(Z_1, Z_2)$ along the channel dimension, $Z_2$ for the gating function and $Z_1$ for the multiplicative bypass:

$$s(Z) = Z_1 \odot f_{W,b}(Z_2)$$
They also normalize the input to $f_{W,b}$, which empirically improved the stability of large NLP models.
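The split variant with input normalization can be sketched as follows (again an illustrative NumPy sketch under assumed shapes; the simplified layer norm omits learned scale and shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization over channels (no learned scale/shift here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sgu_split(Z, W, b):
    """Split-channel SGU sketch: s(Z) = Z1 * f_{W,b}(norm(Z2))."""
    Z1, Z2 = np.split(Z, 2, axis=-1)      # split along the channel dimension
    f = W @ layer_norm(Z2) + b[:, None]   # spatial projection of normalized Z2
    return Z1 * f                          # gate the bypass half

n, d = 4, 8
rng = np.random.default_rng(1)
Z = rng.standard_normal((n, d))
W = 1e-6 * rng.standard_normal((n, n))     # near-zero init
b = np.ones(n)                             # ones init
out = sgu_split(Z, W, b)
print(out.shape)  # (4, 4): the output keeps half the channels
```

With the same initialization as before, $f_{W,b}(\mathrm{norm}(Z_2)) \approx 1$, so the block initially passes $Z_1$ through unchanged.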