What is NormFormer?
Source | NormFormer: Improved Transformer Pretraining with Extra Normalization |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
NormFormer is a Pre-LN transformer variant that adds three normalization operations to each layer: a Layer Norm after self-attention, head-wise scaling of the self-attention outputs, and a Layer Norm after the first fully connected layer. These modifications add only a small number of learnable parameters, giving each layer a cost-effective way to change the magnitude of its features and, in turn, the magnitude of the gradients flowing to subsequent components.
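
The three additions can be sketched in NumPy as a single-sequence forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `init_params`, `normformer_layer`, and `extra_param_count` are hypothetical, ReLU is used as the activation for simplicity, the extra feed-forward Layer Norm is placed after the activation, and masking, dropout, and batching are omitted.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Normalize over the feature (last) axis, then apply learnable gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_model, n_heads, d_ff, seed=0):
    # Hypothetical helper: small random weights, identity-initialized norms.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    ln = lambda d: (np.ones(d), np.zeros(d))
    return {
        "n_heads": n_heads,
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "ln_attn_in": ln(d_model),       # standard Pre-LN
        "ln_ffn_in": ln(d_model),        # standard Pre-LN
        "ln_attn_out": ln(d_model),      # addition 1: Layer Norm after self-attention
        "head_scale": np.ones(n_heads),  # addition 2: one learnable scalar per head
        "ln_ffn_mid": ln(d_ff),          # addition 3: Layer Norm after first FC layer
    }

def normformer_layer(x, p):
    # x has shape (seq_len, d_model).
    seq_len, d_model = x.shape
    h = p["n_heads"]
    d_head = d_model // h
    split = lambda t: t.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    # Self-attention sublayer (Pre-LN input norm).
    y = layer_norm(x, *p["ln_attn_in"])
    q, k, v = split(y @ p["wq"]), split(y @ p["wk"]), split(y @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v                       # (h, seq_len, d_head)
    heads = heads * p["head_scale"][:, None, None]    # head-wise scaling
    attn = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ p["wo"]
    x = x + layer_norm(attn, *p["ln_attn_out"])       # extra norm before the residual add

    # Feed-forward sublayer (Pre-LN input norm).
    y = layer_norm(x, *p["ln_ffn_in"])
    hidden = np.maximum(0.0, y @ p["w1"] + p["b1"])   # ReLU is an assumption here
    hidden = layer_norm(hidden, *p["ln_ffn_mid"])     # extra norm inside the FFN
    return x + hidden @ p["w2"] + p["b2"]

def extra_param_count(d_model, n_heads, d_ff):
    # Parameters added on top of a Pre-LN layer in this sketch:
    # gain and bias for two Layer Norms, plus one scalar per head.
    return 2 * d_model + n_heads + 2 * d_ff
```

Calling `normformer_layer(x, init_params(d_model, n_heads, d_ff))` on an array of shape `(seq_len, d_model)` returns an array of the same shape. In this sketch the added parameters grow linearly with the layer width, while the attention and feed-forward weight matrices grow quadratically, which is why the overhead stays small.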