What is: T-Fixup?
Source | Improving Transformer Optimization Through Better Initialization |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
T-Fixup is an initialization method for Transformers that aims to remove the need for layer normalization and warmup. The initialization procedure is as follows:
- Apply Xavier initialization for all parameters excluding input embeddings. Use Gaussian initialization $\mathcal{N}(0, d^{-\frac{1}{2}})$ for input embeddings, where $d$ is the embedding dimension.
- Scale $v$ and $w$ matrices in each decoder attention block, weight matrices in each decoder MLP block, and input embeddings $x$ and $y$ in encoder and decoder by $(9N)^{-\frac{1}{4}}$, where $N$ is the number of decoder layers.
- Scale $v$ and $w$ matrices in each encoder attention block and weight matrices in each encoder MLP block by $0.67 N^{-\frac{1}{4}}$.
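The decoder-side steps above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the restriction to a decoder-only stack, and the choice of which projections to scale (value/output projections and MLP weights, with the scheme's $(9N)^{-\frac{1}{4}}$ factor) follow the bullet list, while query/key projections keep plain Xavier initialization.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier/Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def t_fixup_init(d_model, n_layers, vocab_size, rng=None):
    """Sketch of T-Fixup for a decoder-only stack (hypothetical helper).

    Returns input embeddings and per-layer weights with the T-Fixup
    scaling applied to the value/output projections and MLP weights.
    """
    rng = rng or np.random.default_rng(0)
    scale = (9.0 * n_layers) ** (-0.25)  # decoder scaling factor (9N)^{-1/4}

    # Input embeddings: Gaussian N(0, d^{-1/2}), then scaled by (9N)^{-1/4}
    emb = rng.normal(0.0, d_model ** -0.5, size=(vocab_size, d_model)) * scale

    layers = []
    for _ in range(n_layers):
        layers.append({
            # query/key projections: Xavier only, no extra scaling
            "w_q": xavier_uniform(d_model, d_model, rng),
            "w_k": xavier_uniform(d_model, d_model, rng),
            # value/output projections and MLP weights: Xavier * (9N)^{-1/4}
            "w_v": xavier_uniform(d_model, d_model, rng) * scale,
            "w_o": xavier_uniform(d_model, d_model, rng) * scale,
            "w_ff1": xavier_uniform(d_model, 4 * d_model, rng) * scale,
            "w_ff2": xavier_uniform(4 * d_model, d_model, rng) * scale,
        })
    return emb, layers
```

Because the scaled matrices shrink with depth, deeper models start with smaller residual-branch updates, which is what lets training proceed without layer normalization or a warmup schedule.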