What is: Switch Transformer?
Source | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Switch Transformer is a sparsely-activated expert Transformer model that simplifies and improves upon Mixture of Experts by routing each token to a single expert rather than several. Through distillation of sparse pre-trained and specialized fine-tuned models into small dense models, it reduces model size by up to 99% while preserving 30% of the quality gains of the large sparse teacher. It also uses selective precision training, which enables training in lower bfloat16 precision; an initialization scheme that allows scaling to a larger number of experts; and increased regularization that improves sparse model fine-tuning and multi-task training.
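The core routing idea can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the class and function names (`SwitchRouter`, `switch_layer`) and the random weight initialization are hypothetical, and real Switch layers also add a load-balancing loss and capacity limits that are omitted here. A router produces a probability over experts for each token, the token is sent only to the argmax (top-1) expert, and that expert's output is scaled by the gate probability:

```python
import math
import random

random.seed(0)


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class SwitchRouter:
    """Toy top-1 router: each token is assigned to exactly one expert."""

    def __init__(self, d_model, n_experts):
        # Random router weight matrix (hypothetical initialization,
        # not the paper's scaled scheme).
        self.w = [[random.gauss(0.0, 0.1) for _ in range(n_experts)]
                  for _ in range(d_model)]

    def route(self, token):
        # logits = token @ W_router, one logit per expert
        n_experts = len(self.w[0])
        logits = [sum(t * self.w[i][e] for i, t in enumerate(token))
                  for e in range(n_experts)]
        probs = softmax(logits)
        # Top-1 routing: pick the single highest-probability expert.
        expert = max(range(n_experts), key=lambda e: probs[e])
        return expert, probs[expert]


def switch_layer(tokens, router, experts):
    """Apply the chosen expert to each token, scaled by its gate value."""
    out = []
    for tok in tokens:
        e, gate = router.route(tok)
        out.append([gate * v for v in experts[e](tok)])
    return out
```

Because only one expert runs per token, the compute cost per token stays roughly constant as the number of experts (and hence total parameter count) grows, which is what allows the parameter count to scale toward the trillion range.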