What is: Adafactor?
Source | Adafactor: Adaptive Learning Rates with Sublinear Memory Cost |
Year | 2018 |
Data Source | CC BY-SA - https://paperswithcode.com |
Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n+m)$.
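The factored accumulator can be sketched in a few lines of NumPy; the function name, argument layout, and the choice to regularize with $\epsilon_1$ inside the accumulator are illustrative assumptions rather than the paper's reference code:

```python
import numpy as np

def factored_second_moment(grad, row_acc, col_acc, beta2, eps1=1e-30):
    """One step of the factored squared-gradient accumulator (illustrative sketch).

    Instead of storing a full n x m accumulator V, keep exponentially smoothed
    row sums R (length n) and column sums C (length m) of the squared gradients,
    and rebuild the rank-1 approximation V_hat = R C^T / sum(R) when needed.
    """
    sq = grad ** 2 + eps1                                      # regularized squared gradient
    row_acc = beta2 * row_acc + (1 - beta2) * sq.sum(axis=1)   # shape (n,)
    col_acc = beta2 * col_acc + (1 - beta2) * sq.sum(axis=0)   # shape (m,)
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()         # rank-1 reconstruction, shape (n, m)
    return row_acc, col_acc, v_hat
```

Only `row_acc` and `col_acc` need to persist between steps, which is where the $O(n+m)$ memory cost comes from; `v_hat` can be rebuilt (or used implicitly) whenever the update is computed.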
Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^{T}$, the authors define the optimization algorithm in terms of relative step sizes $\{\rho_t\}_{t=1}^{T}$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0.
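Concretely, the absolute step size applied to a parameter matrix $X$ at step $t$ is

$$\alpha_t = \max\left(\epsilon_2, \operatorname{RMS}(X_{t-1})\right)\rho_t,$$

so a parameter initialized at zero still receives steps of size $\epsilon_2 \rho_t$ until its scale grows above $\epsilon_2$.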
Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d = 1$, $\hat{\beta}_{2t} = 1 - t^{-0.8}$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$.
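Putting the pieces together, a single Adafactor-style update for a matrix parameter under these defaults might look like the following sketch (momentum-free, as in the factored variant; the function name and the exact placement of $\epsilon_1$ are assumptions, and the paper's full algorithm should be consulted for vector- and scalar-shaped parameters):

```python
import numpy as np

def adafactor_step(param, grad, row_acc, col_acc, t):
    """One illustrative Adafactor update for an n x m parameter matrix."""
    eps1, eps2, d = 1e-30, 1e-3, 1.0           # proposed defaults
    beta2 = 1.0 - t ** (-0.8)                  # decay schedule beta2_hat_t
    rho = min(1e-2, 1.0 / np.sqrt(t))          # relative step size rho_t

    # Factored second-moment accumulator (row and column sums of grad**2).
    sq = grad ** 2 + eps1
    row_acc = beta2 * row_acc + (1 - beta2) * sq.sum(axis=1)
    col_acc = beta2 * col_acc + (1 - beta2) * sq.sum(axis=0)
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()

    # Adaptive update, clipped so its RMS does not exceed the threshold d.
    u = grad / np.sqrt(v_hat)
    u = u / max(1.0, np.sqrt(np.mean(u ** 2)) / d)

    # Relative step size scaled by the parameter's RMS, floored at eps2.
    alpha = max(eps2, np.sqrt(np.mean(param ** 2))) * rho
    return param - alpha * u, row_acc, col_acc
```

A typical caller would keep `row_acc` and `col_acc` as zero-initialized vectors of length $n$ and $m$ and thread them through successive steps starting from $t = 1$.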