
What is: Adafactor?

Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n + m)$.
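
Below is a minimal NumPy sketch of this factored second-moment accumulator for a single matrix-valued parameter. The function and variable names (`update_factored_second_moment`, `R`, `C`, `V_hat`) are illustrative, not taken from any particular implementation.

```python
# Sketch of Adafactor's factored second-moment accumulator for a
# parameter matrix of shape (n, m). Only O(n + m) state is stored.
import numpy as np

def update_factored_second_moment(R, C, grad, beta2):
    """Update row/column accumulators and return the rank-1 reconstruction.

    R: moving average of row sums of grad**2, shape (n,)
    C: moving average of column sums of grad**2, shape (m,)
    """
    g2 = grad ** 2
    R = beta2 * R + (1.0 - beta2) * g2.sum(axis=1)   # row sums, O(n) memory
    C = beta2 * C + (1.0 - beta2) * g2.sum(axis=0)   # column sums, O(m) memory
    # Rank-1 approximation of the full (n, m) accumulator,
    # V_hat = outer(R, C) / sum(R), optimal under the generalized KL divergence.
    V_hat = np.outer(R, C) / R.sum()
    return R, C, V_hat

# Toy usage: one accumulator update for a 4x3 parameter matrix.
rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 3))
R = np.zeros(4)
C = np.zeros(3)
R, C, V_hat = update_factored_second_moment(R, C, grad, beta2=0.999)
print(V_hat.shape)  # (4, 3), reconstructed from O(n + m) state
```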

Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^{T}$, the authors define it in terms of relative step sizes $\{\rho_t\}_{t=1}^{T}$, which are multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. This lower bound allows zero-initialized parameters to escape 0.
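
As a rough illustration of this relative step-size rule under the definitions above (the helper name `absolute_step_size` is hypothetical):

```python
# The absolute update scale is rho_t times the root-mean-square of the
# parameter, floored at epsilon_2 so zero-initialized parameters can still move.
import numpy as np

def absolute_step_size(param, rho_t, epsilon_2=1e-3):
    rms = np.sqrt(np.mean(param ** 2))     # scale of the parameter
    return rho_t * max(epsilon_2, rms)     # alpha_t = rho_t * max(eps_2, RMS(param))

# A zero-initialized matrix still gets a nonzero step via the epsilon_2 floor.
W = np.zeros((4, 3))
print(absolute_step_size(W, rho_t=1e-2))   # 1e-05
```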

Proposed hyperparameters are: $\epsilon_1 = 10^{-30}$, $\epsilon_2 = 10^{-3}$, $d = 1$, $\rho_t = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2_t} = 1 - t^{-0.8}$.
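
For concreteness, the two time-dependent schedules can be written as plain functions, assuming training steps are counted from $t = 1$:

```python
def rho_t(t):
    """Relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / t ** 0.5)

def beta2_hat(t):
    """Decay rate for the second-moment accumulator: 1 - t^(-0.8)."""
    return 1.0 - t ** (-0.8)

for t in (1, 10, 10_000):
    print(t, rho_t(t), beta2_hat(t))
```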