
What is: RAdam?

Source: On the Variance of the Adaptive Learning Rate and Beyond
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Rectified Adam, or RAdam, is a variant of the Adam stochastic optimizer that introduces a term to rectify the variance of the adaptive learning rate. It seeks to tackle the poor convergence problem that Adam can suffer from. The authors argue that the root cause of this behaviour is that the adaptive learning rate has undesirably large variance in the early stage of training, because of the limited number of training samples seen at that point. Thus, to reduce this variance, it is better to use smaller learning rates in the first few epochs of training, which justifies the warmup heuristic. This heuristic motivates RAdam, which rectifies the variance problem directly:

$$g_{t} = \nabla_{\theta}f_{t}\left(\theta_{t-1}\right)$$

$$v_{t} = \beta_{2}v_{t-1} + \left(1-\beta_{2}\right)g^{2}_{t}$$

$$m_{t} = \beta_{1}m_{t-1} + \left(1-\beta_{1}\right)g_{t}$$

$$\hat{m}_{t} = m_{t} / \left(1-\beta^{t}_{1}\right)$$

$$\rho_{t} = \rho_{\infty} - 2t\beta^{t}_{2}/\left(1-\beta^{t}_{2}\right)$$

$$\rho_{\infty} = \frac{2}{1-\beta_{2}} - 1$$
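To get a feel for these quantities, here is a minimal sketch that evaluates $\rho_{\infty}$ and $\rho_{t}$ numerically; the choice $\beta_{2} = 0.999$ and the sampled step counts are assumptions for illustration, not values from the text:

```python
# Sketch: how rho_t grows with the step count t (beta_2 = 0.999 assumed).
beta2 = 0.999
rho_inf = 2.0 / (1.0 - beta2) - 1.0  # = 1999 for beta_2 = 0.999

def rho(t: int) -> float:
    """rho_t = rho_inf - 2 * t * beta_2^t / (1 - beta_2^t)."""
    return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

for t in (1, 2, 4, 10, 100, 1000):
    print(t, rho(t))
# rho_t stays at or below 4 for the first few steps, so the rectified
# branch below is only taken once enough samples have been seen.
```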

If the variance is tractable, i.e. $\rho_{t} > 4$, then:

...the adaptive learning rate is computed as:

$$l_{t} = \sqrt{\left(1-\beta^{t}_{2}\right)/v_{t}}$$

...the variance rectification term is calculated as:

$$r_{t} = \sqrt{\frac{\left(\rho_{t}-4\right)\left(\rho_{t}-2\right)\rho_{\infty}}{\left(\rho_{\infty}-4\right)\left(\rho_{\infty}-2\right)\rho_{t}}}$$

...and we update parameters with adaptive momentum:

$$\theta_{t} = \theta_{t-1} - \alpha_{t}r_{t}\hat{m}_{t}l_{t}$$

If the variance is not tractable, we instead update with:

$$\theta_{t} = \theta_{t-1} - \alpha_{t}\hat{m}_{t}$$
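Putting the pieces together, the following is a minimal NumPy sketch of a single RAdam update following the equations above. The function name, the fixed hyperparameter defaults, the explicit state passing and the small `eps` added to $v_{t}$ for numerical stability are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def radam_step(theta, grad, m, v, t, alpha=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam update step; theta, grad, m, v are arrays, t is 1-based."""
    # Exponential moving averages of the squared gradient and the gradient.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m = beta1 * m + (1.0 - beta1) * grad

    # Bias-corrected first moment.
    m_hat = m / (1.0 - beta1 ** t)

    # Maximum and current length of the approximated SMA.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:
        # Variance is tractable: rectified adaptive update.
        l_t = np.sqrt((1.0 - beta2 ** t) / (v + eps))  # eps is an assumption
        r_t = np.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf) /
                      ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
        theta = theta - alpha * r_t * m_hat * l_t
    else:
        # Variance not tractable: fall back to the un-rectified momentum update.
        theta = theta - alpha * m_hat

    return theta, m, v
```

In early steps the `else` branch fires, so the optimizer behaves like SGD with momentum; once $\rho_{t}$ exceeds 4, the rectified adaptive learning rate takes over.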