
What is: Adam?

Source: Adam: A Method for Stochastic Optimization
Year: 2014
Data Source: CC BY-SA - https://paperswithcode.com

Adam is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of RMSProp and SGD with Momentum. The optimizer is designed to be appropriate for non-stationary objectives and for problems with very noisy and/or sparse gradients.

The weight updates are performed as:

$$w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$

with

$$\hat{m}_{t} = \frac{m_{t}}{1-\beta^{t}_{1}}$$

$$\hat{v}_{t} = \frac{v_{t}}{1-\beta^{t}_{2}}$$

$$m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t}$$

$$v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2}$$

$\eta$ is the step size/learning rate, around 1e-3 in the original paper. $\epsilon$ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $\beta_{1}$ and $\beta_{2}$ are forgetting parameters, with typical values 0.9 and 0.999, respectively.
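
To make the update concrete, here is a minimal NumPy sketch of a single Adam step following the equations above. The function name `adam_update` and the toy quadratic objective in the usage snippet are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def adam_update(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameters `w` given gradient `g`.

    `m` and `v` are the running first- and second-moment estimates,
    and `t` is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2      # second moment (scaling)
    m_hat = m / (1 - beta1**t)              # bias-corrected first moment
    v_hat = v / (1 - beta2**t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimise the toy objective f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    g = 2 * w
    w, m, v = adam_update(w, g, m, v, t)
```

The bias correction matters early in training: since $m_{0} = v_{0} = 0$, the raw moment estimates are biased towards zero for small $t$, and dividing by $1-\beta^{t}$ compensates for this.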