
What is: AdamW?

Source: Decoupled Weight Decay Regularization
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. To see this, note that $L_{2}$ regularization in Adam is usually implemented with the modification below, where $w_{t}$ is the weight decay rate at time $t$:

$$g_{t} = \nabla f\left(\theta_{t}\right) + w_{t}\theta_{t}$$
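As a concrete illustration, here is a minimal NumPy sketch of this coupled formulation; the variable names (`raw_grad`, `weight_decay`, etc.) are assumed for the example and are not from the paper. The point is that the decay term is folded into the gradient, so it also flows through Adam's moment estimates.

```python
import numpy as np

# Coupled L2 weight decay (sketch): the decay term is added to the gradient,
# so it is later rescaled by Adam's adaptive per-parameter learning rate.
theta = np.array([0.5, -1.2])         # parameters θ_t
raw_grad = np.array([0.1, 0.3])       # ∇f(θ_t)
weight_decay = 1e-2                   # w_t

g = raw_grad + weight_decay * theta   # g_t = ∇f(θ_t) + w_t·θ_t
# g is then fed into the usual Adam moment updates (m_t, v_t).
```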

AdamW instead moves the weight decay term out of the gradient and applies it directly in the parameter update:

$$\theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t} + \epsilon}} + w_{t, i}\theta_{t, i}\right), \quad \forall t$$
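For comparison, below is a minimal NumPy sketch of a single decoupled update step. The function name `adamw_step` and the hyperparameter names (`lr`, `beta1`, `beta2`, `eps`, `weight_decay`) are illustrative assumptions, not an official API; the moments are computed from the raw gradient, and the decay term is applied to the parameters outside the adaptive scaling.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay update (sketch following the formula above)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment of the raw gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment of the raw gradient
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    # Weight decay acts on θ directly, not on the gradient. eps is placed inside
    # the square root to match the formula above; common implementations put it outside.
    theta = theta - lr * (m_hat / np.sqrt(v_hat + eps) + weight_decay * theta)
    return theta, m, v

# Usage: a few steps on a toy quadratic loss f(θ) = ½‖θ‖², so ∇f(θ) = θ.
theta = np.array([0.5, -1.2])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 6):
    grad = theta
    theta, m, v = adamw_step(theta, grad, m, v, t)
print(theta)
```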