
What is: AdaBound?

Source: Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

AdaBound is a variant of the Adam stochastic optimizer designed to be more robust to extreme learning rates. It applies dynamic bounds to the learning rate: the lower and upper bounds are initialized to zero and infinity respectively, and both converge smoothly to a constant final step size. AdaBound therefore behaves like an adaptive method at the beginning of training and gradually, smoothly transforms into SGD (optionally with momentum) as the time step increases.
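The converging bounds described above can be illustrated with a small sketch. The text does not give the bound functions explicitly, so the schedule below is an assumption: `lower_bound`, `upper_bound`, `final_lr`, and `gamma` are hypothetical names chosen for illustration, and any pair of functions that starts at zero and infinity and converges to the same constant would serve equally well.

```python
import math

def lower_bound(t, final_lr=0.1, gamma=1e-3):
    """Assumed lower bound: 0 at t = 0, rising toward final_lr."""
    return final_lr * (1.0 - 1.0 / (gamma * t + 1.0))

def upper_bound(t, final_lr=0.1, gamma=1e-3):
    """Assumed upper bound: infinite at t = 0, falling toward final_lr."""
    if t == 0:
        return math.inf
    return final_lr * (1.0 + 1.0 / (gamma * t))

# Print the interval at a few time steps to see it tighten around final_lr.
for t in (0, 1, 1_000, 1_000_000):
    print(t, lower_bound(t), upper_bound(t))
```

Early in training the interval is so wide that clipping rarely takes effect, which is why the method behaves like Adam at first. As the interval narrows, every coordinate is pushed toward the same step size, which is the SGD-like regime.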

$$g_{t} = \nabla f_{t}\left(x_{t}\right)$$

$$m_{t} = \beta_{1t} m_{t-1} + \left(1 - \beta_{1t}\right) g_{t}$$

$$v_{t} = \beta_{2} v_{t-1} + \left(1 - \beta_{2}\right) g_{t}^{2} \text{ and } V_{t} = \text{diag}\left(v_{t}\right)$$

$$\hat{\eta}_{t} = \text{Clip}\left(\alpha / \sqrt{V_{t}}, \eta_{l}\left(t\right), \eta_{u}\left(t\right)\right) \text{ and } \eta_{t} = \hat{\eta}_{t} / \sqrt{t}$$

$$x_{t+1} = \Pi_{\mathcal{F}, \text{diag}\left(\eta_{t}^{-1}\right)}\left(x_{t} - \eta_{t} \odot m_{t}\right)$$

where $\alpha$ is the initial step size, and $\eta_{l}$ and $\eta_{u}$ are the lower and upper bound functions respectively.
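The update rules above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that are not part of the text: $\beta_{1t}$ is held constant at `beta1`, the feasible set $\mathcal{F}$ is taken to be all of $\mathbb{R}^{n}$ so the projection is the identity, a small `eps` is added to the denominator for numerical safety, and the bound schedule (`final_lr`, `gamma`) is a hypothetical choice. The function name `adabound_step` is likewise illustrative.

```python
import numpy as np

def adabound_step(x, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                  final_lr=0.1, gamma=1e-3, eps=1e-8):
    """One unconstrained AdaBound-style step for time step t >= 1."""
    # First and second moment estimates (beta1 held constant: an assumption).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Assumed bound schedule: both bounds converge to final_lr.
    eta_l = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    eta_u = final_lr * (1.0 + 1.0 / (gamma * t))
    # Clip the per-coordinate Adam step size, then decay it by sqrt(t).
    eta_hat = np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u)
    eta = eta_hat / np.sqrt(t)
    # Projection omitted: the feasible set is assumed to be all of R^n.
    x = x - eta * m
    return x, m, v

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x0 = np.array([1.0, -2.0])
x_final, m, v = x0.copy(), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    x_final, m, v = adabound_step(x_final, x_final, m, v, t)
print(np.linalg.norm(x0), np.linalg.norm(x_final))
```

The clipping acts element-wise, so each coordinate keeps its own adaptive step size only while that size lies inside the current bounds. Once the bounds have nearly met, all coordinates share almost the same step size.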