
What is: AdaDelta?

Source: ADADELTA: An Adaptive Learning Rate Method
Year: 2012
Data Source: CC BY-SA - https://paperswithcode.com

AdaDelta is a stochastic optimization technique that provides a per-dimension learning rate for SGD. It is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$.

Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E\left[g^{2}\right]_{t}$ at time step $t$ then depends only on the previous average and the current gradient:

$$E\left[g^{2}\right]_{t} = \gamma E\left[g^{2}\right]_{t-1} + \left(1-\gamma\right)g^{2}_{t}$$
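
As a minimal sketch of this accumulator (the function and variable names here are illustrative, not from the original paper), the decaying average can be maintained in place with NumPy:

```python
import numpy as np

def update_running_avg(avg_sq_grad, grad, gamma=0.9):
    """Decaying average of squared gradients, E[g^2]_t.

    avg_sq_grad: previous average E[g^2]_{t-1}, same shape as grad
    grad:        current gradient g_t
    gamma:       decay factor, typically around 0.9
    """
    return gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
```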

Usually $\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:

$$\Delta\theta_{t} = -\eta \cdot g_{t, i}$$

$$\theta_{t+1} = \theta_{t} + \Delta\theta_{t}$$

AdaDelta takes the form:

$$\Delta\theta_{t} = -\frac{\eta}{\sqrt{E\left[g^{2}\right]_{t} + \epsilon}}\, g_{t}$$
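
In the full AdaDelta method, the learning rate $\eta$ in the numerator is itself replaced by a decaying RMS of previous parameter updates, $\sqrt{E\left[\Delta\theta^{2}\right]_{t-1} + \epsilon}$, giving:

$$\Delta\theta_{t} = -\frac{\sqrt{E\left[\Delta\theta^{2}\right]_{t-1} + \epsilon}}{\sqrt{E\left[g^{2}\right]_{t} + \epsilon}}\, g_{t}$$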

The main advantage of AdaDelta is that we do not need to set a default learning rate.
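
Putting the pieces together, here is a hedged NumPy sketch of one AdaDelta step (names such as `adadelta_step` and the default `eps` value are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_update, gamma=0.9, eps=1e-6):
    """One AdaDelta update for parameters theta given gradient grad.

    avg_sq_grad:   running average of squared gradients, E[g^2]
    avg_sq_update: running average of squared updates,   E[dtheta^2]
    Returns updated parameters and accumulators.
    """
    # Accumulate the decaying average of squared gradients.
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2

    # RMS of past updates over RMS of gradients takes the place of
    # a hand-tuned learning rate eta.
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad

    # Accumulate the decaying average of squared updates.
    avg_sq_update = gamma * avg_sq_update + (1.0 - gamma) * update ** 2

    return theta + update, avg_sq_grad, avg_sq_update
```

Starting from zero-initialized accumulators of the same shape as `theta`, calling this function once per minibatch gradient applies the per-dimension scaling described above without a manually chosen learning rate.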