
What is: AdaGrad?

Year: 2011

Data Source: CC BY-SA - https://paperswithcode.com

AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$, based on the past gradients computed for $\theta_{i}$:

$$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \, g_{t, i}$$

where $g_{t, i}$ is the gradient of the objective with respect to $\theta_{i}$ at step $t$, $G_{t, ii}$ is the sum of the squares of the past gradients for $\theta_{i}$ up to step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.
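The update rule translates directly into code. Below is a minimal NumPy sketch of one AdaGrad step applied to a toy quadratic objective; the function name `adagrad_update` and the toy problem are illustrative choices, not part of the original description.

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad step: per-parameter learning rates from accumulated squared gradients."""
    G = G + grad ** 2                              # accumulate squared gradients (G_{t,ii})
    theta = theta - lr / np.sqrt(G + eps) * grad   # per-parameter scaled update
    return theta, G

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)
for t in range(100):
    grad = theta                                   # gradient of the toy objective
    theta, G = adagrad_update(theta, grad, G, lr=0.1)
print(theta)                                       # parameters move toward the minimum at 0
```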

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at a default value of 0.01. Its main weakness is the accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink and eventually become infinitesimally small.
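To make the shrinking effect concrete, here is a small illustrative calculation (an assumption-free consequence of the update rule, not from the source): with a constant gradient of 1, the accumulator grows linearly with $t$, so the effective step size $\eta / \sqrt{G_{t} + \epsilon}$ decays like $\eta / \sqrt{t}$.

```python
import numpy as np

eta, eps = 0.01, 1e-8
G = 0.0
for t in range(1, 6):
    G += 1.0 ** 2                         # constant gradient of 1 at every step
    print(t, eta / np.sqrt(G + eps))      # ~0.01, 0.0071, 0.0058, 0.005, 0.0045
```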

Image: Alec Radford