
What is: AdaShift?

Source: AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

AdaShift is an adaptive stochastic optimizer that decorrelates $v_t$ and $g_t$ in Adam by temporal shifting, i.e., by using the temporally shifted gradient $g_{t-n}$ to calculate $v_t$. The authors argue that an inappropriate correlation between the gradient $g_t$ and the second-moment term $v_t$ exists in Adam, so that a large gradient is likely to receive a small step size while a small gradient may receive a large one. They argue that such biased step sizes are the fundamental cause of Adam's non-convergence.

The AdaShift updates, based on the idea of temporal independence between gradients, are as follows:

$$g_{t} = \nabla f_{t}\left(\theta_{t}\right)$$

$$m_{t} = \frac{\sum_{i=0}^{n-1}\beta_{1}^{i}\, g_{t-i}}{\sum_{i=0}^{n-1}\beta_{1}^{i}}$$

Then, for $i = 1$ to $M$, where $i$ indexes the parameter blocks and $\phi$ is a spatial operation (e.g., the maximum over a block):

$$v_{t}\left[i\right] = \beta_{2}\, v_{t-1}\left[i\right] + \left(1-\beta_{2}\right)\phi\left(g_{t-n}^{2}\left[i\right]\right)$$

$$\theta_{t}\left[i\right] = \theta_{t-1}\left[i\right] - \frac{\alpha_{t}}{\sqrt{v_{t}\left[i\right]}}\, m_{t}\left[i\right]$$
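
To make the temporal shift concrete, below is a minimal NumPy sketch of the updates above for a single parameter tensor. It is an illustrative reading of the formulas, not the authors' reference implementation: the class name `AdaShiftSketch`, the hyperparameter defaults, the added `eps` term, and the choice of the identity for $\phi$ (the paper also considers spatial operations such as the max) are all assumptions.

```python
# Minimal sketch of the AdaShift update for one parameter tensor (NumPy).
# Assumptions: phi defaults to the identity, hyperparameter defaults are
# illustrative, and eps is added for numerical stability (not in the formulas).
from collections import deque

import numpy as np


class AdaShiftSketch:
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, n=10, eps=1e-8,
                 phi=lambda g2: g2):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.n, self.eps, self.phi = n, eps, phi
        self.grads = deque(maxlen=n + 1)  # stores g_{t-n}, ..., g_t
        self.v = None                     # second-moment estimate v_t

    def step(self, theta, grad):
        self.grads.append(grad)
        if len(self.grads) < self.n + 1:
            return theta  # not enough gradient history yet; skip the update

        # m_t: beta1-weighted average of the n most recent gradients g_t, ..., g_{t-n+1}
        recent = list(self.grads)[1:][::-1]        # newest first
        weights = self.beta1 ** np.arange(self.n)  # beta1^0, beta1^1, ...
        m = sum(w * g for w, g in zip(weights, recent)) / weights.sum()

        # v_t: driven by the shifted gradient g_{t-n}, decorrelating it from g_t
        g_shifted = self.grads[0]
        if self.v is None:
            self.v = np.zeros_like(grad)
        self.v = self.beta2 * self.v + (1 - self.beta2) * self.phi(g_shifted ** 2)

        # theta_t = theta_{t-1} - alpha_t / sqrt(v_t) * m_t
        return theta - self.lr * m / (np.sqrt(self.v) + self.eps)
```

Under these assumptions, the key difference from Adam is visible in the `v` update: the second moment is fed by `g_shifted` ($g_{t-n}$) rather than the current gradient, so a large current gradient no longer shrinks its own step size.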