What is: AdaShift?
Source | AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
AdaShift is a type of adaptive stochastic optimizer that decorrelates $v_{t}$ and $g_{t}$ in Adam by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_{t}$. The authors argue that an inappropriate correlation between the gradient $g_{t}$ and the second-moment term $v_{t}$ exists in Adam, which makes a large gradient likely to receive a small step size while a small gradient may receive a large step size. They argue that these biased step sizes are the fundamental cause of Adam's non-convergence.
The AdaShift updates, based on the idea of temporal independence between gradients, are as follows. Initialize the second-moment accumulator $v_{0} = 0$ and fix a keep number $n$, decay rates $\beta_{1}, \beta_{2}$, step sizes $\alpha_{t}$, and a spatial operation $\phi$ (e.g., the identity, or a block-wise maximum).

Then for $t = n$ to $T$:

$$ m_{t} = \frac{\sum_{i=0}^{n-1}\beta_{1}^{i}\,g_{t-i}}{\sum_{i=0}^{n-1}\beta_{1}^{i}} $$

$$ v_{t} = \beta_{2}\,v_{t-1} + \left(1-\beta_{2}\right)\phi\left(g_{t-n}^{2}\right) $$

$$ \theta_{t} = \theta_{t-1} - \frac{\alpha_{t}}{\sqrt{v_{t}}}\,m_{t} $$

Here $m_{t}$ depends only on the most recent $n$ gradients $g_{t-n+1}, \ldots, g_{t}$, while $v_{t}$ depends only on the earlier gradient $g_{t-n}$ and its predecessors, so the step size $\frac{\alpha_{t}}{\sqrt{v_{t}}}$ is independent of the current gradient window.
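Below is a minimal NumPy sketch of this update, written against the description above rather than the authors' reference code. The function name `adashift`, the default hyper-parameters, the added `eps` term for numerical stability, and the toy quadratic example are illustrative assumptions; `phi` plays the role of the spatial operation $\phi$ (identity here, or e.g. a block-wise `np.max` for max-AdaShift).

```python
import numpy as np
from collections import deque


def adashift(grad_fn, theta0, num_steps, alpha=0.01, beta1=0.9,
             beta2=0.999, n=10, phi=lambda g2: g2, eps=1e-8):
    """Minimize a function via its gradient oracle `grad_fn` with AdaShift.

    The latest n gradients (g_{t-n+1}, ..., g_t) form the window used for the
    first moment m_t; the gradient that falls out of the window, g_{t-n}, is
    the only one used to update the second moment v_t, so the two moments are
    computed from disjoint (temporally independent) gradients.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)                    # second-moment accumulator, v_0 = 0
    window = deque(maxlen=n + 1)                # holds g_{t-n}, ..., g_t
    w = beta1 ** np.arange(n - 1, -1, -1)       # weights beta1^{n-1}, ..., beta1^0
    w = w / w.sum()                             # normalized so they sum to 1

    for _ in range(num_steps):
        window.append(grad_fn(theta))
        if len(window) <= n:                    # wait until g_{t-n} is available
            continue
        g_shifted = window[0]                   # temporally shifted gradient g_{t-n}
        recent = np.stack(list(window)[1:])     # g_{t-n+1}, ..., g_t (oldest first)
        m = np.tensordot(w, recent, axes=1)     # first moment from the recent window
        v = beta2 * v + (1.0 - beta2) * phi(g_shifted ** 2)  # uses g_{t-n} only
        theta -= alpha * m / (np.sqrt(v) + eps)  # Adam-style step with decorrelated v
    return theta


# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_min = adashift(grad_fn=lambda x: x, theta0=np.ones(5), num_steps=2000)
```

Keeping a deque of the last $n+1$ gradients makes the decorrelation explicit: the entry about to be discarded is exactly the one that feeds $v_{t}$, while the remaining $n$ entries feed $m_{t}$.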