What is: LAMB?

Source: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

LAMB is a layerwise adaptive large-batch optimization technique. It provides a strategy for adapting the learning rate in large-batch settings. LAMB uses Adam as the base algorithm and then forms an update as:

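$$r_{t} = \frac{m_{t}}{\sqrt{v_{t}} + \epsilon}$$

$$x_{t+1}^{(i)} = x_{t}^{(i)} - \eta_{t} \frac{\phi\left(\| x_{t}^{(i)} \|\right)}{\| r_{t}^{(i)} + \lambda x_{t}^{(i)} \|} \left(r_{t}^{(i)} + \lambda x_{t}^{(i)}\right)$$

Here $m_{t}$ and $v_{t}$ are Adam's first and second moment estimates, $x_{t}^{(i)}$ are the parameters of layer $i$ at step $t$, $\eta_{t}$ is the learning rate, $\lambda$ is the weight decay coefficient, and $\phi$ is a scaling function applied to the layer's parameter norm.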

Unlike LARS, the adaptivity of LAMB is two-fold: (i) per-dimension normalization by the square root of the second moment, as used in Adam, and (ii) layerwise normalization obtained from layerwise adaptivity.
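
The two normalizations can be made concrete with a short NumPy sketch of one LAMB step for a single layer. This is an illustration under stated assumptions, not a reference implementation: the function name `lamb_step` and its default hyperparameters are hypothetical, $\phi$ is taken to be the identity, and Adam's bias correction is left out so that the code mirrors the formula as written here.

```python
import numpy as np

def lamb_step(x, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (hypothetical helper).

    x, grad, m, v are arrays of the same shape: the layer's parameters,
    its gradient, and Adam's first and second moment estimates.
    """
    # Adam moment estimates (bias correction omitted in this sketch).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # (i) Per-dimension normalization by the square root of the second moment.
    r = m / (np.sqrt(v) + eps)
    update = r + weight_decay * x

    # (ii) Layerwise normalization. phi is assumed to be the identity,
    # so the scale factor is ||x|| / ||update||.
    x_norm = np.linalg.norm(x)
    u_norm = np.linalg.norm(update)
    scale = x_norm / u_norm if x_norm > 0 and u_norm > 0 else 1.0

    x = x - lr * scale * update
    return x, m, v

# Example: one step on a toy layer (values are illustrative only).
x = np.array([0.5, -1.0, 2.0])
grad = np.array([0.1, -0.2, 0.3])
m = np.zeros_like(x)
v = np.zeros_like(x)
x_new, m, v = lamb_step(x, grad, m, v)
```

In a full optimizer this function would be called once per layer, each with its own `m` and `v`. With $\phi$ as the identity, the length of each layer's step is $\eta_{t} \| x_{t}^{(i)} \|$ regardless of how large or small that layer's gradient is, which is the layerwise part of the adaptivity.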