
What is: Gaussian Error Linear Units?

Source: Gaussian Error Linear Units (GELUs)
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

The Gaussian Error Linear Unit, or GELU, is an activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gating them by their sign as ReLUs do ($x\mathbf{1}_{x>0}$). Consequently, the GELU can be thought of as a smoother ReLU.

$$\text{GELU}\left(x\right) = x\,P\left(X\leq{x}\right) = x\Phi\left(x\right) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(x/\sqrt{2}\right)\right], \quad \text{if } X\sim \mathcal{N}(0,1).$$
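As a minimal sketch of this exact form (assuming only the Python standard library; the function name `gelu_exact` is illustrative, not from any particular framework), the GELU can be evaluated directly through the error function:

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF
    expressed via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Large positive inputs pass through almost unchanged, while large
# negative inputs are squashed toward zero (smoothly, unlike ReLU).
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu_exact(x):+.4f}")
```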

One can approximate the GELU with $0.5x\left(1+\tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$ or $x\sigma\left(1.702x\right)$, but PyTorch's exact implementation is sufficiently fast that these approximations may be unnecessary. (See also the SiLU, $x\sigma(x)$, which was also coined in the paper that introduced the GELU.)
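The two approximations can be compared against the exact form with a short sketch like the one below (a standard-library-only illustration under assumed function names, not PyTorch's implementation):

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x) via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

# The tanh variant tracks the exact GELU closely over typical activation
# ranges; the sigmoid variant is coarser but cheaper to compute.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  "
          f"tanh={gelu_tanh(x):+.4f}  sigmoid={gelu_sigmoid(x):+.4f}")
```

In recent PyTorch versions, the built-in `torch.nn.GELU` computes the exact form by default and also exposes the tanh approximation as an option.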

GELUs are used in GPT-3, BERT, and most other Transformers.