What is: Squared ReLU?
Source | Primer: Searching for Efficient Transformers for Language Modeling |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Squared ReLU is an activation function used in the feedforward block of the Transformer layer in the Primer architecture. It simply squares the standard ReLU activation, y = max(0, x)^2, so the output is quadratic rather than linear for positive inputs.
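A minimal sketch of the activation in NumPy (the helper name `squared_relu` is our own; the definition is just ReLU squared):

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: square the standard ReLU output, i.e. max(0, x)**2."""
    return np.maximum(0.0, x) ** 2

# Negative inputs map to 0; positive inputs grow quadratically.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(squared_relu(x))  # [0.   0.   0.   0.25 4.  ]
```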
The effectiveness of higher-order polynomials can also be observed in other effective Transformer nonlinearities, such as GLU variants like ReGLU and point-wise activations like approximate GELU. However, squared ReLU has drastically different asymptotics compared to the most commonly used activation functions: ReLU, GELU and Swish. Squared ReLU does have significant overlap with ReGLU, and is in fact equivalent when ReGLU's two weight matrices are the same and squared ReLU is immediately preceded by a linear transformation using that shared weight matrix; a numerical sketch of this equivalence follows below. This leads the authors to believe that squared ReLUs capture the benefits of these GLU variants while being simpler, requiring no additional parameters, and delivering better quality.
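A rough numerical check of that equivalence claim, using our own helper names and the ReGLU form max(0, xW) ⊙ (xV) from the GLU-variants literature (biases omitted): when the two projections share one weight matrix, ReGLU collapses to squared ReLU applied after that linear projection.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def reglu(x, W, V):
    """ReGLU: max(0, xW) elementwise-multiplied by xV (biases omitted)."""
    return relu(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # batch of 4 vectors, model dim 8
W = rng.normal(size=(8, 16))  # shared projection, hidden dim 16

# With V = W, each entry is relu(z) * z, which equals relu(z)**2:
# negative entries are zeroed by the ReLU gate, positive entries are squared.
lhs = reglu(x, W, W)
rhs = relu(x @ W) ** 2
print(np.allclose(lhs, rhs))  # True
```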