What is: SRU?
Source | Simple Recurrent Units for Highly Parallelizable Recurrence |
Year | 2018 |
Data Source | CC BY-SA - https://paperswithcode.com |
SRU, or Simple Recurrent Unit, is a recurrent neural unit with a light form of recurrence. SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs.
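To make this concrete, below is a minimal NumPy sketch (an illustration, not the paper's CUDA implementation) of the general pattern: the matrix multiplications are batched over all time steps, and the remaining sequential loop contains only element-wise operations, so each hidden dimension evolves independently. The function and parameter names are made up for this example, and the gate is simplified relative to the full SRU equations given below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_parallel_then_elementwise(x, W, W_f, b_f):
    """Structural sketch only: x is (T, d_in); W, W_f are (d, d_in); b_f is (d,)."""
    # (1) Parallel part: matrix multiplications cover the whole sequence at once.
    u = x @ W.T              # (T, d) candidate values for every time step
    f_pre = x @ W_f.T + b_f  # (T, d) gate pre-activations for every time step

    # (2) Sequential part: only element-wise operations remain, so each of the
    #     d hidden dimensions is an independent scalar recurrence.
    c = np.zeros(W.shape[0])
    states = []
    for t in range(x.shape[0]):
        f = sigmoid(f_pre[t])
        c = f * c + (1.0 - f) * u[t]  # no matrix-vector product inside the loop
        states.append(c)
    return np.stack(states)
```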
SRU also replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity while using less computation (and fewer hyper-parameters). Additionally, SRU improves the training of deep recurrent models by employing highway connections and a parameter initialization scheme tailored for gradient propagation in deep architectures.
A single layer of SRU involves the following computation:

$$f_t = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f)$$

$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t)$$

$$r_t = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r)$$

$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t$$

where $W$, $W_f$ and $W_r$ are parameter matrices and $v_f$, $v_r$, $b_f$ and $b_r$ are parameter vectors to be learnt during training. The complete architecture decomposes into two sub-components: a light recurrence (the first two equations) and a highway network (the last two).
The light recurrence component successively reads the input vectors $x_t$ and computes the sequence of states $c_t$ capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN. Specifically, a forget gate $f_t$ controls the information flow, and the state vector $c_t$ is determined by adaptively averaging the previous state $c_{t-1}$ and the current observation $W x_t$ according to $f_t$.
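Putting the pieces together, the following is a minimal, unoptimized NumPy sketch of a single SRU layer following the equations above (in contrast to the fused CUDA kernels mentioned earlier). It assumes the input and hidden dimensions are equal, so the highway term $(1 - r_t) \odot x_t$ is well defined; all parameter names and the tiny usage example are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """Single SRU layer over a sequence x of shape (T, d).

    W, W_f, W_r: (d, d) parameter matrices; v_f, v_r, b_f, b_r: (d,) vectors.
    Returns the outputs h (T, d) and the internal states c (T, d).
    """
    T, d = x.shape
    # The matrix multiplications do not depend on the recurrence, so they are
    # computed for every time step at once (the time-parallel part).
    u   = x @ W.T
    u_f = x @ W_f.T
    u_r = x @ W_r.T

    c_prev = np.zeros(d)
    hs, cs = [], []
    for t in range(T):
        # Light recurrence: forget gate, then adaptive average of c_{t-1} and W x_t.
        f = sigmoid(u_f[t] + v_f * c_prev + b_f)
        r = sigmoid(u_r[t] + v_r * c_prev + b_r)
        c_t = f * c_prev + (1.0 - f) * u[t]
        # Highway network: reset gate mixes the new state with the raw input x_t.
        h_t = r * c_t + (1.0 - r) * x[t]
        cs.append(c_t)
        hs.append(h_t)
        c_prev = c_t
    return np.stack(hs), np.stack(cs)

# Tiny usage example with random parameters (shapes only; values are arbitrary).
rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.standard_normal((T, d))
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)] + \
         [rng.standard_normal(d) * 0.1 for _ in range(4)]
h, c = sru_layer(x, *params)
print(h.shape, c.shape)  # (5, 4) (5, 4)
```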