What is: SRU++?
Source | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
SRU++ is a self-attentive recurrent unit that combines fast recurrence and attention for sequence modeling, extending the SRU unit. The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Specifically, given the input sequence represented as a matrix $\mathbf{X} \in \mathbb{R}^{L \times d}$, the attention component computes the query, key and value representations using the following multiplications:

$$\mathbf{Q} = \mathbf{W}^{q} \mathbf{X}^\top, \qquad \mathbf{K} = \mathbf{W}^{k} \mathbf{Q}, \qquad \mathbf{V} = \mathbf{W}^{v} \mathbf{Q}$$
where $\mathbf{W}^{q}$, $\mathbf{W}^{k}$ and $\mathbf{W}^{v}$ are model parameters. $d'$ is the attention dimension, which is typically much smaller than the model dimension $d$. Note that the keys $\mathbf{K}$ and values $\mathbf{V}$ are computed using $\mathbf{Q}$ instead of $\mathbf{X}$, so that the weight matrices $\mathbf{W}^{k}$ and $\mathbf{W}^{v}$ are significantly smaller ($d' \times d'$ instead of $d' \times d$).
Next, we compute a weighted average output $\mathbf{A} \in \mathbb{R}^{d' \times L}$ using scaled dot-product attention:

$$\mathbf{A}^\top = \operatorname{softmax}\!\left(\frac{\mathbf{Q}^\top \mathbf{K}}{\sqrt{d'}}\right) \mathbf{V}^\top$$
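As a rough illustration of these equations, here is a minimal PyTorch-style sketch; the tensor names and the sizes `L`, `d` and `d_attn` (standing in for $d'$) are illustrative choices, not taken from the paper's released implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (not from the paper): sequence length L, model width d,
# and a much smaller attention dimension d'.
L, d, d_attn = 128, 1024, 256

X = torch.randn(L, d)                        # input sequence as an L x d matrix

W_q = torch.randn(d_attn, d) / d ** 0.5      # W^q maps d -> d'
W_k = torch.randn(d_attn, d_attn) / d_attn ** 0.5  # W^k and W^v act on Q, so they are only d' x d'
W_v = torch.randn(d_attn, d_attn) / d_attn ** 0.5

Q = W_q @ X.T                                # d' x L
K = W_k @ Q                                  # d' x L, computed from Q rather than X
V = W_v @ Q                                  # d' x L

# Scaled dot-product attention (causal masking, as used for language modeling, omitted).
scores = Q.T @ K / d_attn ** 0.5             # L x L attention scores
A = (F.softmax(scores, dim=-1) @ V.T).T      # A is d' x L, a weighted average of the values
```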
The final output $\mathbf{U}$ required by the elementwise recurrence is obtained by another linear projection:

$$\mathbf{U}^\top = \mathbf{W}^{o} \left(\mathbf{Q} + \alpha \cdot \mathbf{A}\right)$$
where $\alpha \in \mathbb{R}$ is a learned scalar and $\mathbf{W}^{o}$ is a parameter matrix. $\mathbf{Q} + \alpha \cdot \mathbf{A}$ is a residual connection which improves gradient propagation and stabilizes training. We initialize $\alpha$ to zero, and as a result

$$\mathbf{U}^\top = \mathbf{W}^{o} \mathbf{Q} = \left(\mathbf{W}^{o} \mathbf{W}^{q}\right) \mathbf{X}^\top$$

initially falls back to a linear transformation of the input, skipping the attention transformation. Intuitively, skipping attention encourages the model to leverage recurrence to capture sequential patterns during the early stage of training. As $|\alpha|$ grows, the attention mechanism can learn long-range dependencies for the model. In addition, $\mathbf{W}^{o} \mathbf{W}^{q}$ can be interpreted as applying a matrix factorization trick with a small inner dimension $d'$, reducing the total number of parameters. The figure compares SRU, SRU with this factorization trick (but without attention), and SRU++.
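Continuing the sketch, the projection back to the recurrence input and the zero-initialized scalar $\alpha$ could look as follows; the output width `d_out` is a placeholder for whatever size the elementwise SRU recurrence expects:

```python
d_out = 3 * d                          # placeholder: the exact multiple of d depends on the SRU cell's gates
W_o = torch.randn(d_out, d_attn) / d_attn ** 0.5
alpha = torch.zeros(())                # learned scalar, initialized to zero

U = (W_o @ (Q + alpha * A)).T          # L x d_out, fed to the elementwise recurrence

# With alpha == 0, U^T equals (W^o W^q) X^T: a plain linear map of the input with a small
# inner dimension d', i.e. the factorization trick described above, with no attention at all.
```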
The last modification is adding layer normalization to each SRU++ layer. We apply normalization after the attention operation and before the matrix multiplication with $\mathbf{W}^{o}$:

$$\mathbf{U}^\top = \mathbf{W}^{o}\, \operatorname{LayerNorm}\!\left(\mathbf{Q} + \alpha \cdot \mathbf{A}\right)$$
This implementation is post-layer normalization in which the normalization is added after the residual connection.
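A final sketch of where the normalization sits in this post-LN arrangement, using `torch.nn.functional.layer_norm` as a stand-in for the layer's learned normalization:

```python
# Normalize after the attention residual, just before the multiplication with W^o.
H = Q + alpha * A                                     # d' x L residual branch
H = F.layer_norm(H.T, normalized_shape=(d_attn,)).T   # normalize each position over its d' features
U = (W_o @ H).T                                       # L x d_out output for the recurrence
```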