What is: Mogrifier LSTM?

Source: Mogrifier LSTM
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

The Mogrifier LSTM is an extension to the LSTM where the LSTM’s input $\mathbf{x}$ is gated conditioned on the output of the previous step $\mathbf{h}_{prev}$. Next, the gated input is used in a similar manner to gate the output of the previous time step. After a couple of rounds of this mutual gating, the last updated $\mathbf{x}$ and $\mathbf{h}_{prev}$ are fed to an LSTM.
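The mutual gating described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. It borrows the gate form $2\sigma(\cdot)$ and the per-round matrices $\mathbf{Q}^{i}$ (odd rounds) and $\mathbf{R}^{i}$ (even rounds) from the detailed definition later in this entry; the function name `mogrify`, the dictionaries keyed by round index, and the example sizes are assumptions made for this sketch.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def mogrify(x, h_prev, Q, R, rounds):
    """Alternately gate x and h_prev for `rounds` steps (hypothetical helper).

    x:      input vector, shape (m,)
    h_prev: previous hidden state, shape (n,)
    Q:      dict {odd i: array of shape (m, n)}, used to gate x from h_prev
    R:      dict {even i: array of shape (n, m)}, used to gate h_prev from x
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # Odd round: rescale x with a gate computed from the current h_prev.
            x = 2.0 * sigmoid(Q[i] @ h_prev) * x
        else:
            # Even round: rescale h_prev with a gate computed from the updated x.
            h_prev = 2.0 * sigmoid(R[i] @ x) * h_prev
    return x, h_prev


# Example with illustrative sizes: m = 4 input features, n = 3 hidden units.
rng = np.random.default_rng(0)
m, n, rounds = 4, 3, 5
Q = {i: 0.01 * rng.standard_normal((m, n)) for i in range(1, rounds + 1, 2)}
R = {i: 0.01 * rng.standard_normal((n, m)) for i in range(2, rounds + 1, 2)}

x = rng.standard_normal(m)
h_prev = rng.standard_normal(n)
x_up, h_up = mogrify(x, h_prev, Q, R, rounds)
# x_up and h_up would then replace x and h_prev as inputs to an ordinary LSTM cell.
```

With `rounds=0` the loop body never runs and both vectors pass through unchanged, so the surrounding model reduces to a plain LSTM.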

In detail, the Mogrifier is an LSTM where two inputs $\mathbf{x}$ and $\mathbf{h}_{prev}$ modulate one another in an alternating fashion before the usual LSTM computation takes place. That is:

$$\text{Mogrify}\left(\mathbf{x}, \mathbf{c}_{prev}, \mathbf{h}_{prev}\right) = \text{LSTM}\left(\mathbf{x}^{\uparrow}, \mathbf{c}_{prev}, \mathbf{h}^{\uparrow}_{prev}\right)$$

where the modulated inputs $\mathbf{x}^{\uparrow}$ and $\mathbf{h}^{\uparrow}_{prev}$ are defined as the highest indexed $\mathbf{x}^{i}$ and $\mathbf{h}^{i}_{prev}$, respectively, from the interleaved sequences:

$$\mathbf{x}^{i} = 2\sigma\left(\mathbf{Q}^{i}\mathbf{h}^{i-1}_{prev}\right) \odot \mathbf{x}^{i-2} \quad \text{for odd } i \in \left[1 \dots r\right]$$

$$\mathbf{h}^{i}_{prev} = 2\sigma\left(\mathbf{R}^{i}\mathbf{x}^{i-1}\right) \odot \mathbf{h}^{i-2}_{prev} \quad \text{for even } i \in \left[1 \dots r\right]$$

with $\mathbf{x}^{-1} = \mathbf{x}$ and $\mathbf{h}^{0}_{prev} = \mathbf{h}_{prev}$. The number of "rounds", $r \in \mathbb{N}$, is a hyperparameter; $r = 0$ recovers the LSTM. Multiplication with the constant 2 ensures that randomly initialized $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices result in transformations close to identity: since $\sigma(0) = 1/2$, a gate $2\sigma(\cdot)$ whose argument is near zero takes values near 1. To reduce the number of additional model parameters, we typically factorize the $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices as products of low-rank matrices: $\mathbf{Q}^{i} = \mathbf{Q}^{i}_{left}\mathbf{Q}^{i}_{right}$ with $\mathbf{Q}^{i} \in \mathbb{R}^{m \times n}$, $\mathbf{Q}^{i}_{left} \in \mathbb{R}^{m \times k}$, $\mathbf{Q}^{i}_{right} \in \mathbb{R}^{k \times n}$, where $k < \min\left(m, n\right)$ is the rank.
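The parameter saving from the low-rank factorization can be made concrete with a small NumPy sketch. The sizes `m`, `n`, `k`, the initialization scale, and the variable names below are illustrative assumptions, not values from the source. The sketch builds $\mathbf{Q}^{i}_{left}$ and $\mathbf{Q}^{i}_{right}$ for a single round, applies them to a hidden state without forming the full $m \times n$ matrix, and compares parameter counts.

```python
import numpy as np

# Illustrative sizes (assumptions): x has m entries, h_prev has n, the rank is k.
m, n, k = 64, 128, 8
assert k < min(m, n)

rng = np.random.default_rng(0)
Q_left = 0.1 * rng.standard_normal((m, k))   # shape (m, k)
Q_right = 0.1 * rng.standard_normal((k, n))  # shape (k, n)

h_prev = rng.standard_normal(n)

# Apply the factors one after the other:
# (k, n) @ (n,) -> (k,), then (m, k) @ (k,) -> (m,).
pre_gate = Q_left @ (Q_right @ h_prev)

# Same result as multiplying by the full m x n product, which has rank at most k.
Q_full = Q_left @ Q_right
same_result = np.allclose(pre_gate, Q_full @ h_prev)

full_params = m * n            # 64 * 128 = 8192
factored_params = k * (m + n)  # 8 * (64 + 128) = 1536
print(same_result, full_params, factored_params)
```

In this example the factored form stores 1536 numbers per matrix instead of 8192, and the saving grows as $k$ shrinks relative to $m$ and $n$.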