
What is: WaveRNN?

Source: Efficient Neural Audio Synthesis
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

WaveRNN is a single-layer recurrent neural network for audio generation, designed to efficiently predict 16-bit raw audio samples.

The overall computation in the WaveRNN is as follows (biases omitted for brevity):

$$\mathbf{x}_{t} = \left[\mathbf{c}_{t-1}, \mathbf{f}_{t-1}, \mathbf{c}_{t}\right]$$

$$\mathbf{u}_{t} = \sigma\left(\mathbf{R}_{u}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{u}\mathbf{x}_{t}\right)$$

$$\mathbf{r}_{t} = \sigma\left(\mathbf{R}_{r}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{r}\mathbf{x}_{t}\right)$$

$$\mathbf{e}_{t} = \tau\left(\mathbf{r}_{t} \odot \left(\mathbf{R}_{e}\mathbf{h}_{t-1}\right) + \mathbf{I}^{*}_{e}\mathbf{x}_{t}\right)$$

$$\mathbf{h}_{t} = \mathbf{u}_{t} \cdot \mathbf{h}_{t-1} + \left(1 - \mathbf{u}_{t}\right) \cdot \mathbf{e}_{t}$$

$$\mathbf{y}_{c}, \mathbf{y}_{f} = \text{split}\left(\mathbf{h}_{t}\right)$$

$$P\left(\mathbf{c}_{t}\right) = \text{softmax}\left(\mathbf{O}_{2}\,\text{relu}\left(\mathbf{O}_{1}\mathbf{y}_{c}\right)\right)$$

$$P\left(\mathbf{f}_{t}\right) = \text{softmax}\left(\mathbf{O}_{4}\,\text{relu}\left(\mathbf{O}_{3}\mathbf{y}_{f}\right)\right)$$

where the $*$ indicates a masked matrix whereby the last coarse input $\mathbf{c}_{t}$ is connected only to the fine part of the states $\mathbf{u}_{t}$, $\mathbf{r}_{t}$, $\mathbf{e}_{t}$ and $\mathbf{h}_{t}$, and thus only affects the fine output $\mathbf{y}_{f}$. The coarse and fine parts $\mathbf{c}_{t}$ and $\mathbf{f}_{t}$ are encoded as scalars in $[0, 255]$ and scaled to the interval $[-1, 1]$. The matrix $\mathbf{R}$ formed from the matrices $\mathbf{R}_{u}$, $\mathbf{R}_{r}$ and $\mathbf{R}_{e}$ is computed as a single matrix-vector product to produce the contributions to all three gates $\mathbf{u}_{t}$, $\mathbf{r}_{t}$ and $\mathbf{e}_{t}$ (a variant of the GRU cell). $\sigma$ and $\tau$ are the standard sigmoid and tanh non-linearities.
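As an illustration, here is a minimal NumPy sketch of one WaveRNN step under these equations. The class name, weight initialisation, and the choice of hidden size $n$ (which must be even so the state splits into coarse and fine halves) are our assumptions for demonstration; only the gate arithmetic and the mask on $\mathbf{I}$ follow the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class WaveRNNCell:
    """One step of the WaveRNN recurrence (illustrative sketch, biases omitted)."""

    def __init__(self, n, seed=0):
        assert n % 2 == 0, "state splits into coarse/fine halves"
        rng = np.random.default_rng(seed)
        self.n = n
        # R stacks R_u, R_r, R_e so a single matrix-vector product
        # yields the recurrent contribution to all three gates.
        self.R = 0.01 * rng.standard_normal((3 * n, n))
        # I maps the 3-dim input [c_{t-1}, f_{t-1}, c_t] to all three gates.
        I = 0.01 * rng.standard_normal((3 * n, 3))
        # Mask I*: the current coarse sample c_t (input column 2) may
        # only drive the fine (second) half of each gate.
        mask = np.ones((3 * n, 3))
        for g in range(3):
            mask[g * n : g * n + n // 2, 2] = 0.0
        self.I = I * mask
        # Output heads O_1..O_4 for the two 256-way softmaxes.
        self.O1 = 0.01 * rng.standard_normal((n // 2, n // 2))
        self.O2 = 0.01 * rng.standard_normal((256, n // 2))
        self.O3 = 0.01 * rng.standard_normal((n // 2, n // 2))
        self.O4 = 0.01 * rng.standard_normal((256, n // 2))

    def step(self, h, c_prev, f_prev, c_t):
        n = self.n
        x = np.array([c_prev, f_prev, c_t])       # all scaled to [-1, 1]
        Rh = self.R @ h                            # one product, three gates
        Ix = self.I @ x
        u = sigmoid(Rh[:n] + Ix[:n])               # update gate
        r = sigmoid(Rh[n:2 * n] + Ix[n:2 * n])     # reset gate
        e = np.tanh(r * Rh[2 * n:] + Ix[2 * n:])   # candidate state
        h = u * h + (1.0 - u) * e
        y_c, y_f = h[:n // 2], h[n // 2:]          # split(h_t)
        P_c = softmax(self.O2 @ np.maximum(0.0, self.O1 @ y_c))
        P_f = softmax(self.O4 @ np.maximum(0.0, self.O3 @ y_f))
        return h, P_c, P_f
```

Because the mask zeroes the path from $\mathbf{c}_{t}$ into the coarse half of the state, $P(\mathbf{c}_{t})$ is unaffected by the value passed as `c_t`; at sampling time the coarse byte can therefore be sampled first with a placeholder `c_t` and then fed back in to compute $P(\mathbf{f}_{t})$.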

Each part feeds into a softmax layer over the corresponding 8 bits, and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces ($2^8 = 256$ values each) instead of a single large output space (with $2^{16} = 65536$ values).
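To make the encoding concrete, the short sketch below (the helper names are ours, not from the paper) splits an unsigned 16-bit sample into its coarse and fine bytes and scales each to the $[-1, 1]$ input range:

```python
def split_sample(s):
    """Split an unsigned 16-bit sample into coarse/fine bytes in [0, 255]."""
    return s // 256, s % 256   # top 8 bits, bottom 8 bits

def scale(v):
    """Scale a byte in [0, 255] to the network's [-1, 1] input range."""
    return 2.0 * v / 255.0 - 1.0

c, f = split_sample(40000)     # -> (156, 64), since 156 * 256 + 64 == 40000
print(scale(c), scale(f))      # the inputs fed to the cell at the next step
```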