WaveRNN is a single-layer recurrent neural network for audio generation, designed to efficiently predict 16-bit raw audio samples.
The overall computation in the WaveRNN is as follows (biases omitted for brevity):
$$
\begin{aligned}
\mathbf{x}_t &= [\mathbf{c}_{t-1}, \mathbf{f}_{t-1}, \mathbf{c}_t]\\
\mathbf{u}_t &= \sigma(\mathbf{R}_u \mathbf{h}_{t-1} + \mathbf{I}^*_u \mathbf{x}_t)\\
\mathbf{r}_t &= \sigma(\mathbf{R}_r \mathbf{h}_{t-1} + \mathbf{I}^*_r \mathbf{x}_t)\\
\mathbf{e}_t &= \tau(\mathbf{r}_t \odot (\mathbf{R}_e \mathbf{h}_{t-1}) + \mathbf{I}^*_e \mathbf{x}_t)\\
\mathbf{h}_t &= \mathbf{u}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{u}_t) \odot \mathbf{e}_t\\
\mathbf{y}_c, \mathbf{y}_f &= \mathrm{split}(\mathbf{h}_t)\\
P(\mathbf{c}_t) &= \mathrm{softmax}(\mathbf{O}_2\,\mathrm{relu}(\mathbf{O}_1 \mathbf{y}_c))\\
P(\mathbf{f}_t) &= \mathrm{softmax}(\mathbf{O}_4\,\mathrm{relu}(\mathbf{O}_3 \mathbf{y}_f))
\end{aligned}
$$
where the ∗ indicates a masked matrix: the last coarse input c_t is connected only to the fine part of the states u_t, r_t, e_t and h_t, and thus affects only the fine output y_f. The coarse and fine parts c_t and f_t are encoded as scalars in [0, 255] and scaled to the interval [−1, 1]. The matrix R, formed by stacking the matrices R_u, R_r and R_e, is applied in a single matrix-vector product that produces the contributions to all three gates u_t, r_t and e_t (a variant of the GRU cell). σ and τ are the standard sigmoid and tanh non-linearities.
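To make the recurrence concrete, here is a minimal NumPy sketch of one step of the cell (biases omitted, as in the text). The function and parameter names (`wavernn_cell`, `params["R"]`, `params["Iu"]`, etc.) are my own, not from the original; the mask on the input matrices zeroes the c_t column for the coarse half of the state, and the single stacked matrix R supplies all three gate contributions in one matrix-vector product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mask_input(I, H):
    """The * mask: zero the c_t column for the coarse half of the state,
    so the current coarse sample only feeds the fine part."""
    I = I.copy()
    I[: H // 2, 2] = 0.0
    return I

def wavernn_cell(h_prev, c_prev, f_prev, c_cur, params):
    """One step of the GRU variant described above (sketch).

    h_prev          : previous state h_{t-1}, shape (H,)
    c_prev, f_prev  : c_{t-1}, f_{t-1}, scalars scaled to [-1, 1]
    c_cur           : current coarse sample c_t, scaled to [-1, 1]
    params          : "R" of shape (3H, H) stacking R_u, R_r, R_e;
                      "Iu", "Ir", "Ie" of shape (H, 3).
    """
    H = h_prev.shape[0]
    x = np.array([c_prev, f_prev, c_cur])            # x_t = [c_{t-1}, f_{t-1}, c_t]

    # One matrix-vector product yields the contributions to all three gates.
    Rh = params["R"] @ h_prev
    Ru_h, Rr_h, Re_h = Rh[:H], Rh[H:2 * H], Rh[2 * H:]

    u = sigmoid(Ru_h + mask_input(params["Iu"], H) @ x)      # update gate u_t
    r = sigmoid(Rr_h + mask_input(params["Ir"], H) @ x)      # reset gate r_t
    e = np.tanh(r * Re_h + mask_input(params["Ie"], H) @ x)  # candidate e_t
    h = u * h_prev + (1.0 - u) * e                           # new state h_t

    y_c, y_f = h[: H // 2], h[H // 2:]                       # split(h_t)
    return h, y_c, y_f
```

In a full model, y_c and y_f would then feed the O_1…O_4 projections and the two softmax layers; here the sketch stops at the split.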
Each part feeds into a softmax layer over the corresponding 8 bits, and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces (2^8 values each) instead of a single large output space (with 2^16 values).
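The bit bookkeeping can be sketched as follows; the helper names are my own. The coarse byte is the high half of the 16-bit sample and the fine byte the low half, and each 8-bit value is scaled to [−1, 1] before being fed back into the network.

```python
def combine_16bit(c, f):
    """Coarse c and fine f are integers in [0, 255]; the 16-bit sample
    uses c as the high byte and f as the low byte."""
    return 256 * c + f          # result lies in [0, 65535]

def scale(v):
    """Scale an 8-bit value in [0, 255] to the interval [-1, 1]."""
    return 2.0 * v / 255.0 - 1.0

# Output-space comparison: two softmaxes of 2**8 logits each (512 total)
# replace a single softmax over 2**16 = 65536 logits.
sample = combine_16bit(200, 17)   # -> 51217
```

Because the fine softmax is conditioned on the coarse byte, sampling proceeds in two passes per timestep: draw c_t from P(c_t), feed scale(c_t) back in, then draw f_t from P(f_t).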