What is: Deep Voice 3?
Source | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning |
Year | 2017 |
Data Source | CC BY-SA - https://paperswithcode.com |
Deep Voice 3 (DV3) is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:
- Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.
- Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.
- Converter: A fully-convolutional post-processing network, which predicts final vocoder parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.
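The causal versus non-causal distinction between the decoder and the converter can be illustrated with a toy 1-D convolution. This is a minimal sketch in plain Python, not the Deep Voice 3 implementation: the function name `conv1d`, the zero-padding scheme, and the example values are assumptions made purely for illustration.

```python
def conv1d(signal, kernel, causal):
    """Length-preserving 1-D convolution with zero padding (hypothetical helper).

    As in most deep learning code, the kernel is not flipped (cross-correlation).
    causal=True pads only on the left, so output[t] depends on signal[0..t].
    causal=False pads on both sides, so output[t] can also see later frames.
    """
    k = len(kernel)
    if causal:
        left, right = k - 1, 0
    else:
        left = (k - 1) // 2
        right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


kernel = [1.0, 1.0, 1.0]
frames = [1.0, 2.0, 3.0, 4.0]
changed = [1.0, 2.0, 3.0, 99.0]  # only the last (future) frame differs

# Decoder-style (causal): earlier outputs ignore the changed future frame.
causal_a = conv1d(frames, kernel, causal=True)
causal_b = conv1d(changed, kernel, causal=True)

# Converter-style (non-causal): the output at t=2 already sees frame 3.
noncausal_a = conv1d(frames, kernel, causal=False)
noncausal_b = conv1d(changed, kernel, causal=False)
```

Here `causal_a[:3]` and `causal_b[:3]` are identical, while `noncausal_a[2]` and `noncausal_b[2]` differ. The first property is what allows the decoder to generate frames autoregressively; the second is why the converter, which runs over the decoder hidden states, can use context from later positions.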
The overall training objective is a linear combination of the decoder loss and the converter loss. The authors separate the decoder and the converter and train them jointly as a multi-task problem because this makes attention learning easier in practice. Specifically, the mel-spectrogram prediction loss guides the attention mechanism: attention receives gradients from mel-spectrogram prediction in addition to those from vocoder parameter prediction.
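As a rough illustration of a linear combination of two task losses, the sketch below sums a decoder term and a converter term with scalar weights. The function names, the use of a mean absolute error for both terms, and the weights `w_decoder` and `w_converter` are assumptions for illustration; the text above does not specify the individual loss functions or their weights, and the converter loss in particular depends on the vocoder choice.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences (placeholder loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(mel_pred, mel_target, voc_pred, voc_target,
               w_decoder=1.0, w_converter=1.0):
    """Linear combination of a decoder loss and a converter loss.

    The weights are hypothetical hyperparameters, not values from the source.
    """
    decoder_loss = l1_loss(mel_pred, mel_target)      # mel-spectrogram prediction
    converter_loss = l1_loss(voc_pred, voc_target)    # vocoder parameter prediction
    return w_decoder * decoder_loss + w_converter * converter_loss


# Toy values: decoder term is 1.0, converter term is 0.5.
loss = total_loss([1.0, 2.0], [1.0, 4.0], [0.0], [0.5])
```

Because both terms contribute to one scalar, a single backward pass in an autodiff framework would send gradients from both prediction tasks into the shared layers, which is the multi-task effect on attention described above.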