What is: Deep Voice 3?

Deep Voice 3 (DV3) is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:

Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.
Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.
Converter: A fully-convolutional post-processing network, which predicts final vocoder parameters (depending on the vocoder choice) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.
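The causal versus non-causal distinction between the decoder and the converter can be illustrated with a toy one-dimensional convolution. This is a minimal sketch in plain Python, not the Deep Voice 3 implementation: the function name `conv1d`, the zero-padding scheme, and the example filter are assumptions made for illustration only.

```python
def conv1d(x, w, causal):
    """Toy 1-D convolution (cross-correlation) with zero padding.

    Hypothetical helper for illustration; not taken from the paper.
    causal=True  pads only on the left, so output[t] depends on x[0..t] only.
    causal=False pads on both sides, so output[t] also sees future samples.
    """
    k = len(w)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


w = [1, 1, 1]                 # assumed width-3 filter
x = [1, 2, 3, 4]
x_future = [1, 2, 3, 100]     # same sequence, only the last sample changed

# Decoder-style (causal): earlier outputs ignore the changed future sample.
causal_a = conv1d(x, w, causal=True)          # [1.0, 3.0, 6.0, 9.0]
causal_b = conv1d(x_future, w, causal=True)
print(causal_a[:3] == causal_b[:3])           # True

# Converter-style (non-causal): output at t=2 already reflects the sample at t=3.
noncausal_a = conv1d(x, w, causal=False)      # [3.0, 6.0, 9.0, 7.0]
noncausal_b = conv1d(x_future, w, causal=False)
print(noncausal_a[2] == noncausal_b[2])       # False
```

Left-only padding is what allows a decoder of this kind to run autoregressively, since each output frame can be computed before later frames exist. The converter runs after decoding has finished, so it is free to look in both directions.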

The overall training objective is a linear combination of the decoder loss and the converter loss. The authors keep the decoder and converter separate and train them jointly as a multi-task problem, because this makes attention easier to learn in practice. Specifically, the attention mechanism receives gradients from mel-spectrogram prediction in addition to those from vocoder parameter prediction, so the mel-spectrogram loss helps guide attention training.
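One way to write this objective down, as a sketch only: the symbols below are assumed notation rather than the paper's, and the weights are left unspecified because the text above says only that the combination is linear.

```latex
% Assumed notation: L_dec = decoder (mel-spectrogram) loss,
% L_conv = converter (vocoder parameter) loss,
% lambda_dec, lambda_conv = unspecified scalar weights,
% theta_att = attention parameters.
\mathcal{L} = \lambda_{\mathrm{dec}} \, \mathcal{L}_{\mathrm{dec}}
            + \lambda_{\mathrm{conv}} \, \mathcal{L}_{\mathrm{conv}}

\frac{\partial \mathcal{L}}{\partial \theta_{\mathrm{att}}}
  = \lambda_{\mathrm{dec}} \, \frac{\partial \mathcal{L}_{\mathrm{dec}}}{\partial \theta_{\mathrm{att}}}
  + \lambda_{\mathrm{conv}} \, \frac{\partial \mathcal{L}_{\mathrm{conv}}}{\partial \theta_{\mathrm{att}}}
```

The second line restates the multi-task argument: because the converter consumes decoder hidden states, both loss terms contribute gradients to the attention parameters, and the mel-spectrogram term supplies a direct training signal for attention.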

Source	Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Year	2017
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software
