What is: FastPitch?
Source | FastPitch: Parallel Text-to-speech with Pitch Prediction |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
FastPitch is a fully parallel text-to-speech model based on FastSpeech and conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is composed mainly of two feed-forward Transformer (FFTr) stacks: the first operates at the resolution of the input tokens, the second at the resolution of the output frames. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be the sequence of input lexical units, and $\mathbf{y} = (y_1, \ldots, y_t)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\mathbf{h} = \text{FFTr}(\mathbf{x})$. The hidden representation $\mathbf{h}$ is used to predict the duration and average pitch of every character with a 1-D CNN:

$$\hat{\mathbf{d}} = \text{DurationPredictor}(\mathbf{h}), \qquad \hat{\mathbf{p}} = \text{PitchPredictor}(\mathbf{h}),$$
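where $\hat{\mathbf{d}} \in \mathbb{N}^n$ holds the predicted durations and $\hat{\mathbf{p}} \in \mathbb{R}^n$ the predicted pitch values. Next, the pitch is projected to match the dimensionality of the hidden representation $\mathbf{h} \in \mathbb{R}^{n \times d}$ and added to $\mathbf{h}$. The resulting sum $\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:

$$\mathbf{g} = \mathbf{h} + \text{PitchEmbedding}(\mathbf{p}),$$

$$\hat{\mathbf{y}} = \text{FFTr}\big([\underbrace{g_1, \ldots, g_1}_{d_1}, \ldots, \underbrace{g_n, \ldots, g_n}_{d_n}]\big).$$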
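The pitch conditioning and discrete upsampling described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names `add_pitch` and `upsample` are hypothetical, the vector `w` is a simple linear stand-in for the learned pitch projection, and all numeric values are made up for the example.

```python
from typing import List

Vector = List[float]


def add_pitch(h: List[Vector], pitch: List[float], w: Vector) -> List[Vector]:
    """Project each scalar pitch value to the hidden size and add it to h.

    Assumption: the projection is modeled as p_i * w, a linear stand-in
    for whatever learned projection the real model uses.
    """
    if len(h) != len(pitch):
        raise ValueError("need exactly one pitch value per input token")
    return [
        [h_ij + p_i * w_j for h_ij, w_j in zip(h_i, w)]
        for h_i, p_i in zip(h, pitch)
    ]


def upsample(g: List[Vector], durations: List[int]) -> List[Vector]:
    """Repeat the i-th token vector durations[i] times (token rate to frame rate)."""
    if len(g) != len(durations):
        raise ValueError("need exactly one duration per input token")
    frames: List[Vector] = []
    for g_i, d_i in zip(g, durations):
        frames.extend(list(g_i) for _ in range(d_i))
    return frames


# Toy input: n = 3 tokens, hidden size 2 (illustrative values only).
h = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
pitch = [1.0, 0.0, -1.0]
w = [0.5, 0.5]
durations = [2, 0, 3]  # a token with duration 0 contributes no frames

g = add_pitch(h, pitch, w)
frames = upsample(g, durations)
print(len(frames))  # one row per output frame
```

The number of rows in `frames` equals the sum of the durations, which is what lets the second stack run at the resolution of the output frames.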
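Ground-truth pitch $\mathbf{p}$ and durations $\mathbf{d}$ are used during training, and the predicted $\hat{\mathbf{p}}$ and $\hat{\mathbf{d}}$ are used during inference. The model optimizes the mean-squared error (MSE) between the predicted and ground-truth modalities:

$$\mathcal{L} = \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 + \alpha \|\hat{\mathbf{p}} - \mathbf{p}\|_2^2 + \gamma \|\hat{\mathbf{d}} - \mathbf{d}\|_2^2,$$

where $\hat{\mathbf{y}}$ is the predicted mel-spectrogram and $\alpha$ and $\gamma$ weight the pitch and duration terms.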
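As a rough sketch of the training objective, the snippet below sums squared errors over the three modalities (spectrogram frames, pitch, durations). The function names, the two weights `alpha` and `gamma`, and all numbers are assumptions chosen for illustration. It uses a plain sum of squares, whereas practical implementations typically average over elements.

```python
from typing import List, Sequence


def squared_l2(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def flatten(frames: List[List[float]]) -> List[float]:
    """Flatten a list of spectrogram frames into one long vector."""
    return [value for frame in frames for value in frame]


def combined_loss(y_hat, y, p_hat, p, d_hat, d, alpha: float, gamma: float) -> float:
    """Spectrogram error plus weighted pitch and duration errors."""
    return (
        squared_l2(flatten(y_hat), flatten(y))
        + alpha * squared_l2(p_hat, p)
        + gamma * squared_l2(d_hat, d)
    )


# Toy values: 2 frames with 2 mel bins each, 2 input tokens.
y_hat = [[1.0, 2.0], [3.0, 4.0]]
y = [[1.0, 2.0], [3.0, 6.0]]   # spectrogram term: (4 - 6)^2 = 4.0
p_hat, p = [1.0, 2.0], [0.0, 2.0]  # pitch term: 1.0
d_hat, d = [2.0, 3.0], [2.0, 5.0]  # duration term: 4.0

# alpha and gamma are arbitrary illustrative weights, not published settings.
loss = combined_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.5, gamma=0.25)
print(loss)  # 4.0 + 0.5 * 1.0 + 0.25 * 4.0 = 5.5
```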