
What is: FastPitch?

Source: FastPitch: Parallel Text-to-speech with Pitch Prediction
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is composed mainly of two feed-forward Transformer (FFTr) stacks. The first operates at the resolution of the input tokens, the second at the resolution of the output frames. Let $\mathbf{x}=\left(x_{1}, \ldots, x_{n}\right)$ be the sequence of input lexical units, and $\mathbf{y}=\left(y_{1}, \ldots, y_{t}\right)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\mathbf{h}=\operatorname{FFTr}(\mathbf{x})$. The hidden representation $\mathbf{h}$ is used to predict the duration and average pitch of every character with a 1-D CNN:

$$\hat{\mathbf{d}}=\operatorname{DurationPredictor}(\mathbf{h}), \quad \hat{\mathbf{p}}=\operatorname{PitchPredictor}(\mathbf{h})$$

where $\hat{\mathbf{d}} \in \mathbb{N}^{n}$ and $\hat{\mathbf{p}} \in \mathbb{R}^{n}$. Next, the pitch is projected to match the dimensionality of the hidden representation $\mathbf{h} \in \mathbb{R}^{n \times d}$ and added to $\mathbf{h}$. The resulting sum $\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:

$$\mathbf{g}=\mathbf{h}+\operatorname{PitchEmbedding}(\mathbf{p})$$

$$\hat{\mathbf{y}}=\operatorname{FFTr}\left([\underbrace{g_{1}, \ldots, g_{1}}_{d_{1}}, \ldots, \underbrace{g_{n}, \ldots, g_{n}}_{d_{n}}]\right)$$
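
The pitch conditioning and discrete upsampling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pitch embedding is stood in for by a single linear projection `W` (the source only says the pitch is "projected"), `h` is random data rather than the output of a real FFTr stack, and the helper name `upsample_by_duration` and all numeric values are hypothetical.

```python
import numpy as np

def upsample_by_duration(g: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Repeat row g[i] exactly d[i] times along the time axis."""
    return np.repeat(g, d, axis=0)

rng = np.random.default_rng(0)
n, dim = 3, 4                      # hypothetical: 3 input tokens, hidden size 4
h = rng.normal(size=(n, dim))      # stand-in for h = FFTr(x)
p = np.array([110.0, 0.0, 220.0])  # hypothetical per-token average pitch
W = rng.normal(size=(1, dim))      # assumed linear stand-in for the pitch projection
d = np.array([2, 0, 3])            # integer durations, in output frames

g = h + p[:, None] @ W             # g = h + PitchEmbedding(p), shape (n, dim)
frames = upsample_by_duration(g, d)
print(frames.shape)                # (5, 4): sum(d) frames, each of size dim
```

The upsampled sequence has $\sum_i d_i$ rows, so a token with duration 0 contributes no frames; this sequence is what the second FFTr stack would consume.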

Ground truth $\mathbf{p}$ and $\mathbf{d}$ are used during training, and predicted $\hat{\mathbf{p}}$ and $\hat{\mathbf{d}}$ are used during inference. The model optimizes mean-squared error (MSE) between the predicted and ground-truth modalities:

$$\mathcal{L}=\|\hat{\mathbf{y}}-\mathbf{y}\|_{2}^{2}+\alpha\|\hat{\mathbf{p}}-\mathbf{p}\|_{2}^{2}+\gamma\|\hat{\mathbf{d}}-\mathbf{d}\|_{2}^{2}$$
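
As a rough illustration, the combined objective can be computed as below. The sketch takes the formula literally and sums squared errors for each term (an averaged MSE would differ only by a constant factor per term). The function name `fastpitch_loss` and the example values, including the weights passed for `alpha` and `gamma`, are hypothetical and not taken from the source.

```python
import numpy as np

def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha, gamma):
    """Squared L2 error on mel frames plus weighted pitch and duration terms."""
    mel_term = np.sum((y_hat - y) ** 2)
    pitch_term = np.sum((p_hat - p) ** 2)
    duration_term = np.sum((d_hat - d) ** 2)
    return mel_term + alpha * pitch_term + gamma * duration_term

# Toy example: 5 output frames with 2 mel bins, 3 input tokens.
y_hat, y = np.ones((5, 2)), np.zeros((5, 2))                         # mel term: 10
p_hat, p = np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])      # pitch term: 4
d_hat, d = np.array([2.0, 1.0, 3.0]), np.array([2.0, 0.0, 3.0])      # duration term: 1

loss = fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.5, gamma=2.0)
print(loss)  # 14.0 = 10 + 0.5 * 4 + 2.0 * 1
```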