What is: GBlock?
Source | High Fidelity Speech Synthesis with Adversarial Networks |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
GBlock is a type of residual block used in the GAN-TTS text-to-speech architecture; each GBlock is a stack of two residual blocks. Because the generator produces raw audio (e.g. a 2 s training clip corresponds to a sequence of 48000 samples), dilated convolutions are used to ensure that the receptive field of the generator is large enough to capture long-term dependencies. The four kernel size-3 convolutions in each GBlock have increasing dilation factors: 1, 2, 4, 8. Each convolution is preceded by Conditional Batch Normalisation, conditioned on a linear embedding of the noise term in the single-speaker case, or of the concatenation of the noise term and a one-hot representation of the speaker ID in the multi-speaker case. The embeddings are different for each BatchNorm instance.
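To see why increasing dilation factors help, the following minimal sketch computes the receptive field of a stack of stride-1 convolutions. The function name `receptive_field` is hypothetical and chosen for illustration. The result is measured at the resolution the convolutions operate on, and it ignores any upsampling inside the block.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation steps.
        rf += (kernel_size - 1) * d
    return rf


# Dilation factors of one GBlock versus four undilated convolutions.
dilated = receptive_field(3, [1, 2, 4, 8])
plain = receptive_field(3, [1, 1, 1, 1])
print(dilated, plain)  # 31 9
```

With the same number of kernel size-3 layers, the dilated stack covers 31 time steps where an undilated stack covers 9.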
A GBlock contains two skip connections. In GAN-TTS, the first performs upsampling if the output frequency is higher than the input frequency, and it also contains a size-1 convolution if the number of output channels differs from the number of input channels.
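The wiring described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the GAN-TTS implementation. All function and parameter names are hypothetical. The ReLU activation, nearest-neighbour upsampling, the placement of upsampling on the main branch, and computing normalisation statistics over the time axis of a single example are assumptions made here.

```python
import random


def rand_w(rng, *shape):
    """Nested lists of small random weights with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [rand_w(rng, *shape[1:]) for _ in range(shape[0])]


def conv1d(x, w, dilation=1):
    """Zero-padded, length-preserving 1D convolution for odd kernel sizes.
    x has shape [C_in][T]; w has shape [C_out][C_in][K]."""
    t_len, k = len(x[0]), len(w[0][0])
    pad = dilation * (k - 1) // 2
    out = []
    for w_o in w:
        row = []
        for t in range(t_len):
            acc = 0.0
            for c, taps in enumerate(w_o):
                for j, tap in enumerate(taps):
                    idx = t + j * dilation - pad
                    if 0 <= idx < t_len:
                        acc += tap * x[c][idx]
            row.append(acc)
        out.append(row)
    return out


def upsample(x, factor):
    """Nearest-neighbour upsampling along time (assumed; factor 1 is a no-op)."""
    return [[v for v in ch for _ in range(factor)] for ch in x]


def cond_norm_relu(x, cond, w_gamma, w_beta, eps=1e-5):
    """Simplified conditional normalisation followed by ReLU (ReLU is assumed).
    Scale and shift are linear projections of the conditioning vector."""
    out = []
    for c, ch in enumerate(x):
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        gamma = 1.0 + sum(a * b for a, b in zip(w_gamma[c], cond))
        beta = sum(a * b for a, b in zip(w_beta[c], cond))
        out.append([max(0.0, gamma * (v - mean) / (var + eps) ** 0.5 + beta)
                    for v in ch])
    return out


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def init_gblock(rng, c_in, c_out, cond_dim):
    chans = [c_in, c_out, c_out, c_out]  # input channels of each convolution
    return {
        "convs": [rand_w(rng, c_out, c, 3) for c in chans],
        # A separate conditioning projection for every normalisation instance.
        "norms": [(rand_w(rng, c, cond_dim), rand_w(rng, c, cond_dim))
                  for c in chans],
        # Size-1 convolution on the first skip only if channel counts differ.
        "skip": rand_w(rng, c_out, c_in, 1) if c_in != c_out else None,
    }


def gblock(x, cond, p, factor=1):
    convs, norms = p["convs"], p["norms"]
    # First residual block: dilations 1 and 2, skip may upsample and project.
    h = cond_norm_relu(x, cond, *norms[0])
    h = upsample(h, factor)
    h = conv1d(h, convs[0], dilation=1)
    h = cond_norm_relu(h, cond, *norms[1])
    h = conv1d(h, convs[1], dilation=2)
    skip = upsample(x, factor)
    if p["skip"] is not None:
        skip = conv1d(skip, p["skip"])
    x = add(h, skip)
    # Second residual block: dilations 4 and 8, identity skip.
    h = cond_norm_relu(x, cond, *norms[2])
    h = conv1d(h, convs[2], dilation=4)
    h = cond_norm_relu(h, cond, *norms[3])
    h = conv1d(h, convs[3], dilation=8)
    return add(h, x)


rng = random.Random(0)
params = init_gblock(rng, c_in=4, c_out=2, cond_dim=3)
x = rand_w(rng, 4, 10)   # 4 channels, 10 time steps
cond = rand_w(rng, 3)    # stand-in for the noise (and speaker) conditioning
y = gblock(x, cond, params, factor=2)
print(len(y), len(y[0]))  # 2 20
```

In this sketch the second skip connection is a plain identity, since the first residual block has already brought the signal to the output channel count and time resolution.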