What is: GBlock?

Source: High Fidelity Speech Synthesis with Adversarial Networks
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

GBlock is a type of residual block used in the GAN-TTS text-to-speech architecture - it is a stack of two residual blocks. As the generator is producing raw audio (e.g. a 2s training clip corresponds to a sequence of 48000 samples), dilated convolutions are used to ensure that the receptive field of $G$ is large enough to capture long-term dependencies. The four kernel size-3 convolutions in each GBlock have increasing dilation factors: 1, 2, 4, 8. Convolutions are preceded by Conditional Batch Normalisation, conditioned on the linear embeddings of the noise term $z \sim N\left(0, \mathbf{I}_{128}\right)$ in the single-speaker case, or the concatenation of $z$ and a one-hot representation of the speaker ID in the multi-speaker case. The embeddings are different for each BatchNorm instance.
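To make the effect of the dilation factors concrete, the following is a minimal sketch that computes the receptive field of a stack of dilated convolutions. The function name `receptive_field` and the constants are illustrative and not taken from the paper, and the calculation assumes stride-1 convolutions measured at the resolution at which they operate (it ignores any upsampling).

```python
def receptive_field(kernel_size: int, dilations) -> int:
    """Receptive field (in samples) of stacked stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation samples.
        field += (kernel_size - 1) * dilation
    return field


GBLOCK_DILATIONS = (1, 2, 4, 8)  # dilation factors stated in the text
KERNEL_SIZE = 3                  # kernel size stated in the text

gblock_field = receptive_field(KERNEL_SIZE, GBLOCK_DILATIONS)
undilated_field = receptive_field(KERNEL_SIZE, (1, 1, 1, 1))

# A 2 s clip of 48000 samples implies a sample rate of 24000 Hz.
sample_rate_hz = 48000 // 2

print(gblock_field)     # 31
print(undilated_field)  # 9
```

With the same four convolutions and no dilation, the field would cover only 9 samples, so the dilation pattern more than triples the context seen by a single GBlock without adding parameters.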

A GBlock contains two skip connections. In GAN-TTS, the first performs upsampling if the output frequency is higher than the input frequency, and it also contains a size-1 convolution if the number of output channels differs from the number of input channels.
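The layout described above can be sketched as a small planning function that lists the operations on each branch. This is an illustration under stated assumptions, not the paper's implementation: the names `GBlockPlan` and `plan_gblock` are hypothetical, the ReLU after each normalisation and the position of the upsampling on the main branch are assumptions, and the second skip connection is read as a plain identity.

```python
from dataclasses import dataclass, field
from typing import List

DILATIONS = (1, 2, 4, 8)  # four kernel size-3 convolutions, as stated in the text


@dataclass
class GBlockPlan:
    """Hypothetical container: op names for the two residual halves of a GBlock."""
    main: List[List[str]] = field(default_factory=list)
    skip: List[List[str]] = field(default_factory=list)


def plan_gblock(in_ch: int, out_ch: int, upsample_factor: int = 1) -> GBlockPlan:
    plan = GBlockPlan()
    for half in range(2):
        first_half = half == 0
        src_ch = in_ch if first_half else out_ch
        # Only the first half changes the temporal resolution.
        upsample = first_half and upsample_factor > 1

        main: List[str] = []
        ch = src_ch
        for i, dilation in enumerate(DILATIONS[2 * half: 2 * half + 2]):
            # Each convolution is preceded by Conditional Batch Normalisation.
            # Assumption: a ReLU follows the normalisation.
            main += ["cond_batchnorm", "relu"]
            if upsample and i == 0:
                # Assumption: upsampling sits just before the first convolution.
                main.append(f"upsample x{upsample_factor}")
            main.append(f"conv k=3 d={dilation} {ch}->{out_ch}")
            ch = out_ch

        skip: List[str] = []
        if upsample:
            skip.append(f"upsample x{upsample_factor}")
        if src_ch != out_ch:
            # A size-1 convolution matches the channel count on the skip path.
            skip.append(f"conv k=1 {src_ch}->{out_ch}")

        plan.main.append(main)
        plan.skip.append(skip or ["identity"])
    return plan


plan = plan_gblock(in_ch=768, out_ch=384, upsample_factor=2)
for half, (main_ops, skip_ops) in enumerate(zip(plan.main, plan.skip), start=1):
    print(f"half {half} main: {main_ops}")
    print(f"half {half} skip: {skip_ops}")
```

When the channel counts match and no upsampling is requested, both skip paths reduce to the identity, which is the ordinary residual-block case.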