What is: CBHG?

Source: Tacotron: Towards End-to-End Speech Synthesis
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

CBHG is a building block used in the Tacotron text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit (BiGRU).

The module is used to extract representations from sequences. The input sequence is first convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C_k$ filters of width $k$ (i.e. $k = 1, 2, \dots, K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to $K$-grams). The convolution outputs are stacked together and then max-pooled along time to increase local invariance. A stride of 1 is used to preserve the original time resolution. The processed sequence is then passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. Batch normalization is used for all convolutional layers. The convolution outputs are fed into a multi-layer highway network to extract high-level features. Finally, a bidirectional GRU is stacked on top to extract sequential features from both forward and backward context.
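The data flow described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the Tacotron implementation: the sizes `T`, `D`, `K`, and `C`, the random weights, the pooling width of 2, the projection width of 3, and the ReLU/sigmoid choices in the highway layer are all assumptions made for the example. Batch normalization, activations on the convolutions, and the final bidirectional GRU are left out to keep the sketch short.

```python
import numpy as np

def conv1d_same(x, w):
    """1-D convolution over time, stride 1, zero-padded to keep length T.

    x: (T, C_in) input sequence; w: (k, C_in, C_out) filter weights.
    """
    k = w.shape[0]
    left = (k - 1) // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    # Each output frame is a weighted sum over a window of k input frames.
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def conv_bank(x, bank):
    """Apply the K filter sets (widths 1..K) and stack them along channels."""
    return np.concatenate([conv1d_same(x, w) for w in bank], axis=1)

def max_pool_time(x, width=2):
    """Max pooling along time with stride 1, so the output keeps length T."""
    xp = np.pad(x, ((0, width - 1), (0, 0)), constant_values=-np.inf)
    return np.stack([xp[t:t + width].max(axis=0) for t in range(x.shape[0])])

def highway(x, w_h, b_h, w_t, b_t):
    """One highway layer: gate * transform(x) + (1 - gate) * x."""
    h = np.maximum(x @ w_h + b_h, 0.0)           # candidate transform (ReLU)
    t = 1.0 / (1.0 + np.exp(-(x @ w_t + b_t)))   # transform gate (sigmoid)
    return t * h + (1.0 - t) * x

rng = np.random.default_rng(0)
T, D, K, C = 7, 4, 3, 5   # hypothetical sizes; here C_k = C for every k
x = rng.standard_normal((T, D))

# The k-th filter set has C filters of width k.
bank = [0.1 * rng.standard_normal((k, D, C)) for k in range(1, K + 1)]
stacked = conv_bank(x, bank)        # (T, K * C)
pooled = max_pool_time(stacked)     # (T, K * C)

# Fixed-width projection back to D channels, then the residual connection.
proj_w = 0.1 * rng.standard_normal((3, K * C, D))
residual = conv1d_same(pooled, proj_w) + x   # (T, D)

w_h = rng.standard_normal((D, D))
w_t = rng.standard_normal((D, D))
out = highway(residual, w_h, np.zeros(D), w_t, np.zeros(D))

print(stacked.shape, pooled.shape, residual.shape, out.shape)
# (7, 15) (7, 15) (7, 4) (7, 4)
```

Every stage keeps the time dimension at `T`, which is the point of using stride 1 throughout. The projection maps back to `D` channels so that the residual sum with the input is well defined.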