**ClariNet** is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectogram models with a separate waveform [synthesizer](https://paperswithcode.com/method/synthesizer) (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the [WaveNet](https://paperswithcode.com/method/wavenet) module is conditioned on the hidden states instead of the mel-spectogram. The architecture is otherwise based on [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3).

Class activation maps could be used to interpret the prediction decision made by the convolutional neural network (CNN).

Image source: [Learning Deep Features for Discriminative Localization](https://paperswithcode.com/paper/learning-deep-features-for-discriminative)

Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks

ClariNet

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

**SNIPER** is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a chips). This can help to speed up multi-scale training as it operates on low-resolution chips. Due to its memory-efficient design, SNIPER can benefit from [Batch Normalization](https://paperswithcode.com/method/batch-normalization) during training and it makes larger batch-sizes possible for instance-level recognition tasks on a single GPU.

Source	ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: ClariNet?

Viet-Anh on Software