**Tacotron** is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. The backbone of Tacotron is a seq2seq model with attention. The Figure depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net. At a high-level, the model takes characters as input and produces spectrogram
frames, which are then converted to waveforms.

**Computation Redistribution** is an [neural architecture search](https://paperswithcode.com/task/architecture-search) method for [face detection](https://paperswithcode.com/task/face-detection), which reallocates the computation between the backbone, neck and head of the model based on a predefined search methodology. Directly utilising the backbone of a classification network for scale-specific face detection can be sub-optimal. Therefore, [network structure search](https://paperswithcode.com/method/regnety) is used to reallocate the computation on the backbone, neck and head, under a wide range of flop regimes. The search method is applied to [RetinaNet](https://paperswithcode.com/method/retinanet), with [ResNet](https://paperswithcode.com/method/resnet) as backbone, [Path Aggregation Feature Pyramid Network](https://paperswithcode.com/method/pafpn) (PAFPN)  as the neck and stacked 3 × 3 [convolutional layers](https://paperswithcode.com/method/convolution) for the head. While the general structure is simple, the total number of possible networks in the search space is unwieldy. In the first step, the authors explore the reallocation of the computation within the backbone parts (i.e. stem, C2, C3, C4, and C5), while fixing the neck and head components. Based on the optimised computation distribution on the backbone they find, they further explore the reallocation of the computation across the backbone, neck and head.

Computation Redistribution

Sample and Computation Redistribution for Efficient Face Detection

Tacotron

Tacotron: Towards End-to-End Speech Synthesis

**Video-Audio-Text Transformer**, or **VATT**, is a framework for learning multimodal representations from unlabeled data using [convolution](https://paperswithcode.com/method/convolution)-free [Transformer](https://paperswithcode.com/method/transformer) architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture from [BERT](https://paperswithcode.com/method/bert) and [ViT](https://paperswithcode.com/method/vision-transformer) except the layer of tokenization and linear projection reserved for each modality separately. The design follows the same spirit as ViT that makes the minimal changes to the architecture so that the learned model can transfer its weights to various frameworks and tasks.

VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the granularity of different modalities and noise contrastive estimation is employed to train the model.

Source	Tacotron: Towards End-to-End Speech Synthesis
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: Tacotron?

Viet-Anh on Software