
What is: FastSpeech 2?

Source: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

FastSpeech 2 is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It addresses this problem by 1) training the model directly with ground-truth targets instead of the simplified outputs of a teacher model, and 2) introducing more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Specifically, FastSpeech 2 extracts duration, pitch, and energy from the speech waveform and takes them directly as conditional inputs during training, while using predicted values at inference.
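To make that training/inference split concrete, below is a minimal PyTorch sketch of a variance predictor and adaptor. The class names, hidden sizes, bin counts, and value ranges are illustrative assumptions rather than the paper's exact configuration; the point is that ground-truth pitch/energy condition the hidden sequence during training, and the predictors' own outputs take their place at inference.

```python
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Predicts one scalar per phoneme (e.g. pitch, energy, or log-duration)
    from a hidden sequence of shape (batch, time, hidden)."""

    def __init__(self, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                                 # x: (B, T, H)
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm1(torch.relu(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm2(torch.relu(h)))
        return self.proj(h).squeeze(-1)                   # (B, T)


class VarianceAdaptor(nn.Module):
    """Adds pitch and energy embeddings to the hidden sequence.
    Ground-truth values are used as conditional inputs in training;
    predicted values replace them at inference."""

    def __init__(self, hidden=256, n_bins=256, value_range=(-4.0, 4.0)):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden)
        self.energy_predictor = VariancePredictor(hidden)
        # quantize the continuous values into bins, then look up an embedding
        self.pitch_embed = nn.Embedding(n_bins, hidden)
        self.energy_embed = nn.Embedding(n_bins, hidden)
        lo, hi = value_range                              # assumed normalized range
        self.register_buffer("bins", torch.linspace(lo, hi, n_bins - 1))

    def _embed(self, values, embedding):
        return embedding(torch.bucketize(values, self.bins))

    def forward(self, x, pitch_target=None, energy_target=None):
        pitch_pred = self.pitch_predictor(x)              # (B, T)
        energy_pred = self.energy_predictor(x)            # (B, T)
        # teacher-force ground truth when available (training), else use predictions
        pitch = pitch_target if pitch_target is not None else pitch_pred
        energy = energy_target if energy_target is not None else energy_pred
        x = x + self._embed(pitch, self.pitch_embed)
        x = x + self._embed(energy, self.energy_embed)
        return x, pitch_pred, energy_pred
```

A duration predictor follows the same pattern; its output drives a length regulator that repeats each phoneme hidden state so the sequence length matches the mel-spectrogram.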

The encoder converts the phoneme embedding sequence into a phoneme hidden sequence; the variance adaptor then adds different variance information, such as duration, pitch, and energy, to the hidden sequence; finally, the mel-spectrogram decoder converts the adapted hidden sequence into a mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward Transformer (FFT) block, a stack of self-attention and 1D convolution as in FastSpeech, as the basic structure for the encoder and the mel-spectrogram decoder.
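As a rough illustration of that layout, the sketch below stacks a self-attention layer and a 1D-convolutional feed-forward layer into an FFT block, then wires the encoder and decoder around a placeholder for the variance adaptor. Layer counts, kernel sizes, and other hyperparameters are assumptions for illustration, not the paper's settings; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class FFTBlock(nn.Module):
    """Feed-forward Transformer block: self-attention followed by a
    two-layer 1D-convolutional feed-forward network, with residuals."""

    def __init__(self, hidden=256, heads=2, conv_hidden=1024, kernel=9, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv = nn.Sequential(
            nn.Conv1d(hidden, conv_hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(conv_hidden, hidden, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (B, T, H)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(x + self.dropout(conv_out))
        return x


class FastSpeech2Sketch(nn.Module):
    """Phoneme ids -> encoder -> (variance adaptor) -> decoder -> mel frames.
    The variance adaptor and length regulator are elided here; in the full
    model they expand the sequence to the mel-spectrogram length."""

    def __init__(self, n_phonemes=80, hidden=256, n_mels=80, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.Sequential(*[FFTBlock(hidden) for _ in range(n_layers)])
        self.decoder = nn.Sequential(*[FFTBlock(hidden) for _ in range(n_layers)])
        self.mel_proj = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):                           # phonemes: (B, T) ids
        h = self.encoder(self.embed(phonemes))             # phoneme hidden sequence
        # ... variance adaptor would add duration/pitch/energy here ...
        return self.mel_proj(self.decoder(h))              # (B, T', n_mels)
```

Because every block processes the whole sequence at once, both encoding and mel-spectrogram generation run in parallel rather than autoregressively.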