What is: CTAL?
Source | CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations |
Year | 2000 |
Data Source | CC BY-SA - https://paperswithcode.com |
CTAL is a pre-training framework for strong audio-and-language representations with a Transformer, which aims to learn the intra-modality and inter-modalities connections between audio and language through two proxy tasks on a large amount of audio- and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model is a Transformer for Audio and Language, i.e., CTAL, which consists of two modules, a language stream encoding module which adapts word as input element, and a text-referred audio stream encoder module which accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream