What is: Jukebox?

Jukebox is a model that generates music with singing in the raw audio domain. It handles the long context of raw audio by compressing it to discrete codes with a multi-scale VQ-VAE and then modeling those codes with autoregressive Transformers. It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.

Three separate VQ-VAE models are trained with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors $\mathbf{h}_{t}$, which are then quantized to the closest codebook vectors $\mathbf{e}_{z_{t}}$. The code $z_{t}$ is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio.
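
The quantization step described above can be illustrated with a minimal sketch. It is not the Jukebox implementation: the function names `quantize` and `lookup`, the two-dimensional vectors, and the three-entry codebook are all hypothetical, chosen only to show how each latent $\mathbf{h}_{t}$ is mapped to the index $z_{t}$ of its nearest codebook vector under squared Euclidean distance.

```python
def quantize(latents, codebook):
    """Return, for each latent vector h_t, the index z_t of the nearest codebook vector."""
    codes = []
    for h in latents:
        # Squared Euclidean distance from h_t to every codebook vector e_k.
        dists = [sum((hi - ei) ** 2 for hi, ei in zip(h, e)) for e in codebook]
        # z_t is the index of the smallest distance.
        codes.append(min(range(len(codebook)), key=dists.__getitem__))
    return codes


def lookup(codes, codebook):
    """Replace each code z_t with its codebook vector e_{z_t} (the decoder's input)."""
    return [codebook[z] for z in codes]


# Toy values, assumed for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

codes = quantize(latents, codebook)    # discrete codes z_t
quantized = lookup(codes, codebook)    # codebook vectors e_{z_t}
print(codes)
```

In this toy setting the prior would be trained on `codes`, while a decoder would receive `quantized`. A real VQ-VAE learns both the encoder and the codebook jointly, which this sketch leaves out.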

Source	Jukebox: A Generative Model for Music
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com