What is: DVD-GAN?
Source | Adversarial Video Generation on Complex Datasets |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
DVD-GAN is a generative adversarial network for video generation built upon the BigGAN architecture.
DVD-GAN uses two discriminators: a Spatial Discriminator $\mathcal{D}_S$ and a Temporal Discriminator $\mathcal{D}_T$. $\mathcal{D}_S$ critiques single-frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually. $\mathcal{D}_T$ must provide the generator $\mathcal{G}$ with the learning signal to generate movement (which is not evaluated by $\mathcal{D}_S$).
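As a rough illustration of this division of labour, the sketch below (PyTorch-style, with hypothetical tensor shapes and a uniform-without-replacement sampling assumption) routes $k$ randomly sampled full-resolution frames to the spatial discriminator while the temporal discriminator receives the whole clip. It is a minimal sketch, not the authors' implementation.

```python
import torch

def split_discriminator_inputs(video: torch.Tensor, k: int = 8):
    """Illustrative sketch: `video` has shape [B, T, C, H, W].

    The spatial discriminator judges k randomly sampled full-resolution
    frames individually, while the temporal discriminator receives the
    whole clip so it can penalise implausible motion.
    """
    b, t, c, h, w = video.shape
    # Sample k frame indices per clip (assumed uniform, without replacement).
    idx = torch.stack([torch.randperm(t)[:k] for _ in range(b)])       # [B, k]
    frames = video[torch.arange(b).unsqueeze(1), idx]                  # [B, k, C, H, W]
    spatial_input = frames.reshape(b * k, c, h, w)   # per-frame inputs for D_S
    temporal_input = video                           # full clip for D_T
    return spatial_input, temporal_input
```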
The input to $\mathcal{G}$ consists of a Gaussian latent noise $z \sim \mathcal{N}(0, I)$ and a learned linear embedding $e(y)$ of the desired class $y$. Both inputs are 120-dimensional vectors. $\mathcal{G}$ starts by computing an affine transformation of $[z; e(y)]$ to a $[4, 4, ch_0]$-shaped tensor. $[z; e(y)]$ is also used as the input to all class-conditional Batch Normalization layers throughout $\mathcal{G}$. The resulting tensor is then treated as the input (at each frame we would like to generate) to a Convolutional GRU.
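A minimal sketch of how this input stage could look, assuming 120-dimensional noise and class embedding and a hypothetical initial channel count `ch0`; the class-conditional Batch Normalization layers and the Convolutional GRU are not implemented here, only the tensors they would consume.

```python
import torch
import torch.nn as nn

class GeneratorInput(nn.Module):
    """Sketch of DVD-GAN's generator input stage (shapes are assumptions)."""

    def __init__(self, num_classes: int, latent_dim: int = 120, ch0: int = 512):
        super().__init__()
        self.embed = nn.Embedding(num_classes, latent_dim)   # learned class embedding e(y)
        # Affine map from [z; e(y)] to a 4x4xch0 tensor.
        self.to_grid = nn.Linear(2 * latent_dim, 4 * 4 * ch0)
        self.ch0 = ch0

    def forward(self, z: torch.Tensor, y: torch.Tensor):
        # [B, 240] conditioning vector, also fed to class-conditional BatchNorm layers
        cond = torch.cat([z, self.embed(y)], dim=1)
        # [B, ch0, 4, 4] tensor that seeds the Convolutional GRU at every frame
        grid = self.to_grid(cond).view(-1, self.ch0, 4, 4)
        return grid, cond
```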
This RNN is unrolled once per frame. Its output is processed by two residual blocks; the time dimension is folded into the batch dimension here, so each frame proceeds through the blocks independently. These blocks double the width and height of each frame (upsampling is skipped in the first block). This is repeated a number of times, with the output of one RNN + residual group fed as the input to the next group, until the output tensors have the desired spatial dimensions.
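One such RNN + residual group could be sketched as below. `gru_cell` and `res_blocks` are hypothetical placeholders for the paper's Convolutional GRU cell and its BigGAN-style upsampling residual blocks, and the shapes are illustrative only.

```python
import torch

def generate_features(grid, cond, num_frames, gru_cell, res_blocks):
    """Unroll a ConvGRU once per frame, then run each frame through
    residual blocks with time folded into the batch dimension.

    grid:       [B, C, H, W] tensor from the previous stage
    cond:       [B, D] conditioning vector (unused here; in the full model it
                would drive class-conditional BatchNorm inside the blocks)
    gru_cell:   hypothetical ConvGRU cell, (input, state) -> new state
    res_blocks: hypothetical stack of residual blocks that doubles the
                spatial resolution (upsampling skipped in the first block)
    """
    b, c, h, w = grid.shape
    state = torch.zeros_like(grid)            # assumed zero initial state
    frames = []
    for _ in range(num_frames):               # RNN unrolled once per frame
        state = gru_cell(grid, state)
        frames.append(state)
    x = torch.stack(frames, dim=1)            # [B, T, C, H, W]
    x = x.reshape(b * num_frames, c, h, w)    # fold time into batch
    x = res_blocks(x)                         # each frame processed independently
    _, c2, h2, w2 = x.shape
    return x.reshape(b, num_frames, c2, h2, w2)
```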
The Spatial Discriminator $\mathcal{D}_S$ functions almost identically to BigGAN's discriminator. A score is calculated for each of the $k$ uniformly sampled frames (default $k = 8$), and the output of $\mathcal{D}_S$ is the sum over per-frame scores. The Temporal Discriminator $\mathcal{D}_T$ has a similar architecture, but pre-processes the real or generated video with a $2 \times 2$ average-pooling downsampling function $\phi$. Furthermore, the first two residual blocks of $\mathcal{D}_T$ are 3-D, where every convolution is replaced with a 3-D convolution with a kernel size of $3 \times 3 \times 3$. The rest of the architecture follows BigGAN.
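A rough sketch of these two scoring rules, under the assumptions stated in the comments: the spatial score is a sum over $k$ per-frame scores from a placeholder 2-D backbone, and the temporal discriminator first applies the $2 \times 2$ average pool $\phi$ before its 3-D convolutional stem. The remaining BigGAN-style blocks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_score(frame_backbone, frames):
    """frames: [B, k, C, H, W]; returns the sum of per-frame scores.
    `frame_backbone` is a placeholder BigGAN-style 2-D discriminator
    that maps a frame to a single score."""
    b, k, c, h, w = frames.shape
    scores = frame_backbone(frames.reshape(b * k, c, h, w))   # [B*k, 1]
    return scores.view(b, k).sum(dim=1)                       # sum over the k frames

class TemporalStem(nn.Module):
    """Sketch of D_T's preprocessing: 2x2 spatial average pooling (phi)
    followed by 3-D convolutions with 3x3x3 kernels; the residual
    connections and remaining 2-D blocks are omitted."""

    def __init__(self, in_ch: int = 3, ch: int = 64):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(ch, ch, kernel_size=3, padding=1)

    def forward(self, video):                  # video: [B, C, T, H, W]
        # phi: 2x2 average pooling over the spatial dimensions of every frame
        video = F.avg_pool3d(video, kernel_size=(1, 2, 2))
        x = F.relu(self.conv1(video))
        return F.relu(self.conv2(x))
```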