
What is: Manifold Mixup?

Source: Manifold Mixup: Better Representations by Interpolating Hidden States
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

Manifold Mixup is a regularization method that encourages neural networks to predict less confidently on interpolations of hidden representations. It leverages semantic interpolations as an additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance.

Consider training a deep neural network $f(x) = f_{k}(g_{k}(x))$, where $g_{k}$ denotes the part of the neural network mapping the input data to the hidden representation at layer $k$, and $f_{k}$ denotes the part mapping that hidden representation to the output $f(x)$. Training $f$ using Manifold Mixup is performed in five steps (a code sketch follows the list):

(1) Select a random layer $k$ from a set of eligible layers $S$ in the neural network. This set may include the input layer $g_{0}(x)$.

(2) Process two random data minibatches $(x, y)$ and $(x', y')$ as usual, until reaching layer $k$. This provides us with two intermediate minibatches $(g_{k}(x), y)$ and $(g_{k}(x'), y')$.

(3) Perform Input Mixup on these intermediate minibatches. This produces the mixed minibatch:

$$\left(\tilde{g}_{k}, \tilde{y}\right) = \left(\text{Mix}_{\lambda}\left(g_{k}(x), g_{k}(x')\right), \text{Mix}_{\lambda}\left(y, y'\right)\right),$$

where $\text{Mix}_{\lambda}(a, b) = \lambda \cdot a + (1 - \lambda) \cdot b$. Here, $(y, y')$ are one-hot labels, and the mixing coefficient $\lambda \sim \text{Beta}(\alpha, \alpha)$ as in mixup. For instance, $\alpha = 1.0$ is equivalent to sampling $\lambda \sim U(0, 1)$.

(4) Continue the forward pass in the network from layer $k$ until the output, using the mixed minibatch $(\tilde{g}_{k}, \tilde{y})$.

(5) This output is used to compute the loss value and gradients that update all the parameters of the neural network.
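For concreteness, here is a minimal sketch of steps (1)-(5) in PyTorch. The `MixupMLP` model, its layer split, the `manifold_mixup_step` helper, and all hyperparameter values are illustrative assumptions rather than the reference implementation; the common trick of pairing each example with a shuffled copy of the same minibatch stands in for the second minibatch $(x', y')$.

```python
# Minimal Manifold Mixup sketch (assumed model and hyperparameters, not the official code).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixupMLP(nn.Module):
    """Toy classifier whose forward pass can mix hidden states at a random layer k."""

    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        # Eligible layers S: the input (k = 0) and the outputs of the hidden blocks.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()),
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()),
        ])
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x, x2=None, lam=1.0, k=0):
        """Compute g_k for both inputs, mix the hidden states, then apply f_k."""
        h, h2 = x, x2
        for i, block in enumerate(self.blocks):
            if i == k and h2 is not None:            # reached layer k: mix g_k(x) and g_k(x')
                h = lam * h + (1.0 - lam) * h2
                h2 = None
            h = block(h)
            if h2 is not None:
                h2 = block(h2)
        if h2 is not None:                           # k is the last eligible layer, before the head
            h = lam * h + (1.0 - lam) * h2
        return self.head(h)


def manifold_mixup_step(model, optimizer, x, y, alpha=1.0, n_classes=10):
    """One training step following steps (1)-(5) above."""
    k = np.random.randint(0, len(model.blocks) + 1)  # (1) random eligible layer; k = 0 is the input
    perm = torch.randperm(x.size(0))                 # (2) second minibatch (x', y') as a shuffled copy
    x2, y2 = x[perm], y[perm]
    lam = float(np.random.beta(alpha, alpha))        # (3) lambda ~ Beta(alpha, alpha)

    logits = model(x, x2, lam=lam, k=k)              # (2)-(4) forward pass, mixing hidden states at layer k
    y_mix = (lam * F.one_hot(y, n_classes).float()
             + (1.0 - lam) * F.one_hot(y2, n_classes).float())  # (3) Mix_lambda(y, y') on one-hot labels

    # (5) Loss on the mixed targets; backprop updates all parameters of the network.
    loss = torch.sum(-y_mix * F.log_softmax(logits, dim=1), dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that drawing $k = 0$ mixes the raw inputs, so ordinary Input Mixup is recovered as a special case of this procedure.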