The **MLP-Mixer** architecture (or “Mixer” for short) is an image architecture that doesn't use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

It accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a “patches × channels” table as an input, and maintains this dimensionality. Mixer makes use of two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. These two types of layers are interleaved to enable interaction of both input dimensions.
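The interleaving of the two MLP types can be illustrated with a minimal NumPy sketch of one Mixer block operating on a "patches × channels" table. The function names (`mlp`, `mixer_block`, `init_mlp`), the random initialization, and the hidden widths are hypothetical choices made for illustration. The GELU nonlinearity, the skip-connections, and the (affine-free) layer normalization are assumptions not described in the text above; they follow the usual Mixer formulation, but this is a sketch rather than the reference implementation.

```python
import numpy as np

def gelu(x):
    # Scalar nonlinearity (tanh approximation of GELU); an assumed choice.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalizes each row over the channel axis (no learnable affine here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron applied independently to every row of x.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Hypothetical helper: small random weights, zero biases.
    return (rng.normal(scale=0.02, size=(dim, hidden)), np.zeros(hidden),
            rng.normal(scale=0.02, size=(hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    # x has shape (patches, channels).
    # Token mixing: transpose so the MLP runs along the patch axis,
    # i.e. it takes individual columns of the table as inputs.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel mixing: the MLP takes individual rows (tokens) as inputs.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

rng = np.random.default_rng(0)
table = rng.normal(size=(16, 8))            # 16 patches, 8 channels
token_params = init_mlp(rng, dim=16, hidden=32)
channel_params = init_mlp(rng, dim=8, hidden=24)
out = mixer_block(table, token_params, channel_params)
print(out.shape)  # (16, 8): the "patches × channels" shape is maintained
```

Note that the only operations involved are matrix multiplications, a transpose, and an elementwise nonlinearity, which matches the description of Mixer given above.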

**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. 

[Instance Normalization](https://paperswithcode.com/method/instance-normalization) normalizes the input to a single style specified by the affine parameters. Adaptive Instance Normalization (AdaIN) extends this idea. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https://paperswithcode.com/method/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https://paperswithcode.com/method/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:

$$
\textrm{AdaIN}(x, y)= \sigma(y)\left(\frac{x-\mu(x)}{\sigma(x)}\right)+\mu(y)
$$
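
As a minimal sketch of the formula above, the following NumPy function computes AdaIN for a single content feature map and a single style feature map, each laid out as (channels, height, width). The function name `adain`, the array layout, and the small `eps` added to $\sigma(x)$ for numerical stability are assumptions for illustration; `eps` does not appear in the formula.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    # x: content features, y: style features, both (channels, height, width).
    # Statistics are computed per channel over the spatial axes.
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sigma_x = x.std(axis=(1, 2), keepdims=True)
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sigma_y = y.std(axis=(1, 2), keepdims=True)
    # eps is an assumed safeguard against division by zero.
    return sigma_y * (x - mu_x) / (sigma_x + eps) + mu_y

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(3, 8, 8))
style = rng.normal(loc=2.0, scale=0.5, size=(3, 5, 7))
stylized = adain(content, style)
```

After the call, each channel of `stylized` has (up to the effect of `eps`) the mean and standard deviation of the corresponding channel of `style`, while its spatial shape remains that of `content`. The content and style inputs need not share a spatial size, since only per-channel statistics of the style input are used.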

Adaptive Instance Normalization

Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

MLP-Mixer

MLP-Mixer: An all-MLP Architecture for Vision

Inspired by the success of ResNet,
Wang et al. proposed
the very deep convolutional residual attention network (RAN) by 
combining an attention mechanism with residual connections. 

Each attention module stacked in a residual attention network 
can be divided into a mask branch and a trunk branch. 
The trunk branch processes features,
and can be implemented by any state-of-the-art structure
including a pre-activation residual unit and an inception block.
The mask branch uses a bottom-up top-down structure
to learn a mask of the same size that 
softly weights output features from the trunk branch. 
A sigmoid layer normalizes the output to $[0,1]$ after two $1\times 1$ convolution layers. Overall the residual attention mechanism can be written as

\begin{align}
s &= \sigma(\text{Conv}_{2}^{1\times 1}(\text{Conv}_{1}^{1\times 1}( h_\text{up}(h_\text{down}(X))))) \\
X_\text{out} &= s f(X) + f(X)
\end{align}
where $\sigma$ denotes the sigmoid function and
$h_\text{down}$ is the bottom-up part,
which applies max-pooling several times after residual units
to downsample the features and increase the receptive field, while
$h_\text{up}$ is the top-down part, which uses
linear interpolation to restore the output to the
same size as the input feature map.
There are also skip-connections between the two parts,
which are omitted from the formulation.
$f$ represents the trunk branch,
which can be any state-of-the-art structure.
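
The formulation above can be sketched in NumPy for a single feature map of shape (channels, height, width). This is a simplified illustration under several assumptions: a single 2×2 max-pooling step stands in for the repeated pooling after residual units, nearest-neighbour upsampling is swapped in for linear interpolation to keep the code short, the skip-connections between the two parts are omitted as in the formula, and the trunk branch is a hypothetical stand-in (a ReLU). All function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    # A 1x1 convolution applies the same linear map over channels at every pixel.
    # x: (c_in, height, width), w: (c_out, c_in)
    return np.einsum("oc,chw->ohw", w, x)

def max_pool_2x2(x):
    # Downsampling stage; assumes even height and width.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_2x(x):
    # Upsampling stage; nearest-neighbour repeat used here in place of
    # linear interpolation (a simplification).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def residual_attention(x, trunk, w1, w2):
    # Mask branch: downsample, upsample, two 1x1 convolutions, sigmoid.
    s = sigmoid(conv1x1(conv1x1(upsample_2x(max_pool_2x2(x)), w1), w2))
    # Trunk branch output, softly weighted by the mask plus an identity term.
    t = trunk(x)
    return s * t + t, s

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
w1 = rng.normal(scale=0.1, size=(4, 4))
w2 = rng.normal(scale=0.1, size=(4, 4))
relu_trunk = lambda x: np.maximum(x, 0.0)   # hypothetical trunk branch f
out, mask = residual_attention(features, relu_trunk, w1, w2)
```

Because the mask lies in $[0,1]$, the output equals $(1 + s)\,f(X)$: the mask can amplify trunk features by up to a factor of two but never suppresses them below the trunk output itself.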

Inside each attention module, a
bottom-up top-down feedforward structure models
both spatial and cross-channel dependencies, 
 leading to a consistent performance improvement. 
Residual attention can be incorporated into
any deep network structure in an end-to-end training fashion.
However, the proposed bottom-up top-down structure fails to leverage global spatial information.
Furthermore, directly predicting a 3D attention map has high computational cost.

Source	MLP-Mixer: An all-MLP Architecture for Vision
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com