**Primal Wasserstein Imitation Learning**, or **PWIL**, is an imitation learning method based on the primal form of the Wasserstein distance between the expert and agent state-action distributions. Its reward function is derived offline, whereas recent adversarial IL algorithms learn a reward function through interactions with the environment, and it requires little fine-tuning.
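
For reference, the primal form mentioned above is the standard optimal-transport definition of the Wasserstein distance. The notation here is chosen for illustration and is not taken from the original text: $\mu$ and $\nu$ stand for the agent and expert state-action distributions, $d$ is a metric on state-action pairs, and $\Theta(\mu, \nu)$ is the set of couplings (joint distributions) whose marginals are $\mu$ and $\nu$.

$$ \mathcal{W}\_{p}^{p}(\mu, \nu) = \inf\_{\theta \in \Theta(\mu, \nu)} \int d(x, y)^{p} \mathrm{d}\theta(x, y) $$

The infimum is taken directly over couplings, which is what distinguishes the primal form from the dual form used by adversarial approaches.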

**AdaMax** is a generalisation of [Adam](https://paperswithcode.com/method/adam) from the $l\_{2}$ norm to the $l\_{\infty}$ norm. Define:

$$ u\_{t} = \beta^{\infty}\_{2}v\_{t-1} + \left(1-\beta^{\infty}\_{2}\right)|g\_{t}|^{\infty}$$

$$ = \max\left(\beta\_{2}\cdot{v}\_{t-1}, |g\_{t}|\right)$$

Plugging this into the Adam update equation, replacing $\sqrt{\hat{v}\_{t}} + \epsilon$ with $u\_{t}$, gives the AdaMax update rule:

$$ \theta\_{t+1} = \theta\_{t} - \frac{\eta}{u\_{t}}\hat{m}\_{t} $$

Common default values are $\eta = 0.002$, $\beta\_{1}=0.9$, and $\beta\_{2}=0.999$.
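
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `adamax_step`, the list-based parameter layout, and the guard against a zero denominator are assumptions made for this example. The running value `u` plays the role of the previous term inside the $\max$, and `m` is the first-moment estimate with the same bias correction as in Adam.

```python
def adamax_step(theta, grad, m, u, t, eta=0.002, beta1=0.9, beta2=0.999):
    """Apply one AdaMax update to lists of floats and return (theta, m, u).

    t is the 1-based step count used for the bias correction of m.
    """
    new_theta, new_m, new_u = [], [], []
    for th, g, m_i, u_i in zip(theta, grad, m, u):
        m_i = beta1 * m_i + (1.0 - beta1) * g   # first-moment estimate, as in Adam
        u_i = max(beta2 * u_i, abs(g))          # infinity-norm term u_t
        m_hat = m_i / (1.0 - beta1 ** t)        # bias-corrected first moment
        # Assumption: skip the update while u_i is zero (every gradient so far was zero).
        step = eta * m_hat / u_i if u_i > 0.0 else 0.0
        new_theta.append(th - step)
        new_m.append(m_i)
        new_u.append(u_i)
    return new_theta, new_m, new_u


# Toy usage: minimise f(x) = x**2, whose gradient is 2 * x.
theta, m, u = [1.0], [0.0], [0.0]
for t in range(1, 101):
    grad = [2.0 * x for x in theta]
    theta, m, u = adamax_step(theta, grad, m, u, t)
```

Because the step is the ratio $\hat{m}\_{t} / u\_{t}$ scaled by $\eta$, each parameter moves by roughly at most $\eta$ per iteration in this toy run.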

AdaMax was introduced in the paper *Adam: A Method for Stochastic Optimization*.

As CNN features are naturally spatial, channel-wise and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for the task of image captioning, and uses an encoder-decoder framework where a CNN first encodes an input image into a vector and then an LSTM decodes the vector into a sequence of words. Given an input feature map $X$ and the LSTM hidden state from the previous time step, $h_{t-1} \in \mathbb{R}^d$, a spatial attention mechanism pays more attention to the semantically useful regions, guided by $h_{t-1}$. The spatial attention model is:

\begin{align}
a(h_{t-1}, X) &= \tanh(\text{Conv}_1^{1 \times 1}(X) \oplus W_1 h_{t-1})
\end{align}

\begin{align}
\Phi_s(h_{t-1}, X) &= \text{Softmax}(\text{Conv}_2^{1 \times 1}(a(h_{t-1}, X)))
\end{align}

where $\oplus$ represents the addition of a matrix and a vector, with the vector added to each column of the matrix. Similarly, channel-wise attention first aggregates global information, and then computes a channel-wise attention weight vector using the hidden state $h_{t-1}$:
\begin{align}
b(h_{t-1}, X) &= \tanh((W_2\text{GAP}(X)+b_2)\oplus W_1h_{t-1})
\end{align}
\begin{align}
\Phi_c(h_{t-1}, X) &= \text{Softmax}(W_3(b(h_{t-1}, X))+b_3)    
\end{align}
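
The two attention maps defined above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the feature map is stored as a `C x N` list of lists (channels by flattened spatial locations), each $1 \times 1$ convolution is written as a per-location linear map, biases of the convolutions are omitted, and all function names, argument names, and weight shapes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def spatial_attention(X, h, W_s, W_1, w_out):
    """Phi_s: one weight per spatial location.

    W_s (k x C) stands in for Conv_1^{1x1}, W_1 (k x d) projects h,
    and w_out (length k) stands in for Conv_2^{1x1}.
    """
    h_proj = matvec(W_1, h)                      # same vector added at every location
    scores = []
    for j in range(len(X[0])):
        column = [channel[j] for channel in X]   # features at location j
        a_j = [math.tanh(p + q) for p, q in zip(matvec(W_s, column), h_proj)]
        scores.append(sum(w * a for w, a in zip(w_out, a_j)))
    return softmax(scores)


def channel_attention(X, h, W_2, b_2, W_1, W_3, b_3):
    """Phi_c: one weight per channel, computed from globally pooled features."""
    gap = [sum(channel) / len(channel) for channel in X]    # GAP(X), length C
    pooled = [p + b for p, b in zip(matvec(W_2, gap), b_2)]
    b_vec = [math.tanh(p + q) for p, q in zip(pooled, matvec(W_1, h))]
    return softmax([s + b for s, b in zip(matvec(W_3, b_vec), b_3)])


# Toy inputs: C = 2 channels, N = 3 locations, d = 2, k = 2 (arbitrary values).
X = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
h = [0.5, -0.5]
W_1 = [[0.2, 0.0], [0.0, 0.2]]
phi_s = spatial_attention(X, h, [[0.1, 0.2], [0.3, -0.1]], W_1, [1.0, -1.0])
phi_c = channel_attention(X, h, [[0.5, 0.1], [-0.2, 0.4]], [0.0, 0.1],
                          W_1, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Here `phi_s` has one entry per location and `phi_c` one entry per channel, and each sums to one because of the softmax.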
Overall, the SCA mechanism can be written in one of two ways. If channel-wise attention is applied before spatial attention, we have
\begin{align}
Y &= f(X,\Phi_s(h_{t-1}, X \Phi_c(h_{t-1}, X)), \Phi_c(h_{t-1}, X)) 
\end{align}
and if spatial attention comes first:
\begin{align}
Y &= f(X,\Phi_s(h_{t-1}, X), \Phi_c(h_{t-1}, X \Phi_s(h_{t-1}, X)))
\end{align}
where $f(\cdot)$ denotes the modulation function, which takes the feature map $X$ and the attention maps as input and outputs the modulated feature map $Y$.
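
The difference between the two orderings can be made concrete with a short sketch. It assumes that the modulation function $f$ is elementwise multiplication of the feature map by the attention weights, which the text above does not specify, and it takes the attention functions as arguments so that any implementation of $\Phi_c$ and $\Phi_s$ can be plugged in. The stand-in functions `toy_phi_c` and `toy_phi_s` are hypothetical, ignore the hidden state, and exist only to make the example runnable.

```python
def sca_channel_first(X, h, phi_c, phi_s):
    """Channel-wise attention first, then spatial attention on the reweighted map."""
    beta = phi_c(h, X)                                     # C channel weights
    X_c = [[b * x for x in channel] for b, channel in zip(beta, X)]
    alpha = phi_s(h, X_c)                                  # N weights from X * Phi_c
    return [[a * x for a, x in zip(alpha, channel)] for channel in X_c]


def sca_spatial_first(X, h, phi_c, phi_s):
    """Spatial attention first, then channel-wise attention on the reweighted map."""
    alpha = phi_s(h, X)                                    # N location weights
    X_s = [[a * x for a, x in zip(alpha, channel)] for channel in X]
    beta = phi_c(h, X_s)                                   # C weights from X * Phi_s
    return [[b * x for x in channel] for b, channel in zip(beta, X_s)]


def toy_phi_c(h, X):
    """Stand-in channel attention: weights proportional to each channel's sum."""
    sums = [sum(channel) for channel in X]
    return [s / sum(sums) for s in sums]


def toy_phi_s(h, X):
    """Stand-in spatial attention: weights proportional to each location's sum."""
    sums = [sum(channel[j] for channel in X) for j in range(len(X[0]))]
    return [s / sum(sums) for s in sums]


X = [[1.0, 3.0], [2.0, 2.0]]   # C = 2 channels, N = 2 locations
h = [0.0]                      # unused by the toy attention functions
Y_channel_first = sca_channel_first(X, h, toy_phi_c, toy_phi_s)
Y_spatial_first = sca_spatial_first(X, h, toy_phi_c, toy_phi_s)
```

Even with these simple stand-ins the two outputs differ, because the second attention map is computed from a feature map that the first one has already reweighted.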

Unlike previous attention mechanisms, which consider each image region equally and use global spatial information to tell the network where to focus, SCA-CNN leverages the semantic vector to produce both the spatial attention map and the channel-wise attention weight vector. Beyond being a powerful attention model, SCA-CNN also provides a better understanding of where and what the model should focus on during sentence generation.

Source	Primal Wasserstein Imitation Learning
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software
