
What is: Spatial and Channel-wise Attention-based Convolutional Neural Network?

Source: SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

As CNN features are naturally spatial, channel-wise and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for the task of image captioning and uses an encoder-decoder framework, where a CNN first encodes an input image into a vector and an LSTM then decodes the vector into a sequence of words. Given an input feature map $X$ and the previous time step's LSTM hidden state $h_{t-1} \in \mathbb{R}^d$, a spatial attention mechanism pays more attention to the semantically useful regions, guided by the hidden state $h_{t-1}$. The spatial attention model is:

\begin{align} a(h_{t-1}, X) &= \tanh(Conv_1^{1 \times 1}(X) \oplus W_1 h_{t-1}) \end{align}

\begin{align} \Phi_s(h_{t-1}, X) &= \text{Softmax}(Conv_2^{1 \times 1}(a(h_{t-1}, X))) \end{align}

where $\oplus$ represents the addition of a matrix and a vector, the vector being broadcast over the spatial locations of the matrix.
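
As a concrete reference, the following is a minimal PyTorch sketch of this spatial attention branch. The module name, the attention dimension `attn_dim` and the layer sizes are hypothetical; only the structure (a $1 \times 1$ convolution of $X$, a broadcast addition of the projected hidden state, and a softmax over the spatial locations) follows the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention branch Phi_s(h_{t-1}, X)."""

    def __init__(self, channels: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, attn_dim, kernel_size=1)  # Conv_1^{1x1}
        self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)      # W_1
        self.conv2 = nn.Conv2d(attn_dim, 1, kernel_size=1)         # Conv_2^{1x1}

    def forward(self, X: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # X: (B, C, H, W) feature map; h_prev: (B, hidden_dim) LSTM hidden state h_{t-1}
        b, _, h, w = X.shape
        # a(h_{t-1}, X) = tanh(Conv_1(X) (+) W_1 h_{t-1}); the vector is broadcast over H x W
        a = torch.tanh(self.conv1(X) + self.w1(h_prev)[:, :, None, None])
        # Phi_s: softmax over the H * W spatial locations of Conv_2(a)
        logits = self.conv2(a).view(b, -1)                 # (B, H*W)
        return F.softmax(logits, dim=-1).view(b, 1, h, w)  # spatial attention map
```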

Similarly, channel-wise attention first aggregates global information with global average pooling (GAP), and then computes a channel-wise attention weight vector from the pooled features and the hidden state $h_{t-1}$:

\begin{align} b(h_{t-1}, X) &= \tanh((W_2 \text{GAP}(X) + b_2) \oplus W_1 h_{t-1}) \end{align}

\begin{align} \Phi_c(h_{t-1}, X) &= \text{Softmax}(W_3 b(h_{t-1}, X) + b_3) \end{align}
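
A matching sketch of the channel-wise branch is below. Here GAP is realized as a spatial mean, and each channel's pooled value is embedded with a shared linear layer so that the softmax runs over the channels; these sizes and the exact parameterization of $W_2$ are assumptions for illustration, not necessarily the paper's layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Sketch of the channel-wise attention branch Phi_c(h_{t-1}, X)."""

    def __init__(self, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.w2 = nn.Linear(1, attn_dim)                       # W_2, b_2 (shared across channels)
        self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)  # W_1
        self.w3 = nn.Linear(attn_dim, 1)                       # W_3, b_3

    def forward(self, X: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # X: (B, C, H, W); h_prev: (B, hidden_dim)
        gap = X.mean(dim=(2, 3)).unsqueeze(-1)                 # GAP(X): (B, C, 1)
        # b(h_{t-1}, X) = tanh((W_2 GAP(X) + b_2) (+) W_1 h_{t-1}), broadcast over the C channels
        b = torch.tanh(self.w2(gap) + self.w1(h_prev).unsqueeze(1))  # (B, C, attn_dim)
        # Phi_c: softmax over the C channels
        logits = self.w3(b).squeeze(-1)                        # (B, C)
        return F.softmax(logits, dim=-1)                       # channel-wise attention weights
```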
Overall, the SCA mechanism can be written in one of two ways. If channel-wise attention is applied before spatial attention, we have

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X \Phi_c(h_{t-1}, X)), \Phi_c(h_{t-1}, X)) \end{align}

and if spatial attention comes first:

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X), \Phi_c(h_{t-1}, X \Phi_s(h_{t-1}, X))) \end{align}

where $f(\cdot)$ denotes the modulate function, which takes the feature map $X$ and the attention maps as input and outputs the modulated feature map $Y$.
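
Combining the two branches, the channel-spatial order can be sketched as follows, assuming an element-wise product for the modulate function $f(\cdot)$; the spatial-first order simply swaps the two steps. The helper name and the choice of $f$ are illustrative assumptions.

```python
import torch

def sca_channel_first(X: torch.Tensor, h_prev: torch.Tensor,
                      spatial_attn, channel_attn) -> torch.Tensor:
    """Channel-spatial order: Y = f(X, Phi_s(h, X * Phi_c), Phi_c),
    with f taken to be an element-wise product for this sketch."""
    beta = channel_attn(X, h_prev)            # Phi_c(h_{t-1}, X): (B, C)
    X_c = X * beta[:, :, None, None]          # X modulated by the channel weights
    alpha = spatial_attn(X_c, h_prev)         # Phi_s on the re-weighted map: (B, 1, H, W)
    return X_c * alpha                        # modulated feature map Y
```

For example, with a 2048-channel CNN feature map and a 512-dimensional LSTM state, `SpatialAttention(2048, 512)` and `ChannelAttention(512)` would be instantiated with these (hypothetical) sizes and reused at every decoding step.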

Unlike previous attention mechanisms, which consider each image region equally and use global spatial information to tell the network where to focus, SCA-CNN leverages the semantic vector to produce both the spatial attention map and the channel-wise attention weight vector. Beyond being a powerful attention model, SCA-CNN also provides a better understanding of where and what the model focuses on during sentence generation.