What is: Spatial and Channel-wise Attention-based Convolutional Neural Network?
Source | SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning |
Year | 2017 |
Data Source | CC BY-SA - https://paperswithcode.com |
As CNN features are naturally spatial, channel-wise and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for the task of image captioning and uses an encoder-decoder framework, in which a CNN first encodes an input image into a vector and an LSTM then decodes the vector into a sequence of words. Given an input feature map $X$ and the previous time step's LSTM hidden state $h_{t-1}$, a spatial attention mechanism pays more attention to the semantically useful regions, guided by the hidden state $h_{t-1}$. The spatial attention model is:
\begin{align} a(h_{t-1}, X) &= \tanh(Conv_1^{1 \times 1}(X) \oplus W_1 h_{t-1}) \end{align}
\begin{align}
\Phi_s(h_{t-1}, X) &= \text{Softmax}(Conv_2^{1 \times 1}(a(h_{t-1}, X)))
\end{align}
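To make the shapes concrete, here is a minimal NumPy sketch of the spatial attention step. It treats the two $1 \times 1$ convolutions as channel-mixing matrices applied to the flattened feature map; all weight names and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def spatial_attention(X, h, Wc, W1, w2):
    """Spatial attention sketch: X is (C, M) with M = H*W regions,
    h is the (d,) previous LSTM hidden state. A 1x1 conv over a
    feature map is a linear map on the channel axis, so Conv1
    becomes Wc with shape (k, C) and Conv2 becomes w2 with shape (k,)."""
    a = np.tanh(Wc @ X + (W1 @ h)[:, None])  # (k, M); W1 h is broadcast over regions
    return softmax(w2 @ a)                   # (M,) one attention weight per region

rng = np.random.default_rng(0)
C, M, d, k = 8, 6, 5, 4
X, h = rng.standard_normal((C, M)), rng.standard_normal(d)
alpha = spatial_attention(X, h,
                          rng.standard_normal((k, C)),  # Conv1 as channel mixing
                          rng.standard_normal((k, d)),  # W1
                          rng.standard_normal(k))       # Conv2 down to one channel
# alpha has shape (M,) and sums to 1: a distribution over spatial regions
```

The broadcast addition `(W1 @ h)[:, None]` is exactly the $\oplus$ in the equation above: the same projected hidden state is added at every spatial location.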
where $\oplus$ represents the addition of a matrix and a vector (the vector is broadcast across spatial locations). Similarly, channel-wise attention first aggregates global information via global average pooling (GAP), and then computes a channel-wise attention weight vector guided by the hidden state $h_{t-1}$:
\begin{align}
b(h_{t-1}, X) &= \tanh((W_2\text{GAP}(X)+b_2)\oplus W_1h_{t-1})
\end{align}
\begin{align}
\Phi_c(h_{t-1}, X) &= \text{Softmax}(W_3(b(h_{t-1}, X))+b_3)
\end{align}
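The channel-wise branch can be sketched the same way: pool each channel to a scalar, then score the channels against the hidden state. As before, the weight names and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_attention(X, h, W2, b2, W1, W3, b3):
    """Channel-wise attention sketch: global average pooling over the
    M regions of X (C, M) aggregates each channel into a scalar, then
    one attention weight per channel is computed from the pooled
    vector together with the hidden state h (d,)."""
    gap = X.mean(axis=1)                 # (C,) GAP over spatial locations
    b = np.tanh(W2 @ gap + b2 + W1 @ h)  # (k,) hidden attention representation
    return softmax(W3 @ b + b3)          # (C,) one attention weight per channel

rng = np.random.default_rng(1)
C, M, d, k = 8, 6, 5, 4
X, h = rng.standard_normal((C, M)), rng.standard_normal(d)
beta = channel_attention(X, h,
                         rng.standard_normal((k, C)), rng.standard_normal(k),
                         rng.standard_normal((k, d)),
                         rng.standard_normal((C, k)), rng.standard_normal(C))
# beta is a distribution over the C channels
```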
Overall, the SCA mechanism can be written in one of two ways. If channel-wise attention is applied before spatial attention, we have
\begin{align}
Y &= f(X,\Phi_s(h_{t-1}, X \Phi_c(h_{t-1}, X)), \Phi_c(h_{t-1}, X))
\end{align}
and if spatial attention comes first:
\begin{align}
Y &= f(X,\Phi_s(h_{t-1}, X), \Phi_c(h_{t-1}, X \Phi_s(h_{t-1}, X)))
\end{align}
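Putting the two branches together, the channel-spatial order can be sketched end to end: channel attention reweights $X$, spatial attention is then computed on the reweighted map, and the modulate function $f$ is taken here as elementwise multiplication, one natural choice. Shapes and names remain illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sca_channel_first(X, h, Wc, W1s, w2, W2, b2, W1c, W3, b3):
    """Channel-spatial order of SCA on X (C, M) with hidden state h (d,).
    The modulate function f is taken as elementwise multiplication."""
    # channel-wise attention on the raw feature map
    b = np.tanh(W2 @ X.mean(axis=1) + b2 + W1c @ h)
    beta = softmax(W3 @ b + b3)          # (C,) weight per channel
    Xc = beta[:, None] * X               # channel-modulated features
    # spatial attention on the channel-modulated map
    a = np.tanh(Wc @ Xc + (W1s @ h)[:, None])
    alpha = softmax(w2 @ a)              # (M,) weight per region
    return Xc * alpha[None, :]           # modulated feature map Y, same shape as X

rng = np.random.default_rng(2)
C, M, d, k = 8, 6, 5, 4
X, h = rng.standard_normal((C, M)), rng.standard_normal(d)
Y = sca_channel_first(X, h,
                      rng.standard_normal((k, C)), rng.standard_normal((k, d)),
                      rng.standard_normal(k),
                      rng.standard_normal((k, C)), rng.standard_normal(k),
                      rng.standard_normal((k, d)),
                      rng.standard_normal((C, k)), rng.standard_normal(C))
# Y has the same (C, M) shape as X
```

Swapping the two blocks (computing $\Phi_s$ on the raw $X$ and $\Phi_c$ on the spatially modulated map) gives the spatial-channel order of the second equation.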
where $f(\cdot)$ denotes the modulate function, which takes the feature map $X$ and the attention maps as input and outputs the modulated feature map $Y$.
Unlike previous attention mechanisms, which weight each image region equally and use global spatial information to tell the network where to focus, SCA-CNN leverages the semantic vector to produce both the spatial attention map and the channel-wise attention weight vector. Beyond being a powerful attention model, SCA-CNN also provides a better understanding of where and on what the model focuses during sentence generation.