
What is: Spatial Attention Module?

Source: CBAM: Convolutional Block Attention Module
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

A Spatial Attention Module is a module for spatial attention in convolutional neural networks. It generates a spatial attention map by utilizing the inter-spatial relationship of features. Unlike channel attention, spatial attention focuses on *where* the informative part is, which is complementary to channel attention. To compute the spatial attention, we first apply average-pooling and max-pooling operations along the channel axis and concatenate them to generate an efficient feature descriptor. On the concatenated feature descriptor, we apply a convolution layer to generate a spatial attention map $\mathbf{M}_{s}(F) \in \mathbb{R}^{H \times W}$ which encodes where to emphasize or suppress.

We aggregate channel information of a feature map by using two pooling operations, generating two 2D maps: $\mathbf{F}^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $\mathbf{F}^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$, which denote the average-pooled and max-pooled features across the channel axis, respectively. These are then concatenated and convolved by a standard convolution layer, producing the 2D spatial attention map. In short, the spatial attention is computed as:

$$\mathbf{M}_{s}(F) = \sigma\left(f^{7 \times 7}\left(\left[\text{AvgPool}(F); \text{MaxPool}(F)\right]\right)\right)$$

$$\mathbf{M}_{s}(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathbf{F}^{s}_{avg}; \mathbf{F}^{s}_{max}\right]\right)\right)$$

where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ represents a convolution operation with a filter size of 7 × 7.
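The computation above can be sketched in plain NumPy. This is a minimal illustration, not the CBAM reference implementation: the function name `spatial_attention` and the randomly initialized 7 × 7 kernel are assumptions for demonstration (in practice the kernel weights are learned during training, and the convolution would be a framework layer such as a 2D convolution with padding 3).

```python
import numpy as np

def spatial_attention(feat, kernel):
    """Compute a spatial attention map from a (C, H, W) feature map.

    feat:   (C, H, W) input features
    kernel: (2, k, k) convolution weights over the [avg; max] descriptor
    returns: (H, W) attention map with values in (0, 1)
    """
    # Pool along the channel axis -> two 2D maps, each (1, H, W)
    avg_pool = feat.mean(axis=0, keepdims=True)   # F^s_avg
    max_pool = feat.max(axis=0, keepdims=True)    # F^s_max
    # Concatenate into the (2, H, W) feature descriptor
    desc = np.concatenate([avg_pool, max_pool], axis=0)

    # "Same" convolution: zero-pad by k//2 so the output stays H x W
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))

    _, H, W = desc.shape
    logits = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)

    # Sigmoid squashes the logits into an attention map M_s(F)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))          # C=8, H=W=16
kernel = rng.standard_normal((2, 7, 7)) * 0.1    # stand-in for learned weights
attn = spatial_attention(feat, kernel)           # (16, 16), values in (0, 1)
```

The resulting map is typically broadcast-multiplied against the input features (`feat * attn`), scaling each spatial location by how informative it is.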