
What is: Bottleneck Attention Module?

Source: BAM: Bottleneck Attention Module
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Park et al. proposed the bottleneck attention module (BAM), aiming to efficiently improve the representational capability of networks. It uses dilated convolutions to enlarge the receptive field of the spatial attention sub-module and builds a bottleneck structure, as suggested by ResNet, to save computational cost.

For a given input feature map $X$, BAM infers the channel attention $s_c \in \mathbb{R}^C$ and the spatial attention $s_s \in \mathbb{R}^{H\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\mathbb{R}^{C\times H\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as
\begin{align}
s_c &= \text{BN}(W_2(W_1\,\text{GAP}(X)+b_1)+b_2) \\
s_s &= \text{BN}(\text{Conv}_2^{1\times 1}(DC_2^{3\times 3}(DC_1^{3\times 3}(\text{Conv}_1^{1\times 1}(X))))) \\
s &= \sigma(\text{Expand}(s_s)+\text{Expand}(s_c)) \\
Y &= s X + X
\end{align}
where $W_i$ and $b_i$ denote the weights and biases of the fully connected layers, $\text{Conv}_1^{1\times 1}$ and $\text{Conv}_2^{1\times 1}$ are convolution layers used for channel reduction, $DC_i^{3\times 3}$ denotes a dilated convolution with a $3\times 3$ kernel, applied to utilize contextual information effectively, and $\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\mathbb{R}^{C\times H\times W}$.
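To make the two branches concrete, here is a minimal PyTorch sketch of the module defined by the equations above. The reduction ratio `r`, the dilation value `d`, and the intermediate ReLU activations are illustrative assumptions, not values prescribed by the equations.

```python
import torch
import torch.nn as nn


class BAM(nn.Module):
    """Sketch of a bottleneck attention module following the equations above.

    The reduction ratio `r` and dilation `d` below are assumed hyperparameters.
    """

    def __init__(self, channels: int, r: int = 16, d: int = 4):
        super().__init__()
        # Channel attention branch: GAP(X) -> MLP (W_1, b_1, W_2, b_2) -> BN
        self.channel_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),             # GAP(X): B x C x 1 x 1
            nn.Flatten(),                        # B x C
            nn.Linear(channels, channels // r),  # W_1, b_1 (channel reduction)
            nn.ReLU(inplace=True),               # assumed intermediate nonlinearity
            nn.Linear(channels // r, channels),  # W_2, b_2
            nn.BatchNorm1d(channels),            # BN
        )
        # Spatial attention branch:
        # Conv_1^{1x1} -> DC_1^{3x3} -> DC_2^{3x3} -> Conv_2^{1x1} -> BN
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),  # channel reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels // r, kernel_size=3,
                      padding=d, dilation=d),                   # DC_1^{3x3}
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels // r, kernel_size=3,
                      padding=d, dilation=d),                   # DC_2^{3x3}
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, kernel_size=1),         # single spatial map
            nn.BatchNorm2d(1),                                  # BN
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s_c = self.channel_branch(x).view(b, c, 1, 1)  # s_c in R^C
        s_s = self.spatial_branch(x)                   # s_s in R^{HxW}
        # Expand both maps to C x H x W via broadcasting, sum, then sigmoid.
        s = torch.sigmoid(s_c + s_s)
        # Y = s X + X (residual attention)
        return s * x + x
```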

BAM can emphasize or suppress features in both the spatial and channel dimensions, improving representational power. The dimensionality reduction applied in both the channel and spatial attention branches lets it be integrated into any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, they still fall short of capturing long-range contextual information and of encoding cross-domain relationships.
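As a rough illustration of the integration claim, the sketch below drops the BAM module from above between the stages of a toy backbone. The stage layers are placeholder assumptions for this example, not a configuration from the paper, which places BAM at the bottlenecks between network stages.

```python
import torch
import torch.nn as nn

# Toy two-stage backbone with BAM inserted after each stage.
# The stage definitions are illustrative placeholders.
class TinyBackboneWithBAM(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.bam1 = BAM(64)   # BAM from the sketch above
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.bam2 = BAM(128)

    def forward(self, x):
        x = self.bam1(self.stage1(x))
        x = self.bam2(self.stage2(x))
        return x

model = TinyBackboneWithBAM()
y = model(torch.randn(2, 3, 32, 32))
print(y.shape)  # torch.Size([2, 128, 16, 16])

# Parameter overhead of the attention modules relative to the toy backbone.
bam_params = sum(p.numel() for m in (model.bam1, model.bam2) for p in m.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"BAM parameters: {bam_params} / {total_params} total")
```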