
What is: Dual Attention Network?

Source: Dual Attention Network for Scene Segmentation
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

In the field of scene segmentation, encoder-decoder structures cannot make use of the global relationships between objects, while RNN-based structures rely heavily on the output of long-term memorization. To address these problems, Fu et al. proposed a novel framework, the Dual Attention Network (DANet), for natural scene image segmentation. Unlike CBAM and BAM, it adopts a self-attention mechanism instead of simply stacking convolutions to compute the spatial attention map, which enables the network to capture global information directly.

DANet uses a position attention module and a channel attention module in parallel to capture feature dependencies in the spatial and channel domains, respectively. Given the input feature map $X$, the position attention module first applies convolution layers to obtain new feature maps. It then selectively aggregates the features at each position through a weighted sum of the features at all positions, where the weights are determined by the feature similarity between the corresponding pairs of positions. The channel attention module has a similar form, except that the affinity matrix is computed between channels (a $C \times C$ matrix) rather than between positions, to model cross-channel relations. Finally, the outputs of the two branches are fused to obtain the final feature representation. For simplicity, we reshape the feature map $X$ to $C \times (H \times W)$, whereupon the overall process can be written as

\begin{align}
Q, \quad K, \quad V &= W_q X, \quad W_k X, \quad W_v X \\
Y^\text{pos} &= X + V\,\text{Softmax}(Q^T K) \\
Y^\text{chn} &= X + \text{Softmax}(X X^T)\,X \\
Y &= Y^\text{pos} + Y^\text{chn}
\end{align}

where $W_q, W_k, W_v \in \mathbb{R}^{C \times C}$ are used to generate new feature maps.
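
The equations above translate almost line for line into code. Below is a minimal PyTorch sketch of those equations, not the official implementation: it assumes $W_q$, $W_k$, $W_v$ are realized as 1×1 convolutions and that the softmax is taken over the last dimension of each affinity matrix, and it omits the learnable scale parameters used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttention(nn.Module):
    """Sketch of DANet's position + channel attention branches."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions play the role of W_q, W_k, W_v (each C x C).
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w  # number of spatial positions

        # Reshape X to (B, C, N), matching the C x (H x W) layout above.
        x_flat = x.view(b, c, n)
        q = self.w_q(x).view(b, c, n)
        k = self.w_k(x).view(b, c, n)
        v = self.w_v(x).view(b, c, n)

        # Position attention: Y_pos = X + V Softmax(Q^T K).
        # Q^T K has shape (B, N, N): pairwise similarity of positions.
        pos_attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)
        y_pos = x_flat + torch.bmm(v, pos_attn)

        # Channel attention: Y_chn = X + Softmax(X X^T) X.
        # X X^T has shape (B, C, C): pairwise similarity of channel maps.
        chn_attn = F.softmax(torch.bmm(x_flat, x_flat.transpose(1, 2)), dim=-1)
        y_chn = x_flat + torch.bmm(chn_attn, x_flat)

        # Fuse the two branches and restore the spatial layout.
        return (y_pos + y_chn).view(b, c, h, w)
```

A quick shape check: DualAttention(64)(torch.randn(2, 64, 32, 32)) returns a tensor of shape (2, 64, 32, 32), so the module can be inserted between convolutional stages without changing the feature layout.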

The position attention module enables DANet to capture long-range contextual information and to adaptively integrate similar features at any scale from a global viewpoint, while the channel attention module enhances useful channels and suppresses noise. Explicitly taking spatial and channel relationships into consideration improves the feature representation for scene segmentation. However, the approach is computationally costly, especially for large input feature maps.
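
To make that final caveat concrete, here is a rough back-of-the-envelope count; the $H = W = 64$ feature-map size below is an illustrative assumption, not a figure from the paper:

\begin{align}
N &= H \times W = 64 \times 64 = 4096 \\
\text{Softmax}(Q^T K) &\in \mathbb{R}^{N \times N} \Rightarrow 4096^2 \approx 1.7 \times 10^7 \text{ entries per image}
\end{align}

so the position branch costs $O(N^2 C)$ time and $O(N^2)$ memory, while the channel branch costs $O(C^2 N)$. Since $N$ grows quadratically with the input resolution, the position attention matrix quickly dominates, which is why such modules are usually applied to heavily downsampled feature maps.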