What is: Dual Attention Network?
Source | Dual Attention Network for Scene Segmentation |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
In scene segmentation, encoder-decoder structures cannot exploit the global relationships between objects, whereas RNN-based structures rely heavily on the output of long-term memorization. To address these problems, Fu et al. proposed the dual attention network (DANet), a framework for natural scene image segmentation. Unlike CBAM and BAM, which compute the spatial attention map by stacking convolutions, DANet adopts a self-attention mechanism, which enables the network to capture global information directly.
DANet uses a position attention module and a channel attention module in parallel to capture feature dependencies in the spatial and channel domains. Given the input feature map $X \in \mathbb{R}^{C \times H \times W}$, the position attention module first applies convolution layers to obtain new feature maps. It then selectively aggregates the features at each position using a weighted sum of the features at all positions, where the weights are determined by the feature similarity between the corresponding pairs of positions. The channel attention module has a similar form to model cross-channel relations, except that it omits the dimension-reducing convolutions and computes attention directly from $X$. Finally, the outputs of the two branches are fused to obtain the final feature representation. For simplicity, we reshape the feature map $X$ to $C \times (H \times W)$, whereupon the overall process can be written as
\begin{align}
Q,\quad K,\quad V &= W_qX,\quad W_kX,\quad W_vX \\
Y^\text{pos} &= X + V\,\text{Softmax}(Q^TK) \\
Y^\text{chn} &= X + \text{Softmax}(XX^T)X \\
Y &= Y^\text{pos} + Y^\text{chn}
\end{align}
where $W_q$, $W_k$, $W_v$ are used to generate new feature maps.
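The four equations above can be traced in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature map is already reshaped to shape (C, N) with N = H x W, the convolutions are treated as 1x1 convolutions and therefore written as plain matrices `W_q`, `W_k`, `W_v`, and the softmax axes are chosen so that each output is a weighted sum whose weights add up to 1 (the formulas do not state the axis). The function names and the example sizes are hypothetical.

```python
import numpy as np


def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def dual_attention(X, W_q, W_k, W_v):
    """Apply the position and channel branches to a reshaped feature map.

    X:        (C, N) feature map, N = H * W positions.
    W_q, W_k: (C_red, C) projections (assumed 1x1 convolutions).
    W_v:      (C, C) projection, so the residual sum with X is well defined.
    """
    Q, K, V = W_q @ X, W_k @ X, W_v @ X

    # Position branch: an (N, N) map of pairwise position similarities.
    # Assumption: normalise each column, so output position j is a
    # convex combination of the value vectors at all positions.
    A_pos = softmax(Q.T @ K, axis=0)
    Y_pos = X + V @ A_pos

    # Channel branch: a (C, C) map computed directly from X.
    # Assumption: normalise each row, so output channel c is a
    # convex combination of all input channels.
    A_chn = softmax(X @ X.T, axis=1)
    Y_chn = X + A_chn @ X

    # Fuse the two branches by element-wise summation.
    return Y_pos + Y_chn


# Illustrative sizes only; they are not taken from the source.
rng = np.random.default_rng(0)
C, C_red, H, W = 64, 8, 16, 16
X = rng.standard_normal((C, H * W))
W_q = rng.standard_normal((C_red, C)) * 0.1
W_k = rng.standard_normal((C_red, C)) * 0.1
W_v = rng.standard_normal((C, C)) * 0.1

Y = dual_attention(X, W_q, W_k, W_v)
print(Y.shape)  # (64, 256)
```

The output has the same shape as the input, so it can be reshaped back to C x H x W and passed to the next layer. Note that the position branch materialises an N x N matrix, while the channel branch only needs a C x C matrix.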
The position attention module enables DANet to capture long-range contextual information and to adaptively integrate similar features at any scale from a global viewpoint, while the channel attention module enhances useful channels and suppresses noise. Explicitly taking spatial and channel relationships into account improves the feature representation for scene segmentation. However, DANet is computationally costly, especially for large input feature maps: the position attention map holds one weight for every pair of positions, so its size grows quadratically with the number of positions $H \times W$.
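To make the cost concrete, the following back-of-the-envelope sketch counts the memory needed for the two attention maps of a single image. It assumes 4-byte (float32) values and ignores all other activations; the function name and the feature map sizes are illustrative assumptions, not figures from the source.

```python
def attention_map_sizes(C, H, W, bytes_per_value=4):
    """Return the size of the position and channel attention maps.

    Assumes one dense (N, N) position map and one dense (C, C)
    channel map per image, with N = H * W.
    """
    N = H * W
    return {
        "positions": N,
        "position_map_bytes": N * N * bytes_per_value,
        "channel_map_bytes": C * C * bytes_per_value,
    }


MIB = 2**20
for side in (32, 64, 128):
    sizes = attention_map_sizes(C=512, H=side, W=side)
    print(
        f"{side}x{side}: "
        f"position map {sizes['position_map_bytes'] / MIB:.0f} MiB, "
        f"channel map {sizes['channel_map_bytes'] / MIB:.0f} MiB"
    )
# 32x32: position map 4 MiB, channel map 1 MiB
# 64x64: position map 64 MiB, channel map 1 MiB
# 128x128: position map 1024 MiB, channel map 1 MiB
```

Doubling the side length of the feature map quadruples the number of positions and makes the position attention map 16 times larger, while the channel attention map is unaffected by spatial resolution.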