What is: Channel-wise Cross Attention?
Source | UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Channel-wise Cross Attention is a module for semantic segmentation used in the UCTransNet architecture. It fuses features of inconsistent semantics between the Channel Transformer and the U-Net decoder: it guides channel-wise information filtration of the Transformer features and eliminates the ambiguity between them and the decoder features.
Mathematically, we take the $i$-th level Transformer output $O_i$ and the $i$-th level decoder feature map $D_i$ as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector $\mathcal{G}(X)$ with its $k$-th channel $\mathcal{G}_k(X) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X^k(i, j)$. We use this operation to embed the global spatial information and then generate the attention mask:

$$\mathcal{M}_i = \mathbf{L}_1 \cdot \mathcal{G}(O_i) + \mathbf{L}_2 \cdot \mathcal{G}(D_i),$$

where $\mathbf{L}_1$ and $\mathbf{L}_2$ are the weights of two Linear layers, followed by the ReLU operator $\delta(\cdot)$. This operation encodes the channel-wise dependencies. Following ECA-Net, which empirically showed that avoiding dimensionality reduction is important for learning channel attention, the authors use a single Linear layer and a sigmoid function to build the channel attention map. The resultant vector is used to recalibrate (excite) $O_i$ to $\hat{O}_i = \sigma(\mathcal{M}_i) \cdot O_i$, where the activation $\sigma(\mathcal{M}_i)$ indicates the importance of each channel. Finally, the masked $\hat{O}_i$ is concatenated with the up-sampled features of the $i$-th level decoder.
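To make the data flow concrete, here is a minimal PyTorch sketch of the module described above. It assumes $O_i$ and $D_i$ have already been brought to the same channel count and that the ReLU is applied to the summed mask before the sigmoid; the class and variable names are illustrative and not taken from the official UCTransNet code.

```python
import torch
import torch.nn as nn


class ChannelWiseCrossAttention(nn.Module):
    """Sketch of Channel-wise Cross Attention (CCA) for a U-Net-style decoder level."""

    def __init__(self, channels: int):
        super().__init__()
        # One Linear layer per input, no dimensionality reduction (as argued in ECA-Net)
        self.linear_o = nn.Linear(channels, channels)  # L1 for the Transformer output
        self.linear_d = nn.Linear(channels, channels)  # L2 for the decoder features
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, o_i: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        # o_i: i-th level Transformer output, d_i: i-th level decoder map, both (B, C, H, W)
        b, c, _, _ = o_i.shape
        # Spatial squeeze: global average pooling over H and W -> (B, C)
        g_o = o_i.mean(dim=(2, 3))
        g_d = d_i.mean(dim=(2, 3))
        # Attention mask M_i = L1 . G(O_i) + L2 . G(D_i)
        mask = self.relu(self.linear_o(g_o) + self.linear_d(g_d))
        # Channel attention map sigma(M_i), used to excite O_i channel by channel
        attn = self.sigmoid(mask).view(b, c, 1, 1)
        return o_i * attn  # masked O_i, later concatenated with up-sampled decoder features


# Usage: fuse a skip connection at one decoder level (shapes are illustrative)
cca = ChannelWiseCrossAttention(channels=64)
o_i = torch.randn(2, 64, 32, 32)   # Transformer output at level i
d_i = torch.randn(2, 64, 32, 32)   # decoder feature map at level i
o_hat = cca(o_i, d_i)              # (2, 64, 32, 32), ready to concatenate with d_i
```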