What is: Scale-wise Feature Aggregation Module?

Source: M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

SFAM, or Scale-wise Feature Aggregation Module, is a feature extraction block from the M2Det architecture. It aims to aggregate the multi-level multi-scale features generated by Thinned U-Shaped Modules into a multi-level feature pyramid.

The first stage of SFAM is to concatenate features of the equivalent scale together along the channel dimension. The aggregated feature pyramid can be presented as $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_i]$, where $\mathbf{X}_i = \text{Concat}(\mathbf{x}_i^1, \mathbf{x}_i^2, \dots, \mathbf{x}_i^L) \in \mathbb{R}^{W_i \times H_i \times C}$ refers to the features of the $i$-th largest scale. Here, each scale in the aggregated pyramid contains features from multi-level depths.
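The concatenation stage can be sketched in a few lines. This is a minimal illustration rather than the M2Det implementation: it assumes each feature map is a plain nested list stored channel-first as `[channel][row][col]` (the text writes shapes as $W_i \times H_i \times C$), and the names `aggregate_by_scale` and `make_map` as well as the toy sizes are hypothetical. A real implementation would use a tensor library's concatenate operation instead.

```python
def aggregate_by_scale(levels):
    """Concatenate same-scale feature maps from all levels along the channel axis.

    levels[l][i] is the feature map produced by level l at scale i, stored
    channel-first as a nested list [channel][row][col] (layout is an assumption).
    Returns a list with one aggregated map per scale.
    """
    num_scales = len(levels[0])
    pyramid = []
    for i in range(num_scales):
        merged = []
        for level in levels:
            fmap = level[i]
            # Maps at the same scale must share height and width.
            if merged and (len(fmap[0]) != len(merged[0])
                           or len(fmap[0][0]) != len(merged[0][0])):
                raise ValueError("spatial size mismatch at scale %d" % i)
            # Channel-first layout: appending channels is list concatenation.
            merged.extend(fmap)
        pyramid.append(merged)
    return pyramid


def make_map(channels, height, width, value):
    """Build a constant feature map, used here only as toy input."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Hypothetical toy setup: L = 3 levels, 2 scales (4x4 and 2x2), 2 channels each.
levels = [
    [make_map(2, 4, 4, float(l)), make_map(2, 2, 2, float(l))]
    for l in range(3)
]
pyramid = aggregate_by_scale(levels)
shapes = [(len(x), len(x[0]), len(x[0][0])) for x in pyramid]
print(shapes)  # [(6, 4, 4), (6, 2, 2)]
```

Each aggregated scale ends up with $L$ times the per-level channel count (here 3 × 2 = 6), while the spatial size of that scale is unchanged.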

However, simple concatenation is not adaptive enough. In the second stage, we introduce a channel-wise attention module to encourage features to focus on the channels from which they benefit most. Following Squeeze-and-Excitation, we use global average pooling to generate channel-wise statistics $\mathbf{z} \in \mathbb{R}^C$ at the squeeze step. To fully capture channel-wise dependencies, the following excitation step learns the attention mechanism via two fully connected layers:

$$\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z},\mathbf{W}) = \sigma(\mathbf{W}_{2} \delta(\mathbf{W}_{1}\mathbf{z})),$$

where $\delta$ refers to the ReLU function, $\sigma$ refers to the sigmoid function, $\mathbf{W}_{1} \in \mathbb{R}^{\frac{C}{r}\times C}$, $\mathbf{W}_{2} \in \mathbb{R}^{C\times \frac{C}{r}}$, and $r$ is the reduction ratio ($r=16$ in our experiments). The final output is obtained by reweighting the input $\mathbf{X}$ with activation $\mathbf{s}$:

$$\tilde{\mathbf{X}}_i^c = \mathbf{F}_{scale}(\mathbf{X}_i^c, s_c) = s_c \cdot \mathbf{X}_i^c,$$

where $\tilde{\mathbf{X}}_i = [\tilde{\mathbf{X}}_i^1, \tilde{\mathbf{X}}_i^2, \dots, \tilde{\mathbf{X}}_i^C]$, and each of the features is enhanced or weakened by the rescaling operation.
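The squeeze, excitation, and rescaling steps for a single scale can be put together as a rough sketch. It assumes the Squeeze-and-Excitation convention of a ReLU on the inner layer and a sigmoid on the outer one, so that every weight $s_c$ lies in $(0, 1)$. The weights here are random placeholders rather than learned parameters, bias terms are omitted as in the formula, the channel-first nested-list layout is an assumption, and `channel_attention` is a hypothetical name.

```python
import math
import random


def channel_attention(x, w1, w2):
    """Reweight the channels of x in Squeeze-and-Excitation style.

    x  : feature map as a nested list [C][H][W] (channel-first, an assumption).
    w1 : C/r rows of length C   (first fully connected layer, no bias).
    w2 : C rows of length C/r   (second fully connected layer, no bias).
    Returns the rescaled map and the per-channel weights s.
    """
    # Squeeze: global average pooling gives one statistic z_c per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: s = sigmoid(W2 . relu(W1 . z)).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every value in channel c by s_c.
    out = [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(x, s)]
    return out, s


# Hypothetical toy sizes: C = 32 channels and r = 16, so the bottleneck has 2 units.
random.seed(0)
C, r, H, W = 32, 16, 4, 4
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]

x_tilde, s = channel_attention(x, w1, w2)
print(len(s), min(s) > 0.0, max(s) < 1.0)
```

Because the sigmoid bounds each $s_c$ between 0 and 1, a channel's response can only shrink relative to the input here. "Enhanced or weakened" is therefore best read as relative emphasis among channels.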