What is: Adaptively Spatial Feature Fusion?
| Source | Learning Spatial Fusion for Single-Shot Object Detection |
| Year | 2019 |
| Data Source | CC BY-SA - https://paperswithcode.com |
ASFF, or Adaptively Spatial Feature Fusion, is a method for pyramidal feature fusion. It learns to spatially filter conflicting information in order to suppress inconsistency across different feature scales, thus improving the scale-invariance of features.
ASFF enables the network to directly learn how to spatially filter features from other levels so that only useful information is kept for combination. For the features at a certain level, features from the other levels are first resized to the same resolution and then trained to find the optimal fusion. At each spatial location, features from different levels are fused adaptively, i.e., some features may be filtered out because they carry contradictory information at this location, while others may dominate with more discriminative clues. ASFF offers several advantages: (1) since the search for the optimal fusion is differentiable, it can be conveniently learned through back-propagation; (2) it is agnostic to the backbone model and can be applied to any single-shot detector with a feature pyramid structure; and (3) its implementation is simple and the added computational cost is marginal.
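A minimal numpy sketch of the resizing step described above. Note the paper uses interpolation (with a 1×1 convolution) for upsampling and strided 3×3 convolutions for downsampling; the nearest-neighbor repeat and plain stride subsampling below are simplified stand-ins, and all shapes are illustrative:

```python
import numpy as np

def upsample_nn(x, factor=2):
    """Nearest-neighbor upsampling of a (C, H, W) feature map.
    (Simplified stand-in for 1x1 conv + interpolation.)"""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_stride(x, factor=2):
    """Plain stride subsampling of a (C, H, W) feature map.
    (Simplified stand-in for a strided 3x3 convolution.)"""
    return x[:, ::factor, ::factor]

# Bring a coarse level-1 map and a fine level-3 map to level-2 resolution.
x1 = np.ones((8, 10, 10))           # coarse level, 10x10
x3 = np.ones((8, 40, 40))           # fine level, 40x40
x1_to_2 = upsample_nn(x1, 2)        # now (8, 20, 20)
x3_to_2 = downsample_stride(x3, 2)  # now (8, 20, 20)
```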
Let $x_{ij}^{n \to l}$ denote the feature vector at position $(i, j)$ on the feature maps resized from level $n$ to level $l$. Following the feature resizing stage, we fuse the features at the corresponding level $l$ as follows:

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}$$
where $y_{ij}^{l}$ implies the $(i, j)$-th vector of the output feature maps $y^{l}$ among channels. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ refer to the spatial importance weights for the feature maps at three different levels to level $l$, which are adaptively learned by the network. Note that $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ can be simple scalar variables, which are shared across all the channels. Inspired by ACNet, we force $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1]$, and define

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$$
Here $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ are defined using the softmax function with $\lambda_{\alpha_{ij}}^{l}$, $\lambda_{\beta_{ij}}^{l}$ and $\lambda_{\gamma_{ij}}^{l}$ as control parameters, respectively. We use $1 \times 1$ convolution layers to compute the weight scalar maps $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$ and $\lambda_{\gamma}^{l}$ from $x^{1 \to l}$, $x^{2 \to l}$ and $x^{3 \to l}$ respectively, so they can be learned through standard back-propagation.
With this method, the features at all levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline as YOLOv3.