
What is: Deformable Attention Module?

Source: Deformable DETR: Deformable Transformers for End-to-End Object Detection
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

The Deformable Attention Module is an attention module used in the Deformable DETR architecture. It seeks to overcome one issue with standard Transformer attention: that it looks over all possible spatial locations. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small, fixed number of keys to each query, the issues of slow convergence and limited feature spatial resolution can be mitigated.

Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $\mathbf{z}_{q}$ and a 2-d reference point $\mathbf{p}_{q}$. The deformable attention feature is calculated by:

$$\text{DeformAttn}\left(\mathbf{z}_{q}, \mathbf{p}_{q}, \mathbf{x}\right)=\sum_{m=1}^{M} \mathbf{W}_{m}\left[\sum_{k=1}^{K} A_{mqk} \cdot \mathbf{W}_{m}^{\prime} \mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{mqk}\right)\right]$$

where $m$ indexes the attention head, $k$ indexes the sampled keys, $K$ is the total number of sampled keys ($K \ll HW$), and $\mathbf{W}_{m}$, $\mathbf{W}_{m}^{\prime}$ are learnable projection weights. $\Delta \mathbf{p}_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $m^{\text{th}}$ attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $[0, 1]$ and is normalized so that $\sum_{k=1}^{K} A_{mqk} = 1$. The offsets $\Delta \mathbf{p}_{mqk} \in \mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $\mathbf{p}_{q}+\Delta \mathbf{p}_{mqk}$ is fractional, bilinear interpolation is applied, as in Dai et al. (2017), to compute $\mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{mqk}\right)$.

Both $\Delta \mathbf{p}_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query feature $\mathbf{z}_{q}$. In implementation, the query feature $\mathbf{z}_{q}$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta \mathbf{p}_{mqk}$ and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
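
The description above maps onto a few tensor operations. Below is a minimal single-scale PyTorch sketch, not the official implementation: the class and argument names (`DeformableAttentionSketch`, `num_heads`, `num_points`) are my own, the offsets are assumed to be predicted in normalized coordinates, and bilinear interpolation is done with `F.grid_sample` instead of the custom CUDA kernel used in the official Deformable DETR code, which also operates over multi-scale feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttentionSketch(nn.Module):
    """Single-scale deformable attention (illustrative sketch, not the official op)."""

    def __init__(self, d_model=256, num_heads=8, num_points=4):
        super().__init__()
        self.num_heads = num_heads        # M in the formula
        self.num_points = num_points      # K in the formula
        self.head_dim = d_model // num_heads
        # Together these mirror the 3MK-channel projection of z_q:
        # 2MK channels for the offsets Delta p_mqk, MK channels for the weights A_mqk.
        self.sampling_offsets = nn.Linear(d_model, num_heads * num_points * 2)
        self.attention_weights = nn.Linear(d_model, num_heads * num_points)
        self.value_proj = nn.Linear(d_model, d_model)     # plays the role of W'_m
        self.output_proj = nn.Linear(d_model, d_model)    # plays the role of W_m

    def forward(self, query, reference_points, feature_map):
        # query:            (N, Lq, C)   content features z_q
        # reference_points: (N, Lq, 2)   normalized (x, y) in [0, 1]
        # feature_map:      (N, C, H, W) input feature map x
        N, Lq, _ = query.shape
        M, K, D = self.num_heads, self.num_points, self.head_dim
        H, W = feature_map.shape[-2:]

        # Offsets and attention weights both come from linear projections of z_q;
        # the weights are softmax-normalized over the K sampled points.
        offsets = self.sampling_offsets(query).view(N, Lq, M, K, 2)
        weights = self.attention_weights(query).view(N, Lq, M, K).softmax(-1)

        # Sampling locations p_q + Delta p_mqk, rescaled to [-1, 1] for grid_sample
        # (this sketch assumes offsets are predicted in normalized coordinates).
        locations = reference_points[:, :, None, None, :] + offsets
        grid = 2.0 * locations - 1.0                        # (N, Lq, M, K, 2)
        grid = grid.permute(0, 2, 1, 3, 4).flatten(0, 1)    # (N*M, Lq, K, 2)

        # Project the feature map to values and split them per head.
        value = self.value_proj(feature_map.flatten(2).transpose(1, 2))     # (N, H*W, C)
        value = value.transpose(1, 2).reshape(N, M, D, H, W).flatten(0, 1)  # (N*M, D, H, W)

        # Bilinear interpolation x(p_q + Delta p_mqk) at the fractional locations.
        sampled = F.grid_sample(value, grid, mode="bilinear",
                                align_corners=False)        # (N*M, D, Lq, K)

        # Weighted sum over the K sampled keys, then merge the M heads.
        w = weights.permute(0, 2, 1, 3).flatten(0, 1)       # (N*M, Lq, K)
        out = (sampled * w[:, None]).sum(-1)                # (N*M, D, Lq)
        out = out.view(N, M * D, Lq).transpose(1, 2)        # (N, Lq, C)
        return self.output_proj(out)
```

Here `num_heads` and `num_points` play the roles of $M$ and $K$, and the two linear layers `sampling_offsets` and `attention_weights` together correspond to the single $3MK$-channel projection described above, split into its $2MK$ offset channels and $MK$ weight channels. For example, with the default sizes, a query tensor of shape (2, 100, 256), reference points of shape (2, 100, 2), and a feature map of shape (2, 256, 32, 32) yield an output of shape (2, 100, 256).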