What is: Spatial Transformer Networks?

Source: Spatial Transformer Networks
Year: 2015
Data Source: CC BY-SA - https://paperswithcode.com

Spatial transformer networks (STNs) use an explicit procedure to learn invariance to translation, scaling, rotation, and more general warps, so that the network attends to the most relevant regions. The STN was the first attention mechanism to explicitly predict important regions and to provide a deep neural network with transformation invariance.

Taking a 2D image as an example, a 2D affine transformation can be formulated as follows, where $A$ denotes a learnable $2 \times 3$ affine matrix:

\begin{align}
A &= f_{\text{loc}}(U) \\
x_i^s &= A x_i^t
\end{align}
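
To make the shapes explicit, the second equation can be expanded as follows. This is a sketch under two assumptions: $x_i^t$ is taken in homogeneous form (a trailing 1 is appended so that a $2 \times 3$ matrix can act on it), and $a_{jk}$ are illustrative names for the entries of $A$. Each coordinate vector is written out as an $(x, y)$ pair:

\begin{align}
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
\end{align}

Under this reading, $a_{13}$ and $a_{23}$ carry the translation, while the left $2 \times 2$ block carries rotation, scaling, and shear. Setting $a_{11} = a_{22} = 1$ and all other entries to zero gives the identity warp.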

Here, $U$ is the input feature map, and $f_{\text{loc}}$ can be any differentiable function, such as a lightweight fully connected network or a convolutional neural network. $x_i^t$ denotes the target coordinates of a point on the regular grid of the output feature map, written in homogeneous form so that the $2 \times 3$ matrix $A$ can be applied to it, and $x_i^s$ denotes the corresponding source coordinates in the input feature map. Given this correspondence, the network samples the input feature map at the source locations. To keep the whole process differentiable, so that it can be trained end to end, bilinear sampling is used to sample the input features.
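
The grid generation and bilinear sampling steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation, and it rests on several assumptions. It follows the original STN formulation, in which the target grid lives on the output and the source points lie in the input. Coordinates are normalized to [-1, 1], the feature map has a single channel, and locations outside the input read as zero. The helper names `affine_grid` and `bilinear_sample` are hypothetical, and the matrix `A` is given by hand rather than predicted by a localization network.

```python
import math


def affine_grid(A, out_h, out_w):
    """Map every output (target) pixel to a source location in the input.

    A is a 2x3 affine matrix as nested lists. Coordinates are normalized
    to [-1, 1] along both axes (an assumed convention).
    """
    grid = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Normalized target coordinates of this output pixel.
            x_t = 2.0 * c / (out_w - 1) - 1.0 if out_w > 1 else 0.0
            y_t = 2.0 * r / (out_h - 1) - 1.0 if out_h > 1 else 0.0
            # Source coordinates: A applied to the homogeneous (x_t, y_t, 1).
            x_s = A[0][0] * x_t + A[0][1] * y_t + A[0][2]
            y_s = A[1][0] * x_t + A[1][1] * y_t + A[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(U, grid):
    """Sample a single-channel feature map U (list of rows) at grid points."""
    in_h, in_w = len(U), len(U[0])

    def pixel(r, c):
        # Zero padding outside the input (an assumed boundary rule).
        return U[r][c] if 0 <= r < in_h and 0 <= c < in_w else 0.0

    V = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Convert normalized source coordinates to fractional pixel indices.
            px = (x_s + 1.0) * (in_w - 1) / 2.0
            py = (y_s + 1.0) * (in_h - 1) / 2.0
            c0, r0 = math.floor(px), math.floor(py)
            wx, wy = px - c0, py - r0
            # Weighted average of the four neighbouring input pixels.
            value = (
                (1 - wy) * (1 - wx) * pixel(r0, c0)
                + (1 - wy) * wx * pixel(r0, c0 + 1)
                + wy * (1 - wx) * pixel(r0 + 1, c0)
                + wy * wx * pixel(r0 + 1, c0 + 1)
            )
            out_row.append(value)
        V.append(out_row)
    return V


U = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0]]
flip_x = [[-1, 0, 0], [0, 1, 0]]
V_same = bilinear_sample(U, affine_grid(identity, 3, 3))  # reproduces U
V_flip = bilinear_sample(U, affine_grid(flip_x, 3, 3))    # mirrors each row
```

Because each output value is a weighted average whose weights depend continuously on the source coordinates, gradients can flow back to the entries of `A`. This is the property that lets a localization network be trained jointly with the rest of the model.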

STNs focus on discriminative regions automatically and learn invariance to some geometric transformations.