
What is: LocalViT?

Source: LocalViT: Bringing Locality to Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

LocalViT introduces depth-wise convolutions to enhance the local feature modeling capability of ViTs. The network, as shown in Figure (c), brings a locality mechanism into transformers through the depth-wise convolution (denoted by "DW"). To accommodate the convolution operation, conversion between the token sequence and the image feature map is added via "Seq2Img" and "Img2Seq". The computation is as follows:

$$\mathbf{Y}^{r}=f\left(f\left(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r}\right) \circledast \mathbf{W}_{d}\right) \circledast \mathbf{W}_{2}^{r}$$

where $\mathbf{W}_{d} \in \mathbb{R}^{\gamma d \times 1 \times k \times k}$ is the kernel of the depth-wise convolution.
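In code, $\mathbf{W}_d$ corresponds to a grouped convolution whose group count equals the channel count, so each channel is filtered by its own $k \times k$ kernel. A minimal PyTorch sketch, with an assumed expansion ratio $\gamma = 4$, embedding dimension $d = 192$, and kernel size $k = 3$ (illustrative values only):

```python
import torch.nn as nn

# Depth-wise convolution: groups equals the channel count, so the weight
# tensor has shape (gamma*d, 1, k, k), matching W_d in the formula above.
gamma, d, k = 4, 192, 3  # assumed expansion ratio, embedding dim, kernel size
dw_conv = nn.Conv2d(gamma * d, gamma * d, kernel_size=k, padding=k // 2, groups=gamma * d)
print(dw_conv.weight.shape)  # torch.Size([768, 1, 3, 3])
```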

The input (a sequence of tokens) is first reshaped to a feature map arranged on a 2D lattice. Two 1×1 convolutions (corresponding to $\mathbf{W}_1^r$ and $\mathbf{W}_2^r$) along with the depth-wise convolution ($\mathbf{W}_d$) are applied to the feature map. The result is reshaped back to a sequence of tokens, which is then used by the self-attention of the next transformer layer.
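Below is a minimal sketch of this locally-enhanced feed-forward path, assuming PyTorch, an input containing only patch tokens (a class token, if present, would be split off and handled separately), GELU as the non-linearity $f$, and illustrative hyper-parameters. The module and argument names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    """Sketch of Y = f(f(Z * W1) * Wd) * W2 with 1x1 convs for W1/W2
    and a depth-wise k x k convolution for Wd."""

    def __init__(self, dim, expansion=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion                          # gamma * d channels
        self.conv1 = nn.Conv2d(dim, hidden, 1)            # W1: 1x1 expansion conv
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2,
                                groups=hidden)            # Wd: depth-wise conv
        self.conv2 = nn.Conv2d(hidden, dim, 1)            # W2: 1x1 reduction conv
        self.act = nn.GELU()                              # non-linearity f (assumed)

    def forward(self, x, h, w):
        # x: (B, N, C) sequence of patch tokens, with N = h * w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)         # Seq2Img
        x = self.act(self.conv1(x))
        x = self.act(self.dwconv(x))
        x = self.conv2(x)
        x = x.reshape(b, c, n).transpose(1, 2)            # Img2Seq
        return x


# Usage example: 14 x 14 grid of 192-dim patch tokens.
ffn = LocalityFeedForward(dim=192)
tokens = torch.randn(2, 196, 192)
out = ffn(tokens, h=14, w=14)                             # (2, 196, 192)
```

The key design choice captured here is that the feed-forward network of the transformer is reinterpreted as a sequence of 1×1 convolutions, which makes it natural to slot a depth-wise convolution between them to mix information between neighboring patches.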