What is: LocalViT?

LocalViT aims to introduce depthwise convolutions to enhance local features modeling capability of ViTs. The network, as shown in Figure (c), brings localist mechanism into transformers through the depth-wise convolution (denoted by "DW"). To cope with the convolution operation, the conversation between sequence and image feature map is added by "Seq2Img" and "Img2Seq". The computation is as follows:

\mathbf{Y}^{r}=f\left(f\left(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r} \right) \circledast \mathbf{W}_d \right) \circledast \mathbf{W}_2^{r}

where $\mathbf{W}_{d} \in \mathbb{R}^{\gamma d \times 1 \times k \times k}$ is the kernel of the depth-wise convolution.

The input (sequence of tokens) is first reshaped to a feature map rearranged on a 2D lattice. Two convolutions along with a depth-wise convolution are applied to the feature map. The feature map is reshaped to a sequence of tokens which are used as by the self-attention of the network transformer layer.

Source	LocalViT: Bringing Locality to Vision Transformers
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software

What is: LocalViT?

Viet-Anh on Software