What is: Convolution-enhanced image Transformer?
Source | Incorporating Convolution Designs into Visual Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Convolution-enhanced image Transformer (CeiT) combines the strengths of CNNs (extracting low-level features and strengthening locality) with the strength of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of tokenizing raw input images directly, an Image-to-Tokens (I2T) module extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighbouring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) module is attached at the top of the Transformer to make use of the multi-level representations.
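To make the second modification more concrete, here is a minimal pure-Python sketch of the spatial mixing idea behind LeFF: patch tokens are laid back out on their 2D grid, a depth-wise 3x3 convolution mixes each token with its neighbours, and the result is flattened back into a sequence. This is an illustration under stated assumptions, not the authors' implementation. The function names (`depthwise_conv3x3`, `local_token_mixing`), the zero padding, the 3x3 kernel size, and the convention that the class token sits at index 0 and bypasses the convolution are all assumptions made for the example; the linear projections, normalization, and activations of a full feed-forward layer are omitted.

```python
from typing import List

Token = List[float]          # one token: a vector of C channel values
Kernel = List[List[float]]   # one 3x3 kernel for a single channel


def depthwise_conv3x3(grid: List[List[Token]], kernels: List[Kernel]) -> List[List[Token]]:
    """Depth-wise 3x3 convolution (stride 1, zero padding) over an H x W x C grid.

    Each channel is filtered with its own kernel, so channels are never mixed.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Out-of-bounds neighbours count as zero (zero padding).
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
                out[y][x][ch] = acc
    return out


def local_token_mixing(tokens: List[Token], height: int, width: int,
                       kernels: List[Kernel]) -> List[Token]:
    """Mix neighbouring patch tokens spatially; the class token passes through.

    Assumes tokens[0] is the class token and tokens[1:] are patch tokens in
    row-major order on a height x width grid.
    """
    cls_token, patch_tokens = tokens[0], tokens[1:]
    if len(patch_tokens) != height * width:
        raise ValueError("number of patch tokens must equal height * width")
    # Restore the 2D layout that the flat token sequence hides.
    grid = [patch_tokens[r * width:(r + 1) * width] for r in range(height)]
    mixed = depthwise_conv3x3(grid, kernels)
    # Flatten back to a sequence and re-attach the untouched class token.
    return [cls_token] + [tok for row in mixed for tok in row]


# Usage: a 2x2 grid of single-channel tokens plus a class token.
tokens = [[9.0], [1.0], [2.0], [3.0], [4.0]]
sum_kernel = [[[1.0] * 3 for _ in range(3)]]  # one all-ones kernel for one channel
mixed_tokens = local_token_mixing(tokens, height=2, width=2, kernels=sum_kernel)
```

With an all-ones kernel on a 2x2 grid, every patch position sees all four patches, so each patch token becomes their sum while the class token is unchanged. In a deep-learning framework the same step would typically be a grouped 2D convolution with one group per channel.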