
What is: Twins-PCPVT?

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Twins-PCPVT is a type of vision transformer that combines global attention, specifically the globally sub-sampled attention proposed in the Pyramid Vision Transformer (PVT), with conditional position encodings (CPE) that replace the absolute position encodings used in PVT.
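The sketch below illustrates the sub-sampled attention idea in PyTorch, assuming tokens come from a flattened H x W feature map; the class name, `sr_ratio` argument, and layer choices are illustrative, not the authors' implementation. Keys and values are computed from a spatially reduced copy of the feature map, so every query still attends globally but over far fewer positions.

```python
import torch
import torch.nn as nn

class GlobalSubsampledAttention(nn.Module):
    """Multi-head attention whose keys/values come from a sub-sampled map."""

    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Strided conv sub-samples the map that produces K and V (assumption:
        # a conv with kernel == stride == sr_ratio, one common choice).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W flattened tokens
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Reduce the spatial map before computing keys and values.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        # Standard scaled dot-product attention over the reduced K/V set.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The design trade-off: sub-sampling shrinks the key/value set by a factor of sr_ratio squared, cutting attention cost well below the quadratic cost of full self-attention while keeping a global receptive field.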

The position encoding generator (PEG), which produces the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used: a 2D depth-wise convolution without batch normalization. For image-level classification, following CPVT, the class token is removed and global average pooling is applied at the end of the final stage. For other vision tasks, the design of PVT is followed.
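A minimal sketch of the PEG and its placement, again in PyTorch with illustrative names (`PEG`, `stage_forward`, and `blocks` are hypothetical helpers, not the released code). The PEG is a depth-wise convolution (groups equal to channels, no batch normalization) applied to the reshaped token map with a residual connection, so the position signal is conditioned on each token's local neighborhood.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Conditional position encoding via a depth-wise conv (no BatchNorm)."""

    def __init__(self, dim, k=3):
        super().__init__()
        # groups=dim makes the conv depth-wise: one spatial filter per channel.
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        # Residual connection: add the conditional encoding to the tokens.
        return x + self.proj(feat).flatten(2).transpose(1, 2)

def stage_forward(x, H, W, blocks, peg):
    """One stage: the PEG is inserted right after the first encoder block."""
    for i, blk in enumerate(blocks):
        x = blk(x, H, W)
        if i == 0:
            x = peg(x, H, W)
    return x

# For classification (assumed head), the class token is dropped and the
# final-stage tokens are pooled: logits = head(x.mean(dim=1))
```

Because the convolution sees the 2D layout of the tokens, this encoding adapts to the input resolution, which is what lets the absolute position encodings of PVT be dropped.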