
What is: NesT?

Source: Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

NesT stacks canonical transformer layers that conduct local self-attention on every image block independently, and then "nests" them hierarchically. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure is determined by two key hyper-parameters: the patch size S × S and the number of block hierarchies T_d. All blocks inside each hierarchy share one set of parameters. Given an input image, it is linearly projected to patch embeddings; all embeddings are partitioned into blocks and flattened to generate the final input. Each transformer layer is composed of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN), with skip connections and layer normalization. Positional embeddings are added to encode spatial information before feeding into the block. Lastly, a nested hierarchy with block aggregation is built: every four spatially connected blocks are merged into one.
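To make the data flow concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation. The names (`BlockedTransformerLayer`, `BlockAggregation`, `NesTSketch`), the default dimensions, and the exact conv + LayerNorm + max-pool aggregation are illustrative assumptions based on the paper's description:

```python
import torch
import torch.nn as nn


def to_blocks(feat, block):
    """Partition a (B, C, H, W) feature map into non-overlapping blocks and
    flatten each one: returns (B * num_blocks, block * block, C)."""
    B, C, H, W = feat.shape
    feat = feat.reshape(B, C, H // block, block, W // block, block)
    feat = feat.permute(0, 2, 4, 3, 5, 1)              # B, gh, gw, bh, bw, C
    return feat.reshape(-1, block * block, C)


def from_blocks(tokens, block, H, W, B):
    """Inverse of to_blocks: (B * num_blocks, block * block, C) -> (B, C, H, W)."""
    C = tokens.shape[-1]
    tokens = tokens.reshape(B, H // block, W // block, block, block, C)
    tokens = tokens.permute(0, 5, 1, 3, 2, 4)
    return tokens.reshape(B, C, H, W)


class BlockedTransformerLayer(nn.Module):
    """One canonical transformer layer (MSA + FFN with skip connections and
    layer normalization), applied to every block independently."""

    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                              # x: (blocks, seq, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                  # attention stays local
        return x + self.ffn(self.norm2(x))


class BlockAggregation(nn.Module):
    """Merge every 4 spatially adjacent blocks into one: unblock to a feature
    map, apply conv + LayerNorm + 3x3 max-pooling (stride 2), then re-block."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, 3, padding=1)
        self.norm = nn.LayerNorm(dim_out)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)

    def forward(self, feat):                           # (B,C,H,W) -> (B,C',H/2,W/2)
        feat = self.conv(feat)
        feat = self.norm(feat.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.pool(feat)


class NesTSketch(nn.Module):
    """T_d = len(dims) hierarchies; all blocks in a hierarchy share parameters
    because the same layers process every block along the batch dimension."""

    def __init__(self, patch=2, block=4, dims=(96, 192, 384),
                 depth=2, heads=4, num_classes=10):
        super().__init__()
        self.block = block
        self.patchify = nn.Conv2d(3, dims[0], patch, stride=patch)  # S x S patches
        self.pos = nn.ParameterList(
            nn.Parameter(torch.zeros(1, block * block, d)) for d in dims)
        self.levels = nn.ModuleList(
            nn.ModuleList(BlockedTransformerLayer(d, heads) for _ in range(depth))
            for d in dims)
        self.aggregates = nn.ModuleList(
            BlockAggregation(dims[i], dims[i + 1]) for i in range(len(dims) - 1))
        self.head = nn.Sequential(nn.LayerNorm(dims[-1]),
                                  nn.Linear(dims[-1], num_classes))

    def forward(self, x):
        B = x.shape[0]
        feat = self.patchify(x)                        # linear patch projection
        for i, layers in enumerate(self.levels):
            _, _, H, W = feat.shape
            tokens = to_blocks(feat, self.block) + self.pos[i]  # add positions
            for layer in layers:
                tokens = layer(tokens)                 # per-block self-attention
            feat = from_blocks(tokens, self.block, H, W, B)
            if i < len(self.aggregates):
                feat = self.aggregates[i](feat)        # 4 blocks -> 1 block
        return self.head(feat.mean(dim=(2, 3)))       # global pool + classifier


model = NesTSketch()
print(model(torch.randn(2, 3, 32, 32)).shape)          # torch.Size([2, 10])
```

Because every block is folded into the batch dimension before attention, the same transformer layers process all blocks of a hierarchy, so the parameter sharing described above comes for free, and each max-pooled aggregation halves the token grid so the block count drops by 4x per hierarchy.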