
What is: Shuffle Transformer?

Source: Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

The Shuffle Transformer block consists of the Shuffle Multi-Head Self-Attention module (ShuffleMHSA), the Neighbor-Window Connection module (NWC), and the MLP module. To introduce cross-window connections while keeping the efficient computation of non-overlapping windows, consecutive Shuffle Transformer blocks alternate between WMSA and Shuffle-WMSA. The first window-based transformer block uses the regular window partition strategy, and the second uses window-based self-attention with spatial shuffle. In addition, the Neighbor-Window Connection module (NWC) is added to each block to enhance connections among neighboring windows. The proposed Shuffle Transformer block can thus build rich cross-window connections and augment the representation. Consecutive Shuffle Transformer blocks are computed as:

$$x^{l}=\mathbf{WMSA}\left(\mathbf{BN}\left(z^{l-1}\right)\right)+z^{l-1}$$

$$y^{l}=\mathbf{NWC}\left(x^{l}\right)+x^{l}$$

$$z^{l}=\mathbf{MLP}\left(\mathbf{BN}\left(y^{l}\right)\right)+y^{l}$$

$$x^{l+1}=\mathbf{Shuffle\text{-}WMSA}\left(\mathbf{BN}\left(z^{l}\right)\right)+z^{l}$$

$$y^{l+1}=\mathbf{NWC}\left(x^{l+1}\right)+x^{l+1}$$

$$z^{l+1}=\mathbf{MLP}\left(\mathbf{BN}\left(y^{l+1}\right)\right)+y^{l+1}$$

where $x^l$, $y^l$ and $z^l$ denote the output features of the (Shuffle-)WMSA module, the Neighbor-Window Connection module and the MLP module for block $l$, respectively; WMSA and Shuffle-WMSA denote window-based multi-head self-attention without/with spatial shuffle, respectively.
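
The six equations can be traced with a small toy sketch. It is not the authors' implementation: tokens are a 1D list of floats rather than a 2D feature map, attention is replaced by a uniform average inside each window, and BN, NWC and MLP are passed in as arbitrary callables. The helper names (`regular_windows`, `shuffled_windows`, `window_mean`, `block_pair`) are hypothetical. The shuffle grouping assumes a ShuffleNet-style reshape and transpose, which gathers tokens at a fixed stride into one window. Only the residual structure and the alternation between regular and shuffled windows are taken from the text above.

```python
from typing import Callable, List

Tokens = List[float]
Module = Callable[[Tokens], Tokens]


def regular_windows(n: int, m: int) -> List[List[int]]:
    """Non-overlapping windows of m consecutive token indices (n divisible by m)."""
    return [list(range(start, start + m)) for start in range(0, n, m)]


def shuffled_windows(n: int, m: int) -> List[List[int]]:
    """Assumed 1D spatial shuffle: window i gathers tokens i, i + n//m, i + 2*(n//m), ..."""
    stride = n // m
    return [[i + k * stride for k in range(m)] for i in range(stride)]


def window_mean(windows: List[List[int]]) -> Module:
    """Stand-in for (Shuffle-)WMSA: uniform mixing inside each window, no learned weights."""
    def apply(x: Tokens) -> Tokens:
        out = [0.0] * len(x)
        for w in windows:
            mean = sum(x[i] for i in w) / len(w)
            for i in w:
                out[i] = mean
        return out
    return apply


def residual(f: Module, x: Tokens) -> Tokens:
    """Compute f(x) + x element-wise."""
    return [a + b for a, b in zip(f(x), x)]


def block_pair(z_prev: Tokens, wmsa: Module, shuffle_wmsa: Module,
               nwc: Module, mlp: Module, bn: Module):
    """Two consecutive blocks, following the six equations term by term."""
    x = residual(lambda t: wmsa(bn(t)), z_prev)        # x^l
    y = residual(nwc, x)                               # y^l
    z = residual(lambda t: mlp(bn(t)), y)              # z^l
    x_next = residual(lambda t: shuffle_wmsa(bn(t)), z)  # x^{l+1}
    y_next = residual(nwc, x_next)                     # y^{l+1}
    z_next = residual(lambda t: mlp(bn(t)), y_next)    # z^{l+1}
    return z, z_next


# Toy run: 8 tokens, window size 2, a single active token at position 0.
# BN is the identity and NWC/MLP are zero maps, so only the attention
# stand-ins move information between tokens.
identity: Module = lambda t: list(t)
zero: Module = lambda t: [0.0] * len(t)

z, z_next = block_pair(
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    wmsa=window_mean(regular_windows(8, 2)),
    shuffle_wmsa=window_mean(shuffled_windows(8, 2)),
    nwc=zero, mlp=zero, bn=identity,
)
print(z)
print(z_next)
```

In this toy run, the first block only spreads the active token within its own window (positions 0 and 1), while the second block, using the shuffled grouping, also reaches positions 4 and 5. This is the cross-window connection that alternating WMSA and Shuffle-WMSA is meant to provide.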