Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer (PVT) architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation, thereby reducing the computational and memory overhead. The SRA in Stage $i$ can be formulated as follows:
$$\text{SRA}(Q, K, V) = \text{Concat}\left(\text{head}_0, \ldots, \text{head}_{N_i}\right)W^O$$

$$\text{head}_j = \text{Attention}\left(QW_j^Q, \text{SR}(K)W_j^K, \text{SR}(V)W_j^V\right)$$
where $\text{Concat}(\cdot)$ is the concatenation operation. $W_j^Q \in \mathbb{R}^{C_i \times d_{head}}$, $W_j^K \in \mathbb{R}^{C_i \times d_{head}}$, $W_j^V \in \mathbb{R}^{C_i \times d_{head}}$, and $W^O \in \mathbb{R}^{C_i \times C_i}$ are linear projection parameters. $N_i$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e., $d_{head}$) is equal to $\frac{C_i}{N_i}$. $\text{SR}(\cdot)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$), which is written as:
$$\text{SR}(x) = \text{Norm}\left(\text{Reshape}(x, R_i)W^S\right)$$
Here, $x \in \mathbb{R}^{(H_iW_i) \times C_i}$ represents an input sequence, and $R_i$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}(x, R_i)$ is an operation that reshapes the input sequence $x$ to a sequence of size $\frac{H_iW_i}{R_i^2} \times (R_i^2 C_i)$. $W^S \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ is a linear projection that reduces the dimension of the input sequence to $C_i$. $\text{Norm}(\cdot)$ refers to layer normalization.
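Because only $K$ and $V$ are shortened to $\frac{H_iW_i}{R_i^2}$ tokens while $Q$ keeps all $H_iW_i$ tokens, the quadratic term of the attention cost shrinks by a factor of $R_i^2$. As an illustrative accounting (not given in the source) that counts only the two matrix products $QK^{\top}$ and the attention-weighted sum over $V$, ignoring the linear projections:

$$\Omega(\text{MHA}) \approx 2(H_iW_i)^2 C_i \quad\longrightarrow\quad \Omega(\text{SRA}) \approx \frac{2(H_iW_i)^2 C_i}{R_i^2}$$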
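Below is a minimal PyTorch sketch of one SRA layer. The module name, tensor layout $(B, H_iW_i, C_i)$, and constructor arguments (`dim`, `num_heads`, `sr_ratio` for $C_i$, $N_i$, $R_i$) are illustrative assumptions, and $\text{SR}(\cdot)$ is implemented literally as the reshape-plus-linear projection defined above rather than as any particular codebase does it:

```python
import torch
import torch.nn as nn


class SpatialReductionAttention(nn.Module):
    """One SRA layer for Stage i: dim = C_i, num_heads = N_i, sr_ratio = R_i."""

    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads      # d_head = C_i / N_i
        self.scale = self.head_dim ** -0.5
        self.sr_ratio = sr_ratio

        self.q = nn.Linear(dim, dim)          # all W_j^Q stacked
        self.kv = nn.Linear(dim, 2 * dim)     # all W_j^K and W_j^V stacked
        self.proj = nn.Linear(dim, dim)       # W^O
        if sr_ratio > 1:
            self.sr = nn.Linear(sr_ratio * sr_ratio * dim, dim)  # W^S
            self.norm = nn.LayerNorm(dim)     # Norm(.)

    def spatial_reduce(self, x, H, W):
        """SR(x) = Norm(Reshape(x, R_i) W^S) on a (B, H*W, C) sequence."""
        if self.sr_ratio == 1:
            return x
        B, N, C = x.shape
        r = self.sr_ratio
        # Group each r x r neighbourhood of tokens:
        # (B, H*W, C) -> (B, H*W / r^2, r^2 * C)
        x = x.reshape(B, H // r, r, W // r, r, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // r) * (W // r), r * r * C)
        return self.norm(self.sr(x))          # project r^2*C back to C, then LayerNorm

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence of a stage whose feature map is H x W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        x_sr = self.spatial_reduce(x, H, W)   # keys/values use H*W / R_i^2 tokens
        kv = self.kv(x_sr).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)      # each: (B, heads, N_kv, d_head)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)                 # concatenate heads and apply W^O
```

For example, `SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)` applied to a `(B, 3136, 64)` sequence with `H = W = 56` returns a tensor of the same shape, while each query attends over only $3136 / 8^2 = 49$ key/value tokens.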