The SAGAN Self-Attention Module is a self-attention module used in the Self-Attention GAN architecture for image synthesis. In the module, image features from the previous hidden layer x ∈ R^{C×N} are first transformed into two feature spaces f and g to compute the attention, where f(x) = W_f x and g(x) = W_g x. We then calculate:
β_{j,i} = exp(s_{ij}) / ∑_{i=1}^{N} exp(s_{ij})
where s_{ij} = f(x_i)^T g(x_j)
and β_{j,i} indicates the extent to which the model attends to the ith location when synthesizing the jth region. Here, C is the number of channels and N is the number of feature locations of features from the previous hidden layer. The output of the attention layer is o = (o_1, o_2, …, o_j, …, o_N) ∈ R^{C×N}, where
o_j = v(∑_{i=1}^{N} β_{j,i} h(x_i))
h(x_i) = W_h x_i
v(x_i) = W_v x_i
In the above formulation, W_g ∈ R^{C̄×C}, W_f ∈ R^{C̄×C}, W_h ∈ R^{C̄×C} and W_v ∈ R^{C×C̄} are the learned weight matrices, which are implemented as 1×1 convolutions. The authors choose C̄ = C/8.
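The computation above can be sketched in NumPy. Since a 1×1 convolution over a flattened C×N feature map is just a matrix multiply, the four projections become plain matrices here; the shapes, scale factor, and random initialization are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

C, N = 64, 16 * 16       # channels and number of feature locations (assumed sizes)
C_bar = C // 8           # reduced dimension, C̄ = C/8 as in the paper

x = rng.standard_normal((C, N))  # features from the previous hidden layer

# 1x1 convolutions over a flattened feature map are plain matrix multiplies
W_f = rng.standard_normal((C_bar, C)) * 0.02
W_g = rng.standard_normal((C_bar, C)) * 0.02
W_h = rng.standard_normal((C_bar, C)) * 0.02
W_v = rng.standard_normal((C, C_bar)) * 0.02

f = W_f @ x              # (C̄, N)
g = W_g @ x              # (C̄, N)
h = W_h @ x              # (C̄, N)

s = f.T @ g              # s_ij = f(x_i)^T g(x_j), shape (N, N)

# softmax over i (the attended locations) for each synthesized position j,
# with the usual max-subtraction for numerical stability
beta = np.exp(s - s.max(axis=0, keepdims=True))
beta /= beta.sum(axis=0, keepdims=True)   # beta[i, j] = β_{j,i}

# o_j = v(Σ_i β_{j,i} h(x_i)) for all j at once
o = W_v @ (h @ beta)     # shape (C, N)
print(o.shape)           # (64, 256)
```

Note that each column of `beta` sums to 1, matching the softmax normalization over i in the equation for β_{j,i}.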
In addition, the module further multiplies the output of the attention layer by a scale parameter and adds back the input feature map. Therefore, the final output is given by,
y_i=γo_i+x_i
where γ is a learnable scalar initialized to 0. Introducing γ allows the network to first rely on the cues in the local neighborhood – since this is easier – and then gradually learn to assign more weight to the non-local evidence.