Viet-Anh on Software Logo

What is: SAGAN Self-Attention Module?

SourceSelf-Attention Generative Adversarial Networks
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

The SAGAN Self-Attention Module is a self-attention module used in the Self-Attention GAN architecture for image synthesis. In the module, image features from the previous hidden layer xRCxN\textbf{x} \in \mathbb{R}^{C\text{x}N} are first transformed into two feature spaces f\textbf{f}, g\textbf{g} to calculate the attention, where f(x) = W_fx\textbf{f(x) = W}\_{\textbf{f}}{\textbf{x}}, g(x)=W_gx\textbf{g}(\textbf{x})=\textbf{W}\_{\textbf{g}}\textbf{x}. We then calculate:

βj,i=exp(sij)N_i=1exp(sij)\beta_{j, i} = \frac{\exp\left(s_{ij}\right)}{\sum^{N}\_{i=1}\exp\left(s_{ij}\right)}

where sij=f(x_i)Tg(x_i)\text{where } s_{ij} = \textbf{f}(\textbf{x}\_{i})^{T}\textbf{g}(\textbf{x}\_{i})

and βj,i\beta_{j, i} indicates the extent to which the model attends to the iith location when synthesizing the jjth region. Here, CC is the number of channels and NN is the number of feature locations of features from the previous hidden layer. The output of the attention layer is o=(o_1,o_2,,o_j,,o_N)RCxN\textbf{o} = \left(\textbf{o}\_{\textbf{1}}, \textbf{o}\_{\textbf{2}}, \ldots, \textbf{o}\_{\textbf{j}} , \ldots, \textbf{o}\_{\textbf{N}}\right) \in \mathbb{R}^{C\text{x}N} , where,

o_j=v(N_i=1βj,ih(x_i))\textbf{o}\_{\textbf{j}} = \textbf{v}\left(\sum^{N}\_{i=1}\beta_{j, i}\textbf{h}\left(\textbf{x}\_{\textbf{i}}\right)\right)

h(x_i)=W_hx_i\textbf{h}\left(\textbf{x}\_{\textbf{i}}\right) = \textbf{W}\_{\textbf{h}}\textbf{x}\_{\textbf{i}}

v(x_i)=W_vx_i\textbf{v}\left(\textbf{x}\_{\textbf{i}}\right) = \textbf{W}\_{\textbf{v}}\textbf{x}\_{\textbf{i}}

In the above formulation, W_gRCˉxC\textbf{W}\_{\textbf{g}} \in \mathbb{R}^{\bar{C}\text{x}C}, W_fRCˉxC\mathbf{W}\_{f} \in \mathbb{R}^{\bar{C}\text{x}C}, W_hRCˉxC\textbf{W}\_{\textbf{h}} \in \mathbb{R}^{\bar{C}\text{x}C} and W_vRCxCˉ\textbf{W}\_{\textbf{v}} \in \mathbb{R}^{C\text{x}\bar{C}} are the learned weight matrices, which are implemented as 11×11 convolutions. The authors choose Cˉ=C/8\bar{C} = C/8.

In addition, the module further multiplies the output of the attention layer by a scale parameter and adds back the input feature map. Therefore, the final output is given by,

y_i=γo_i+x_i\textbf{y}\_{\textbf{i}} = \gamma\textbf{o}\_{\textbf{i}} + \textbf{x}\_{\textbf{i}}

where γ\gamma is a learnable scalar and it is initialized as 0. Introducing γ\gamma allows the network to first rely on the cues in the local neighborhood – since this is easier – and then gradually learn to assign more weight to the non-local evidence.