Graph Self-Attention (GSA) is a self-attention module used in the BP-Transformer architecture, and is based on the graph attentional layer.
For a given node u, we update its representation according to its neighbour nodes, formulated as $h_u \leftarrow \text{GSA}(G, h_u)$.
Let $\mathcal{A}(u)$ denote the set of the neighbour nodes of u in G. $\text{GSA}(G, h_u)$ is detailed as follows:
$$A^u = \text{concat}\left(\{h_v \mid v \in \mathcal{A}(u)\}\right)$$

$$Q_i^u = h_u W_i^Q, \quad K_i^u = A^u W_i^K, \quad V_i^u = A^u W_i^V$$

$$\text{head}_i^u = \text{softmax}\left(\frac{Q_i^u {K_i^u}^\top}{\sqrt{d}}\right) V_i^u$$

$$\text{GSA}(G, h_u) = \left[\text{head}_1^u, \dots, \text{head}_h^u\right] W^O$$
where $d$ is the dimension of $h_u$, $W_i^Q$, $W_i^K$ and $W_i^V$ are the trainable projection matrices of the i-th attention head, and $W^O$ is the trainable output projection applied to the concatenated heads.
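To make the update concrete, below is a minimal NumPy sketch of GSA for a single node u under stated assumptions; it is not the authors' implementation, and the parameter names (W_q, W_k, W_v, W_o) and the per-head dimension d_head are introduced here purely for illustration. The query is projected from h_u, the keys and values from the stacked neighbour representations A^u, and each head applies scaled dot-product attention before the heads are concatenated and passed through the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_self_attention(h_u, neighbours, params, num_heads):
    """Multi-head graph self-attention for a single node u (illustrative sketch).

    h_u        : (d,)    representation of node u (the query).
    neighbours : (n, d)  stacked representations h_v for v in A(u), i.e. A^u.
    params     : dict with assumed keys "W_q", "W_k", "W_v" of shape
                 (num_heads, d, d_head) and "W_o" of shape (num_heads * d_head, d).
    """
    d = h_u.shape[0]
    heads = []
    for i in range(num_heads):
        # Project the query node and its neighbour matrix A^u for head i.
        q = h_u @ params["W_q"][i]          # (d_head,)
        k = neighbours @ params["W_k"][i]   # (n, d_head)
        v = neighbours @ params["W_v"][i]   # (n, d_head)
        # Scaled dot-product attention over the neighbour set, scaled by sqrt(d).
        weights = softmax((k @ q) / np.sqrt(d))  # (n,)
        heads.append(weights @ v)                # (d_head,)
    # Concatenate all heads and apply the output projection W^O.
    return np.concatenate(heads) @ params["W_o"]  # (d,)

# Example usage with random parameters (shapes are assumptions for this sketch).
rng = np.random.default_rng(0)
d, d_head, num_heads, n = 16, 4, 4, 5
params = {
    "W_q": rng.normal(size=(num_heads, d, d_head)),
    "W_k": rng.normal(size=(num_heads, d, d_head)),
    "W_v": rng.normal(size=(num_heads, d, d_head)),
    "W_o": rng.normal(size=(num_heads * d_head, d)),
}
h_u = rng.normal(size=d)
A_u = rng.normal(size=(n, d))  # representations of the neighbours in A(u)
h_u_new = graph_self_attention(h_u, A_u, params, num_heads)  # updated h_u
```

In the full BP-Transformer, this per-node update is applied over all nodes of the binary-partitioning graph, so the attention for node u is restricted to its graph neighbours rather than to every token.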