
What is: Feedback Memory?

Source: Addressing Some Limitations of Transformers with Feedback Memory
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Feedback Memory is a type of attention module used in the Feedback Transformer architecture. It allows a transformer to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representations in parallel, but sequentially, token by token. More precisely, the context inputs to the attention modules are replaced with memory vectors computed over the past, i.e.:

$$\mathbf{z}^{l}\_{t} = \text{Attn}\left(\mathbf{x}^{l}\_{t}, \left[\mathbf{m}\_{t-\tau}, \dots, \mathbf{m}\_{t-1}\right]\right)$$

where a memory vector $\mathbf{m}\_{t}$ is computed by summing the representations of each layer at the $t$-th time step:

$$\mathbf{m}\_{t} = \sum^{L}\_{l=0}\text{Softmax}\left(w^{l}\right)\mathbf{x}\_{t}^{l}$$

where $w^{l}$ are learnable scalar parameters and $l = 0$ corresponds to the token embeddings. Weighting the layers with a softmax gives the model flexibility, as it can average them or select a single one. This modification of the self-attention input changes the Transformer's computation from parallel to sequential, as summarized in the figure. It gives the model the ability to form the representation $\mathbf{x}^{l}\_{t+1}$ based on past representations from any layer $l'$, whereas in a standard Transformer this is only possible for $l > l'$. The change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture.
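
To make the memory computation concrete, here is a minimal PyTorch-style sketch of the softmax-weighted sum over layer outputs for one timestep. The function name `memory_vector`, the argument names, and the use of a single weight tensor `w` are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F


def memory_vector(layer_states: list[torch.Tensor], w: torch.Tensor) -> torch.Tensor:
    """Compute m_t = sum_l Softmax(w)_l * x_t^l for a single timestep.

    layer_states: L+1 tensors of shape (d_model,); index 0 is the token
                  embedding, index l is the output of layer l at time t.
    w:            learnable tensor of shape (L+1,), one scalar per layer.
    """
    weights = F.softmax(w, dim=0)               # (L+1,) layer-mixing weights
    stacked = torch.stack(layer_states, dim=0)  # (L+1, d_model)
    return weights @ stacked                    # (d_model,) memory vector m_t
```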
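
The sequential, token-by-token forward pass could then look roughly like the sketch below: each layer at step $t$ attends over the shared memory vectors of earlier steps, and the new memory vector is appended only after all layers have processed the token. The attention helper, the `layer` interface, and the fallback to the current input when the memory is still empty are simplifying assumptions made for illustration.

```python
def simple_attn(query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Minimal single-head dot-product attention: query (d,) over context (k, d)."""
    scores = context @ query / query.shape[-1] ** 0.5  # (k,) scaled dot products
    return F.softmax(scores, dim=0) @ context          # (d,) attended vector


def feedback_forward(token_embeddings, layers, w, tau):
    """Sequential sketch of a Feedback Transformer forward pass.

    token_embeddings: iterable of (d_model,) tensors, one per timestep.
    layers:           list of callables mapping (x, z) -> new state, standing in
                      for the rest of each Transformer block (hypothetical API).
    w:                learnable (L+1,) layer-weight tensor used by memory_vector.
    tau:              memory span, i.e. how many past memory vectors are kept.
    """
    memory, outputs = [], []
    for x_emb in token_embeddings:
        states = [x_emb]                         # x_t^0 is the token embedding
        x = x_emb
        for layer in layers:                     # every layer sees the SAME memory
            # At t = 0 there is no past memory; attend to the current input instead.
            ctx = torch.stack(memory[-tau:]) if memory else x.unsqueeze(0)
            z = simple_attn(x, ctx)              # z_t^l = Attn(x_t^l, [m_{t-tau}..m_{t-1}])
            x = layer(x, z)                      # feed-forward / residual part of the block
            states.append(x)                     # collect x_t^l for the memory vector
        memory.append(memory_vector(states, w))  # expose all layers to future steps
        outputs.append(x)
    return torch.stack(outputs)
```

Note how the outer loop over tokens cannot be parallelized: the memory vector of step $t$ depends on every layer at step $t$, which is exactly the parallel-to-sequential trade-off described above.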