What is: Compressive Transformer?
Source | Compressive Transformers for Long-Range Sequence Modelling |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
The Compressive Transformer is an extension of the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). It applies the same attention mechanism over its regular memories and its compressed memories alike, learning to query both its short-term granular memory and its longer-term coarse memory. It builds on the ideas of Transformer-XL, which maintains a memory of past activations at each layer to preserve a longer history of context; Transformer-XL discards past activations once they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories instead of discarding them, storing them in an additional compressed memory.
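The following is a minimal PyTorch sketch of this dual-memory attention, assuming single-head attention with pre-built projection matrices; the function and argument names (attend_with_memories, mem, c_mem) are illustrative, not from the paper's code. The point is that keys and values are drawn from the concatenation of compressed memory, ordinary memory, and the current activations, while queries come only from the current segment.

```python
import torch
import torch.nn.functional as F

def attend_with_memories(h, mem, c_mem, w_q, w_k, w_v):
    # h:     current hidden activations, shape (seq, d)
    # mem:   short-term granular memory,  shape (n_mem, d)
    # c_mem: long-term coarse (compressed) memory, shape (n_cmem, d)
    # Keys and values range over the full history:
    # coarse first, then granular, then the current segment.
    kv = torch.cat([c_mem, mem, h], dim=0)
    q = h @ w_q  # queries come only from the current segment
    k, v = kv @ w_k, kv @ w_v
    scores = (q @ k.t()) / (k.shape[-1] ** 0.5)
    # A causal mask over the current-segment positions is omitted for brevity.
    return F.softmax(scores, dim=-1) @ v
```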
At each time step, the oldest compressed memories are discarded (FIFO) and the oldest states in ordinary memory are compressed and shifted into the newly freed slots of the compressed memory. During training, the compression function is optimized separately from the main language model, in its own training loop with an auxiliary reconstruction loss rather than with gradients from the task loss. A sketch of the update step follows below.
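Below is a minimal sketch of that update step, assuming a strided 1D convolution as the compression function (one of several options considered in the paper) and a compression rate c, so that c old memory slots map to one compressed slot. Class and attribute names (CompressiveMemory, n_mem, n_cmem) are hypothetical.

```python
import torch
import torch.nn as nn

class CompressiveMemory(nn.Module):
    """Per-layer memory update for a Compressive Transformer (sketch).
    Assumes the segment length is <= n_mem and divisible by the rate c."""

    def __init__(self, d_model, n_mem, n_cmem, c=3):
        super().__init__()
        # Compression function: maps c old memory slots to 1 compressed slot.
        self.compress = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)
        self.register_buffer("mem", torch.zeros(n_mem, d_model))
        self.register_buffer("c_mem", torch.zeros(n_cmem, d_model))

    def update(self, h):
        """h: new hidden activations for this segment, shape (seq, d_model)."""
        seq = h.shape[0]
        # Oldest ordinary memories are pushed out by the new activations...
        old, kept = self.mem[:seq], self.mem[seq:]
        self.mem = torch.cat([kept, h.detach()], dim=0)
        # ...compressed at rate c (seq slots -> seq // c slots)...
        new_cmem = self.compress(old.t().unsqueeze(0)).squeeze(0).t().detach()
        # ...and appended to compressed memory, evicting its oldest slots (FIFO).
        # Training the compression function itself (e.g., with an
        # attention-reconstruction loss) is not shown here.
        self.c_mem = torch.cat([self.c_mem[new_cmem.shape[0]:], new_cmem], dim=0)
        return self.mem, self.c_mem
```

With n_mem = 512, n_cmem = 512, and c = 3, each layer can attend over roughly 512 + 3 × 512 = 2048 past steps while storing only 1024 memory vectors.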