What is: Sparse Transformer?
Source | Generating Long Sequences with Sparse Transformers |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
A Sparse Transformer is a Transformer-based architecture which utilises sparse factorizations of the attention matrix to reduce the time and memory cost of attention from O(n^2) to O(n*sqrt(n)). Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels which efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backwards pass to reduce memory usage.
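To make the factorized attention idea concrete, below is a minimal NumPy sketch of one of the patterns described in the paper: a strided mask in which each query attends to its recent neighbours plus every stride-th earlier position, so each row touches roughly O(sqrt(n)) keys when the stride is sqrt(n). The function names and the dense-mask formulation are illustrative (the paper's actual kernels compute only the sparse subsets directly rather than masking a dense matrix).

```python
import numpy as np

def strided_attention_mask(n, stride):
    # mask[i, j] is True when query i may attend to key j.
    # Strided pattern (illustrative): causal, with each position
    # attending to the previous `stride` positions (local) and to
    # every `stride`-th earlier position (summary).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride            # recent positions
    summary = (i - j) % stride == 0     # every stride-th position
    return causal & (local | summary)

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention; disallowed positions
    # are set to -inf before the softmax so they receive zero weight.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, stride = 16, 4
mask = strided_attention_mask(n, stride)
rng = np.random.default_rng(0)
q = rng.standard_normal((n, 8))
k = rng.standard_normal((n, 8))
v = rng.standard_normal((n, 8))
out = masked_attention(q, k, v, mask)
```

Choosing stride close to sqrt(n) is what yields the O(n*sqrt(n)) total cost: each of the n rows attends to about 2*sqrt(n) positions instead of all n.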