What is: Patch Merger Module?
Source | Learning to Merge Tokens in Vision Transformers |
Year | 2000 |
Data Source | CC BY-SA - https://paperswithcode.com |
PatchMerger is a module for Vision Transformers that decreases the number of tokens/patches passed onto each individual transformer encoder block whilst maintaining performance and reducing compute. PatchMerger takes linearly transforms an input of shape N patches × D dimensions through a learnable weight matrix of shape M output patches × D. This generates M scores, in which a Softmax function is applied for each score. The resulting output has a shape of M × N, which is multiplied to the original input to get an output of shape M × D.
Mathematically,
Image and formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.