What is: TimeSformer?
Source | Is Space-Time Attention All You Need for Video Understanding? |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that can be fed to a Transformer encoder, analogous to the token features computed from words in NLP.
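The tokenization step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size, embedding dimension, and the random matrices standing in for the learned projection and positional embeddings are all assumptions chosen for clarity.

```python
import numpy as np

def video_to_tokens(video, patch=16, dim=8, seed=0):
    """Split a (T, H, W, C) video into non-overlapping frame-level patches,
    linearly embed each one, and add positional information.
    Random matrices stand in for learned parameters (illustrative only)."""
    T, H, W, C = video.shape
    rng = np.random.default_rng(seed)
    # Rearrange the video into one flattened vector per P x P x C patch.
    patches = (video
               .reshape(T, H // patch, patch, W // patch, patch, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(T * (H // patch) * (W // patch), patch * patch * C))
    E = rng.standard_normal((patch * patch * C, dim))   # stand-in for the learned linear map
    pos = rng.standard_normal((patches.shape[0], dim))  # stand-in for positional embeddings
    return patches @ E + pos  # one token per space-time patch

video = np.zeros((8, 224, 224, 3))  # 8 frames of 224x224 RGB
tokens = video_to_tokens(video)
print(tokens.shape)                 # (1568, 8): 8 frames x 14 x 14 patches
```

The resulting token sequence is what the Transformer encoder attends over; TimeSformer's contribution lies in how self-attention over these tokens is factorized across space and time.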