What is: TimeSformer?

Source: Is Space-Time Attention All You Need for Video Understanding?
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vector
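
To make the description above concrete, here is a minimal NumPy sketch of the two ideas it names: turning a video into a sequence of linearly embedded frame-level patches with positional information, and applying self-attention across the whole space-time volume. It is an illustration under stated assumptions, not the authors' implementation. The function names (`extract_patches`, `embed_patches`, `joint_space_time_attention`), the tiny sizes, and the randomly initialised weights that stand in for learned parameters are all hypothetical. The attention shown is plain single-head attention over every space-time token, which is only one simple way to attend over the 3D volume.

```python
import numpy as np


def extract_patches(video, patch_size):
    """Split every frame into non-overlapping square patches.

    video: array of shape (T, H, W, C).
    Returns an array of shape (T, N, P*P*C) with N = (H // P) * (W // P).
    """
    T, H, W, C = video.shape
    P = patch_size
    if H % P or W % P:
        raise ValueError("frame height and width must be divisible by patch_size")
    nh, nw = H // P, W // P
    # (T, nh, P, nw, P, C) -> (T, nh, nw, P, P, C) -> (T, N, P*P*C)
    patches = video.reshape(T, nh, P, nw, P, C).transpose(0, 1, 3, 2, 4, 5)
    return patches.reshape(T, nh * nw, P * P * C)


def embed_patches(patches, embed_dim, rng):
    """Linearly map each patch to an embedding and add positional information.

    The weights are random stand-ins for parameters that would be learned.
    Returns a flat token sequence of shape (T * N, embed_dim).
    """
    T, N, patch_dim = patches.shape
    w_embed = rng.normal(scale=0.02, size=(patch_dim, embed_dim))
    pos_embed = rng.normal(scale=0.02, size=(T * N, embed_dim))
    return patches.reshape(T * N, patch_dim) @ w_embed + pos_embed


def joint_space_time_attention(tokens, rng):
    """Single-head self-attention in which every token attends to all
    tokens from all frames (an assumed, simplified attention scheme)."""
    n, d = tokens.shape
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
video = rng.random((4, 32, 32, 3))  # 4 frames of 32x32 RGB, toy data
patches = extract_patches(video, patch_size=16)
tokens = embed_patches(patches, embed_dim=8, rng=rng)
out, weights = joint_space_time_attention(tokens, rng)
print(patches.shape, tokens.shape, out.shape, weights.shape)
# (4, 4, 768) (16, 8) (16, 8) (16, 16)
```

With 4 frames and 4 patches per frame, the sequence has 16 tokens, and the 16 x 16 attention matrix shows that each patch can draw on patches from every frame, not just its own. That is the sense in which attention moves from image space to the space-time volume in this sketch.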