What is: Temporal Adaptive Module?
Source | TAM: Temporal Adaptive Module for Video Recognition |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR.
TAM has two branches: a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C \times T \times H \times W}$, global spatial average pooling (GAP) is first applied to the feature map to ensure TAM has a low computational cost. The local branch then employs several 1D convolutions with ReLU nonlinearity across the temporal domain to produce location-sensitive importance maps for enhancing frame-wise features: \begin{align} s &= \sigma(\text{Conv1D}(\delta(\text{Conv1D}(\text{GAP}(X))))) \end{align} \begin{align} X^1 &= s \odot X \end{align} where $\sigma$ is the sigmoid function and $\delta$ the ReLU. Unlike the local branch, the global branch is location-invariant and focuses on generating a channel-wise adaptive kernel based on the global temporal information in each channel. For the $c$-th channel, the kernel can be written as
\begin{align} \Theta_c = \text{Softmax}(\text{FC}_2(\delta(\text{FC}_1(\text{GAP}(X)_c)))) \end{align}
where $\Theta_c \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^1$ along the temporal dimension: \begin{align} Y = \Theta \otimes X^1 \end{align}
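The two branches translate directly into a small module. The following PyTorch sketch is an illustrative reimplementation rather than the authors' code: the reduction ratio in the local branch, the expansion factor `beta` in the global branch, and the grouped-convolution trick for applying a different kernel to each (sample, channel) pair are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAM(nn.Module):
    """Sketch of the Temporal Adaptive Module for clips of shape
    (N, C, T, H, W). `reduction` and `beta` are assumed hyper-parameters."""

    def __init__(self, channels, num_segments, kernel_size=3, reduction=4, beta=2):
        super().__init__()
        assert kernel_size % 2 == 1, "odd K keeps the temporal length unchanged"
        self.K = kernel_size
        # Local branch: s = sigmoid(Conv1D(ReLU(Conv1D(GAP(X)))))
        self.local_branch = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Global branch: Theta_c = softmax(FC2(ReLU(FC1(GAP(X)_c))))
        self.global_branch = nn.Sequential(
            nn.Linear(num_segments, num_segments * beta),
            nn.ReLU(inplace=True),
            nn.Linear(num_segments * beta, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        n, c, t, h, w = x.shape
        z = x.mean(dim=(3, 4))                 # spatial GAP -> (N, C, T)
        s = self.local_branch(z)               # importance map (N, C, T)
        x1 = x * s.view(n, c, t, 1, 1)         # X^1 = s . X
        theta = self.global_branch(z)          # adaptive kernels (N, C, K)
        # Y = Theta (*) X^1: one temporal kernel per (sample, channel),
        # applied at every spatial position via a grouped convolution.
        kernels = theta.reshape(n * c, 1, self.K, 1)
        x1 = x1.reshape(1, n * c, t, h * w)
        y = F.conv2d(x1, kernels, padding=(self.K // 2, 0), groups=n * c)
        return y.view(n, c, t, h, w)
```

Because each $\Theta_c$ comes out of a softmax, the adaptive kernel is a normalized temporal aggregation weight, so $Y$ stays on the same scale as $X^1$.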
With the help of the local and global branches, TAM can capture complex temporal structures in video and enhance per-frame features at low computational cost. Due to its flexibility and lightweight design, TAM can be added to any existing 2D CNN.
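To make that last claim concrete, the sketch below (reusing the `TAM` class above) wraps TAM around the residual blocks of a 2D ResNet; the wrapper name `TAMBlock`, the choice of a torchvision ResNet-50 as host network, and the 8-frame clip length are hypothetical.

```python
import torch.nn as nn
import torchvision

class TAMBlock(nn.Module):
    """Applies TAM in front of an existing 2D residual block. Assumes the
    clip's frames are folded into the batch axis as (N*T, C, H, W)."""

    def __init__(self, block, channels, num_segments):
        super().__init__()
        self.block = block
        self.tam = TAM(channels, num_segments)
        self.t = num_segments

    def forward(self, x):                      # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        v = x.view(nt // self.t, self.t, c, h, w).permute(0, 2, 1, 3, 4)
        v = self.tam(v)                        # temporal modelling
        v = v.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)
        return self.block(v)

# Hypothetical host network: a torchvision ResNet-50 fed 8-frame clips.
net = torchvision.models.resnet50(weights=None)
for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
    for i, blk in enumerate(stage):
        stage[i] = TAMBlock(blk, blk.conv1.in_channels, num_segments=8)
```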