What is: Spatio-Temporal Attention LSTM?
Source | An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data |
Year | 2016 |
Data Source | CC BY-SA - https://paperswithcode.com |
In human action recognition, each type of action generally depends on only a few specific kinematic joints. Furthermore, multiple actions may be performed over time. Motivated by these observations, Song et al. proposed a joint spatial and temporal attention network based on LSTM to adaptively find discriminative features and keyframes. Its main attention-related components are a spatial attention sub-network, which selects important regions, and a temporal attention sub-network, which selects keyframes. The spatial attention sub-network can be written as:
\begin{align} s_{t} &= U_{s}\tanh(W_{xs}X_{t} + W_{hs}h_{t-1}^{s} + b_{si}) + b_{so} \end{align}
\begin{align} \alpha_{t} &= \text{Softmax}(s_{t}) \end{align}
\begin{align} Y_{t} &= \alpha_{t} X_{t} \end{align}
where $X_{t}$ is the input feature at time $t$, $U_{s}$, $W_{xs}$, $W_{hs}$, $b_{si}$, and $b_{so}$ are learnable parameters, and $h_{t-1}^{s}$ is the hidden state at step $t-1$. Because the scores depend on the hidden state, the attention process takes temporal relationships into consideration.
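The three spatial attention equations can be traced step by step in a few lines of code. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the input at one time step is a flat vector with one scalar per attended element, so the product of the attention weights and the input is elementwise. The helper names (`matvec`, `softmax`, `spatial_attention`, `rand_matrix`), the dimensions, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(s):
    """Numerically stable softmax over a list of scores."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(x_t, h_prev, U_s, W_xs, W_hs, b_si, b_so):
    """Return (alpha_t, Y_t) for a single time step.

    Assumed shapes (illustrative): x_t has K entries, h_prev has H entries,
    W_xs is D x K, W_hs is D x H, b_si has D entries,
    U_s is K x D, b_so has K entries.
    """
    # Inner term: tanh(W_xs X_t + W_hs h_{t-1}^s + b_si)
    from_input = matvec(W_xs, x_t)
    from_state = matvec(W_hs, h_prev)
    inner = [math.tanh(a + b + c)
             for a, b, c in zip(from_input, from_state, b_si)]
    # Scores: s_t = U_s * inner + b_so
    s_t = [u + c for u, c in zip(matvec(U_s, inner), b_so)]
    # Attention weights: alpha_t = Softmax(s_t)
    alpha_t = softmax(s_t)
    # Reweighted input: Y_t = alpha_t * X_t (elementwise, by assumption)
    y_t = [a * x for a, x in zip(alpha_t, x_t)]
    return alpha_t, y_t


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    K, H, D = 4, 3, 5  # illustrative sizes only
    x_t = [rng.uniform(-1, 1) for _ in range(K)]
    h_prev = [rng.uniform(-1, 1) for _ in range(H)]
    alpha, y = spatial_attention(
        x_t, h_prev,
        U_s=rand_matrix(K, D, rng),
        W_xs=rand_matrix(D, K, rng),
        W_hs=rand_matrix(D, H, rng),
        b_si=[0.0] * D,
        b_so=[0.0] * K,
    )
    print("attention weights:", alpha)
    print("reweighted input:", y)
```

Because of the Softmax, the weights are positive and sum to one, so the sub-network redistributes emphasis across the input elements rather than changing the overall scale freely.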
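The temporal attention sub-network is similar to the spatial branch and produces its attention map using:
\begin{align} \beta_{t} = \delta(W_{xp}X_{t} + W_{hp}h_{t-1}^{p} + b_{p}) \end{align}
where $\delta$ is a ReLU function, adopted instead of a normalization function such as Softmax for ease of optimization, $W_{xp}$, $W_{hp}$, and $b_{p}$ are learnable parameters, and $h_{t-1}^{p}$ is the hidden state of the temporal branch at step $t-1$. The model is also trained with a regularized objective function to improve convergence.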
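The temporal attention formula can be illustrated in the same way. This is a minimal sketch under stated assumptions, not the authors' code: it treats the temporal weight as a single scalar per frame, so the two weight matrices reduce to row vectors (`w_xp`, `w_hp`) and the bias `b_p` to a scalar. The function name `temporal_attention` and the numbers in the example are hypothetical.

```python
def temporal_attention(x_t, h_prev, w_xp, w_hp, b_p):
    """Return the scalar frame weight beta_t = ReLU(w_xp.x_t + w_hp.h_prev + b_p).

    Assumed shapes (illustrative): x_t and w_xp have K entries,
    h_prev and w_hp have H entries, b_p is a scalar.
    """
    pre_activation = (
        sum(w * x for w, x in zip(w_xp, x_t))
        + sum(w * h for w, h in zip(w_hp, h_prev))
        + b_p
    )
    # ReLU: negative pre-activations are clipped to zero, positive ones pass through.
    return max(0.0, pre_activation)


if __name__ == "__main__":
    x_t = [1.0, 2.0]      # illustrative input feature
    h_prev = [0.5]        # illustrative hidden state of the temporal branch
    w_xp = [0.5, -0.25]
    w_hp = [2.0]
    # 0.5*1.0 - 0.25*2.0 + 2.0*0.5 + 0.1 = 1.1
    print(temporal_attention(x_t, h_prev, w_xp, w_hp, b_p=0.1))
    # With a strongly negative bias the pre-activation is -0.9, so the frame weight is 0.
    print(temporal_attention(x_t, h_prev, w_xp, w_hp, b_p=-2.0))
```

Unlike the Softmax in the spatial branch, the ReLU output is not normalized: frame weights are non-negative but need not sum to one across time, and a frame can receive a weight of exactly zero.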
Overall, this paper presents a joint spatiotemporal attention method to focus on important joints and keyframes, with excellent results on the action recognition task.