
What is: Spatio-Temporal Attention LSTM?

Source: An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

In human action recognition, each type of action generally depends on only a few specific kinematic joints. Furthermore, multiple actions may be performed over time. Motivated by these observations, Song et al. proposed a joint spatial and temporal attention network based on LSTM to adaptively find discriminative features and key frames. Its main attention-related components are a spatial attention sub-network, which selects important regions, and a temporal attention sub-network, which selects key frames. The spatial attention sub-network can be written as:

\begin{align} s_{t} &= U_{s}\tanh(W_{xs}X_{t} + W_{hs}h_{t-1}^{s} + b_{si}) + b_{so} \end{align}
\begin{align} \alpha_{t} &= \text{Softmax}(s_{t}) \end{align}
\begin{align} Y_{t} &= \alpha_{t} X_{t} \end{align}

where $X_{t}$ is the input feature at time $t$; $U_{s}$, $W_{xs}$, $W_{hs}$, $b_{si}$, and $b_{so}$ are learnable parameters; and $h_{t-1}^{s}$ is the hidden state at step $t-1$. Because the scores depend on the hidden state $h$, the attention process takes temporal relationships into account.
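To make the spatial attention equations above concrete, here is a minimal sketch of a single time step in plain Python. Several details are assumptions made only for illustration and are not stated in the text: the input $X_{t}$ is laid out as K joints with a few coordinates each, one attention weight is produced per joint, the dimensions and random weights are arbitrary, and the hidden state is simply set to zero rather than produced by an LSTM. The helper names (`matvec`, `random_matrix`, `spatial_attention`) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(s):
    """Numerically stable softmax over a list of scores."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(X_t, h_prev, p):
    """One step of the spatial attention equations.

    X_t    : K joints, each a list of coordinates (assumed layout)
    h_prev : hidden state h_{t-1}^s of the attention branch
    p      : dict holding U_s, W_xs, W_hs, b_si, b_so
    """
    x_flat = [c for joint in X_t for c in joint]
    # Inner term: W_xs X_t + W_hs h_{t-1}^s + b_si
    pre = [a + b + c for a, b, c in zip(matvec(p["W_xs"], x_flat),
                                        matvec(p["W_hs"], h_prev),
                                        p["b_si"])]
    hidden = [math.tanh(v) for v in pre]
    # s_t = U_s tanh(...) + b_so, one score per joint (assumption)
    s_t = [u + b for u, b in zip(matvec(p["U_s"], hidden), p["b_so"])]
    alpha_t = softmax(s_t)
    # Y_t = alpha_t X_t, read here as scaling each joint by its weight
    Y_t = [[a * c for c in joint] for a, joint in zip(alpha_t, X_t)]
    return alpha_t, Y_t

def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
K, C, H, D = 5, 3, 4, 6  # joints, coords per joint, hidden size, attention size
params = {
    "W_xs": random_matrix(D, K * C, rng),
    "W_hs": random_matrix(D, H, rng),
    "b_si": [0.0] * D,
    "U_s": random_matrix(K, D, rng),
    "b_so": [0.0] * K,
}
X_t = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(K)]
h_prev = [0.0] * H  # placeholder; a real model would take this from the LSTM
alpha_t, Y_t = spatial_attention(X_t, h_prev, params)
```

Because of the Softmax, the weights in `alpha_t` are positive and sum to one, so the joints compete for attention within each frame.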

The temporal attention sub-network is similar to the spatial branch and produces its attention map using:

\begin{align} \beta_{t} &= \delta(W_{xp}X_{t} + W_{hp}h_{t-1}^{p} + b_{p}) \end{align}

Here $\delta$ is a ReLU function, used instead of a normalization function (such as Softmax) for ease of optimization, and $h_{t-1}^{p}$ is the hidden state of the temporal branch at step $t-1$. The model is also trained with a regularized objective function to improve convergence.
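The temporal attention formula can be sketched in the same way. The example below assumes, purely for illustration, that $\beta_{t}$ is a single scalar per frame, so $W_{xp}$ and $W_{hp}$ are plain vectors. The weights, frames, and hidden states are made-up values, not outputs of a trained LSTM, and the function names are hypothetical.

```python
def relu(v):
    """ReLU: keep positive values, clip negatives to zero."""
    return max(0.0, v)

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(x_t, h_prev, W_xp, W_hp, b_p):
    """beta_t = ReLU(W_xp X_t + W_hp h_{t-1}^p + b_p), as a scalar gate."""
    return relu(dot(W_xp, x_t) + dot(W_hp, h_prev) + b_p)

# Hand-picked illustrative parameters (not learned values)
W_xp = [0.5, -1.0, 0.25]
W_hp = [1.0, 1.0]
b_p = 0.1

# Three frames of a toy input sequence and placeholder hidden states
frames = [[1.0, 0.0, 2.0], [0.0, 2.0, 0.0], [2.0, 0.5, 0.0]]
hidden_states = [[0.0, 0.0], [0.1, -0.1], [0.2, 0.3]]

betas = [temporal_attention(x, h, W_xp, W_hp, b_p)
         for x, h in zip(frames, hidden_states)]
```

Unlike the Softmax weights of the spatial branch, these values are only constrained to be non-negative: a frame can be switched off entirely (a weight of zero), and the weights across frames need not sum to one.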

Overall, this paper presents a joint spatio-temporal attention method that focuses on important joints and key frames, and it reports excellent results on the action recognition task.