LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING
Abstract
The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Video understanding has therefore been one of the fundamental research topics in computer vision. Encouraged by the success of deep neural networks on image classification, many efforts have been made in recent years to extend deep networks to video understanding. However, new challenges arise when the temporal characteristics of videos are taken into account. In this dissertation, we study two long-standing problems that play important roles in effective temporal modeling in videos: (1) how to extract motion information from raw video frames, and (2) how to capture long-range dependencies in time and model their temporal dynamics.
To address the above issues, we first introduce hierarchical contrastive motion learning, a novel self-supervised learning framework that extracts effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features, from low-level pixel movements to higher-level semantic dynamics, in a fully self-supervised manner.

Next, we investigate the self-attention mechanism for long-range temporal modeling and show that entangled modeling of spatio-temporal information fails to capture temporal relationships among frames explicitly. To this end, we propose Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix that encodes temporal structures generalizing across different samples.
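As a rough illustration of this decoupled design, the sketch below (in PyTorch) applies a single learned T x T attention matrix along the temporal axis of per-frame spatial tokens; the module name, tensor layout, and output projection are illustrative assumptions, not the dissertation's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalTemporalAttention(nn.Module):
        # Decoupled temporal attention: a single T x T matrix is learned as a
        # model parameter and shared across all samples, instead of being
        # computed per instance from queries and keys.
        def __init__(self, num_frames, dim):
            super().__init__()
            self.temporal_attn = nn.Parameter(torch.zeros(num_frames, num_frames))
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            # x: (batch, T, N, dim) -- per-frame spatial tokens after spatial attention.
            attn = F.softmax(self.temporal_attn, dim=-1)   # (T, T), sample-agnostic
            x = torch.einsum('st,btnd->bsnd', attn, x)     # mix information across frames
            return self.proj(x)

    # Usage: 8-frame clip, 196 spatial tokens per frame, 512-dim features.
    gta = GlobalTemporalAttention(num_frames=8, dim=512)
    print(gta(torch.randn(2, 8, 196, 512)).shape)  # torch.Size([2, 8, 196, 512])

Because the attention matrix is a learned parameter rather than a function of the input, it captures temporal structure shared across videos instead of re-estimating it for every sample.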
While the aforementioned methods significantly improve the performance of video action recognition, they are still restricted to modeling temporal information within short clips. To overcome this limitation, we introduce a collaborative memory mechanism that encodes information across multiple clips sampled from a video at each training iteration. The proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead.

Finally, we present a spatio-temporal progressive learning framework (STEP) for spatio-temporal action detection. Our approach performs a multi-step optimization process that progressively refines initial proposals towards the final solution. In this way, it can effectively exploit long-term temporal information while handling the spatial displacement problem in long action tubes.
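To make the multi-clip idea concrete, the following sketch pools the features of several clips from the same video into a shared video-level memory and feeds it back to refine each clip's prediction; the mean-pooled memory, concatenation-based fusion, and all names are simplifying assumptions rather than the collaborative memory design described in the dissertation.

    import torch
    import torch.nn as nn

    class CollaborativeMemory(nn.Module):
        # Pools clip-level features into a video-level memory and feeds the
        # memory back to refine every clip before classification.
        def __init__(self, dim, num_classes):
            super().__init__()
            self.fuse = nn.Linear(2 * dim, dim)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, clip_feats):
            # clip_feats: (batch, num_clips, dim) -- features of clips sampled from one video.
            memory = clip_feats.mean(dim=1, keepdim=True).expand_as(clip_feats)
            refined = self.fuse(torch.cat([clip_feats, memory], dim=-1))
            return self.classifier(refined).mean(dim=1)    # aggregate clip-level predictions

    # Usage: 4 clips per video, 512-dim clip features, 400 action classes.
    cm = CollaborativeMemory(dim=512, num_classes=400)
    print(cm(torch.randn(2, 4, 512)).shape)  # torch.Size([2, 400])

The key point is that every clip's prediction is conditioned on context gathered from the other clips of the same video, extending the effective temporal receptive field beyond a single short clip.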