Towards Effective Temporal Modeling for Video Understanding

Date

2023

Abstract

The exponential growth of video content has created a pressing need for advanced systems capable of autonomously analyzing and interpreting video data. Video understanding has become a fundamental research topic in computer vision, focusing on the effective extraction and analysis of information from videos. Compared to the image modality, the video modality additionally carries temporal dependencies, which provide crucial clues for understanding what happens across time. Effectively modeling these temporal relationships is therefore of vital importance for video understanding research.

In this thesis, we aim to advance the field of temporal modeling, making video understanding systems more reliable, accurate, and flexible for various applications. In the first part, we introduce three strategies for modeling temporal dependencies across different downstream tasks and scenarios: action recognition, temporal action localization, and video summarization. In the second part, we present a comprehensive large multimodal model for video understanding, built on recent advanced large language models, which handles a variety of video understanding tasks within a unified and integrated framework.

Specifically, we first propose a Global Temporal Attention (GTA) network, which models global temporal relationships through decoupled spatial and temporal attention operations. This design significantly enhances the model's ability to recognize actions in videos while reducing computational cost. However, raw videos in the real world are untrimmed and contain many background frames; before action labels can be correctly recognized and classified, the actions of interest must first be detected and localized in long untrimmed videos. We therefore introduce the ASM-Loc framework, which effectively localizes actions of interest under the weakly-supervised setting and significantly reduces the need for labor-intensive manual labeling. Finally, given that real-world data often comprise multiple modalities, such as video and text, we present A2Summ, which tackles the challenge of jointly summarizing long video and text sequences with temporal correspondence, providing a solution for summarizing multimodal data.
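To illustrate the decoupled spatial-temporal attention idea at a high level, a minimal PyTorch-style sketch is given below. It is not the thesis's actual GTA implementation; the class name, tensor shapes, and layer choices are illustrative assumptions.

# A minimal sketch of decoupled (factorized) spatial and temporal self-attention,
# in the spirit of the decoupling described above. NOT the thesis's GTA code;
# shapes, layers, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class DecoupledSpaceTimeAttention(nn.Module):
    """Spatial attention within each frame, then temporal attention across frames,
    instead of one joint (and more expensive) space-time attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) = (batch, frames, spatial tokens per frame, channels)
        B, T, N, C = x.shape

        # Spatial attention: tokens attend only within their own frame.
        s = self.norm1(x.reshape(B * T, N, C))
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, N, C)

        # Temporal attention: each spatial location attends across all frames,
        # capturing global temporal relationships.
        t = self.norm2(x.permute(0, 2, 1, 3).reshape(B * N, T, C))
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    video_tokens = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 patches, 256-dim
    out = DecoupledSpaceTimeAttention(dim=256)(video_tokens)
    print(out.shape)  # torch.Size([2, 8, 49, 256])

As a general property of such factorization, the attention cost drops from O((T·N)^2) for joint space-time attention to roughly O(T·N^2 + N·T^2), which is consistent with the reduced computational cost noted above.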

In the first part of this thesis, we focus on developing specialized models for individual video understanding tasks. Each model is designed for a particular task, which limits its ability to generalize to other areas and makes it less practical for diverse real-world applications. To address this limitation, in the second part of the thesis we present a unified large multimodal model capable of handling multiple video understanding tasks. Built on powerful large language models, it adapts to a wide range of video understanding tasks, including video classification, online action prediction, video question answering, and video captioning. This approach offers a more versatile and general solution, significantly enhancing the applicability of our models in real-world video understanding scenarios.
