TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

Advisor

Manocha, Dinesh
Lin, Ming

Abstract

“If a picture is worth a thousand words, what is a video worth?” Video, owing to its inherent richness and efficiency compared to language, plays a pivotal role in conveying complex information. However, video understanding faces numerous challenges, including selecting informative frames, handling domain shifts, semantic grounding, deficits in reasoning and attention, and significant computational burdens. Recent advances in computer vision underscore the need to address these challenges with effective and efficient approaches, which are crucial for applications ranging from autonomous systems to human-computer interaction that demand high accuracy and low latency. In this dissertation, we address these challenges along five critical fronts: dataset development, preprocessing, visual reasoning, multimodal alignment, and computational acceleration.

High-quality datasets serve as foundational building blocks, providing diverse, comprehensive, and representative data to train models capable of handling real-world complexity. In this dissertation, we propose METEOR, a dataset tailored for autonomous driving applications in dense, heterogeneous, and unstructured traffic scenarios with rare and challenging conditions. Additionally, we develop DAVE, a comprehensive benchmark designed to advance video understanding research for the safety of vulnerable road users in complex and unpredictable environments. Our analysis reveals substantial shortcomings of current object detection and behavior prediction models when evaluated on METEOR and DAVE.

Complementing these datasets, for preprocessing we propose AZTR, which incorporates an automatic zooming algorithm for dynamic target scaling and a temporal reasoning mechanism to accurately capture action sequences. Furthermore, we introduce MITFAS, a mutual-information-based alignment and sampling method designed to address challenges inherent to UAV video action recognition, including varying human resolutions, large positional changes between frames, and occluded action features.
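To make the mutual-information idea concrete, the sketch below (a simplified illustration, not the MITFAS implementation) scores candidate crops in the next frame by their estimated mutual information with a reference crop around the actor and keeps the best-aligned one; the crop sizes and histogram bin count are assumptions.

```python
# Minimal sketch of mutual-information-based frame alignment (illustrative only).
import numpy as np

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 32) -> float:
    """Estimate MI between two same-sized grayscale patches via a joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def align_next_frame(ref_crop: np.ndarray, candidate_crops: list) -> int:
    """Return the index of the candidate crop best aligned with the reference."""
    scores = [mutual_information(ref_crop, c) for c in candidate_crops]
    return int(np.argmax(scores))

# Toy usage with random patches standing in for actor-centered crops.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64)).astype(np.float32)
cands = [ref + rng.normal(0, s, ref.shape) for s in (5.0, 30.0, 80.0)]
print(align_next_frame(ref, cands))  # expected: 0, the least-perturbed candidate
```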

For visual reasoning, we introduce SCP, which guides the model to explicitly learn input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge, effectively capturing discriminative patterns and significantly improving accuracy on challenging datasets. We also develop ICAR, a compatibility learning framework with a novel category-aware Flexible Bidirectional Transformer (FBT) that generates features across domains based on visual similarity and complementarity for reasoning tasks.
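As an illustration of the two kinds of prompt knowledge, the toy module below (an assumption, not the SCP architecture) prepends a shared pool of learnable prompt experts together with prompts projected from each input's pooled features to the visual token sequence; the layer sizes and pooling choice are hypothetical.

```python
# Toy composition of input-invariant and input-specific prompts (illustrative only).
import torch
import torch.nn as nn

class PromptComposer(nn.Module):
    def __init__(self, dim: int = 512, n_experts: int = 8, n_specific: int = 4):
        super().__init__()
        # Input-invariant prompt experts, learned once and shared by every input.
        self.experts = nn.Parameter(torch.randn(n_experts, dim) * 0.02)
        # Input-specific prompts, generated from the pooled input feature.
        self.to_specific = nn.Linear(dim, n_specific * dim)
        self.n_specific, self.dim = n_specific, dim

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) visual tokens from a frozen backbone.
        b = tokens.size(0)
        pooled = tokens.mean(dim=1)                                   # (b, dim)
        specific = self.to_specific(pooled).view(b, self.n_specific, self.dim)
        experts = self.experts.unsqueeze(0).expand(b, -1, -1)         # (b, n_experts, dim)
        return torch.cat([experts, specific, tokens], dim=1)          # prepend prompts

tokens = torch.randn(2, 16, 512)
print(PromptComposer()(tokens).shape)  # torch.Size([2, 28, 512])
```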

For multimodal alignment, we propose ViLA, which addresses efficient frame sampling and effective cross-modal alignment in a unified framework. Finally, we propose Bi-VLM, which explores ultra-low-precision post-training quantization to bridge the gap between computational demands and practical hardware limitations. Bi-VLM employs a saliency-aware hybrid quantization algorithm combined with a non-uniform model weight partition strategy, significantly reducing computational cost without substantially compromising overall model performance.
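The sketch below conveys the general flavor of saliency-aware hybrid quantization under a simplifying assumption that weight magnitude stands in for the saliency score: a small salient fraction of weights is kept in higher precision while the remainder is binarized with a per-tensor scale. Bi-VLM's actual saliency metric and non-uniform partition strategy differ and are detailed in the dissertation.

```python
# Minimal sketch of magnitude-based hybrid quantization (illustrative only).
import torch

def hybrid_quantize(w: torch.Tensor, salient_frac: float = 0.05):
    """Split w into a binarized bulk and a small high-precision salient part."""
    flat = w.abs().flatten()
    k = max(1, int(salient_frac * flat.numel()))
    threshold = flat.topk(k).values.min()          # magnitude cutoff for saliency
    salient_mask = w.abs() >= threshold

    bulk = w.masked_fill(salient_mask, 0.0)
    scale = bulk.abs().sum() / (~salient_mask).sum().clamp(min=1)  # mean |w| of bulk
    w_binary = torch.sign(bulk) * scale            # 1-bit representation of the bulk

    w_salient = (w * salient_mask).half()          # keep salient weights in fp16
    return w_binary, w_salient, salient_mask

# Reconstruction combines both parts; the salient outliers dominate the error budget.
w = torch.randn(256, 256)
w_bin, w_sal, mask = hybrid_quantize(w)
w_hat = torch.where(mask, w_sal.float(), w_bin)
print(f"relative error: {(w - w_hat).norm() / w.norm():.3f}")
```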
