TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING

dc.contributor.advisor: Manocha, Dinesh
dc.contributor.advisor: Lin, Ming
dc.contributor.author: Wang, Xijun
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2025-09-15T05:32:10Z
dc.date.issued: 2025
dc.description.abstract: “If a picture is worth a thousand words, what is a video worth?” Video, due to its inherent richness compared to language, plays a pivotal role in conveying complex information. However, video understanding faces numerous challenges, including selecting informative frames, addressing domain shifts, semantic grounding, reasoning and attention deficits, and significant computational burdens. Recent advancements in computer vision underscore the need to address these challenges through effective and efficient approaches, which are crucial for applications ranging from autonomous systems to human-computer interaction that require high accuracy and low latency. In this dissertation, we address five critical issues to overcome these challenges: dataset development, preprocessing, visual reasoning, multimodal alignment, and computational acceleration. High-quality datasets serve as foundational building blocks, providing diverse, comprehensive, and representative data to train models capable of handling real-world complexity. We propose the METEOR dataset, tailored for autonomous driving in dense, heterogeneous, and unstructured traffic scenarios with rare and challenging conditions. Additionally, we develop DAVE, a comprehensive benchmark dataset designed to advance video understanding research for the safety of vulnerable road users in complex and unpredictable environments. Our analysis reveals substantial shortcomings of current object detection and behavior prediction models when evaluated on METEOR and DAVE. Complementing these datasets, for preprocessing we propose AZTR, which incorporates an automatic zooming algorithm for dynamic target scaling and a temporal reasoning mechanism to accurately capture action sequences. We also introduce MITFAS, a mutual-information-based alignment and sampling method designed for UAV video action recognition, where challenges include varying human resolutions, large positional changes between frames, and occluded action features. For visual reasoning, we introduce SCP, which guides the model to explicitly learn input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge, effectively capturing discriminative patterns and significantly improving accuracy on challenging datasets. We also develop ICAR, a compatibility learning framework with a novel category-aware Flexible Bidirectional Transformer (FBT) that effectively generates features across different domains based on visual similarity and complementarity for reasoning tasks. For multimodal alignment, we propose ViLA, which addresses both efficient frame sampling and effective cross-modal alignment in a unified way. Finally, we propose Bi-VLM, an ultra-low-precision post-training quantization method that bridges the gap between computational demands and practical limitations. It employs a saliency-aware hybrid quantization algorithm combined with a non-uniform model weight partition strategy, significantly reducing computational cost with little loss in overall model performance.
dc.identifier: https://doi.org/10.13016/berw-jiei
dc.identifier.uri: http://hdl.handle.net/1903/34623
dc.language.iso: en
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pqcontrolled: Robotics
dc.subject.pquncontrolled: Multimodal Alignment
dc.subject.pquncontrolled: Video Dataset
dc.subject.pquncontrolled: Video Understanding
dc.subject.pquncontrolled: Visual Reasoning
dc.title: TOWARDS EFFECTIVE AND EFFICIENT VIDEO UNDERSTANDING
dc.type: Dissertation
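
The abstract describes MITFAS as a mutual-information-based alignment and sampling method for UAV video action recognition. As a rough illustration of how mutual information can drive frame sampling, the sketch below greedily selects frames that are least redundant (lowest MI) with the frames already chosen; the histogram MI estimator, the greedy criterion, and the names `mutual_information` and `sample_frames` are assumptions for illustration, not the dissertation's actual algorithm.

```python
# Illustrative, assumption-based sketch of mutual-information-guided frame
# sampling; not the MITFAS implementation from the dissertation.
import numpy as np

def mutual_information(x, y, bins=32):
    """Estimate MI between two grayscale frames via a 2D histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def sample_frames(frames, k):
    """Greedily pick k frames: each new frame minimizes redundancy (max MI)
    with the frames already selected, keeping the sampled set informative."""
    selected = [0]                      # always keep the first frame
    while len(selected) < k:
        best, best_score = None, np.inf
        for i in range(len(frames)):
            if i in selected:
                continue
            # redundancy = highest MI against any already-selected frame
            score = max(mutual_information(frames[i], frames[j]) for j in selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)

if __name__ == "__main__":
    video = [np.random.rand(64, 64) for _ in range(16)]  # stand-in UAV clip
    print(sample_frames(video, k=4))
```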
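
For SCP's idea of combining input-invariant "prompt experts" with input-specific, data-dependent prompts, a minimal PyTorch sketch might look like the following. The module name `DualPrompt`, the dimensions, and the way the two prompt types are concatenated are assumptions, since the abstract does not specify the architecture.

```python
# Hypothetical sketch of the dual-prompt idea: shared learnable tokens plus
# prompts generated from each sample's features. Not SCP's actual design.
import torch
import torch.nn as nn

class DualPrompt(nn.Module):
    def __init__(self, dim=512, num_experts=4, prompt_len=8):
        super().__init__()
        # Input-invariant prompt experts: shared learnable tokens.
        self.experts = nn.Parameter(torch.randn(num_experts * prompt_len, dim) * 0.02)
        # Lightweight generator for input-specific (data-dependent) prompts.
        self.generator = nn.Linear(dim, prompt_len * dim)
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, feat):                 # feat: (B, dim) pooled video feature
        b = feat.size(0)
        invariant = self.experts.unsqueeze(0).expand(b, -1, -1)   # same for every input
        specific = self.generator(feat).view(b, self.prompt_len, self.dim)  # per input
        # Both prompt types would be prepended to a frozen backbone's tokens.
        return torch.cat([invariant, specific], dim=1)

prompts = DualPrompt()(torch.randn(2, 512))
print(prompts.shape)  # torch.Size([2, 40, 512])
```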
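
The abstract also mentions Bi-VLM's saliency-aware hybrid quantization with a non-uniform weight partition. The NumPy sketch below conveys the general flavor under stated assumptions: weights are partitioned by a simple magnitude-based saliency proxy, the most salient fraction is kept at higher precision, and the remainder is binarized with a per-row scale. The saliency score, the 10% split, and the bit-widths are illustrative choices rather than the method's actual design.

```python
# Assumption-based sketch of saliency-aware hybrid quantization; not the
# Bi-VLM algorithm itself.
import numpy as np

def hybrid_quantize(w, salient_frac=0.1):
    """Keep the top `salient_frac` of weights (by |w|) in float16 and
    binarize the rest with a per-row scale."""
    saliency = np.abs(w)                                  # simple saliency proxy
    cutoff = np.quantile(saliency, 1.0 - salient_frac)    # non-uniform partition
    salient_mask = saliency >= cutoff

    quantized = np.empty_like(w, dtype=np.float32)
    # Higher-precision path for the salient minority.
    quantized[salient_mask] = w[salient_mask].astype(np.float16)
    # 1-bit path for everything else: sign * per-row scale.
    for r in range(w.shape[0]):
        rest = ~salient_mask[r]
        if rest.any():
            scale = np.abs(w[r, rest]).mean()
            quantized[r, rest] = np.sign(w[r, rest]) * scale
    return quantized, salient_mask

if __name__ == "__main__":
    w = np.random.randn(4, 16).astype(np.float32)
    q, mask = hybrid_quantize(w)
    print("fraction kept at higher precision:", mask.mean())
    print("mean reconstruction error:", np.abs(w - q).mean())
```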

Files

Original bundle

Name: Wang_umd_0117E_25449.pdf
Size: 66.35 MB
Format: Adobe Portable Document Format