LEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROL
| dc.contributor.advisor | Shrivastava, Abhinav | en_US |
| dc.contributor.author | Huang, Shuaiyi | en_US |
| dc.contributor.department | Computer Science | en_US |
| dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
| dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
| dc.date.accessioned | 2025-09-15T05:33:11Z | |
| dc.date.issued | 2025 | en_US |
| dc.description.abstract | The rapid growth of visual data and the increasing demand for intelligent robotic systems have created a pressing need for methods that can establish meaningful correspondences and relationships across diverse visual modalities and robotic tasks. This dissertation addresses the fundamental challenge of learning structured alignment, which involves establishing correspondences between different representations, temporal sequences, and task domains to enable more effective visual understanding and robot control. In the first part of this thesis, we advance visual understanding through three key contributions that demonstrate the power of structured alignment in perception tasks. We begin by tackling semantic correspondence, where we propose a teacher-student learning paradigm that enriches supervision from sparse keypoint annotations, enabling dense correspondence learning through spatial priors and loss-driven dynamic label selection. We then address video instance segmentation through two complementary approaches: UVIS, which leverages foundation models (DINO and CLIP) for unsupervised segmentation without dense annotations, and PointVIS, which achieves competitive performance using only point-level supervision through class-agnostic proposal generation and spatio-temporal matching. Finally, we develop Trokens for few-shot action recognition, introducing semantic-aware point correspondence sampling and relational motion alignment that captures both intra-trajectory dynamics through Histogram of Oriented Displacements and inter-trajectory spatial relationships, effectively aligning appearance features with motion patterns through trajectory-based token alignment. While the first part focuses on establishing correspondences within visual data, real-world applications require bridging the gap between visual understanding and robot control. In the second part of this thesis, we present two frameworks that demonstrate how structured alignment can be extended to robotic applications. We introduce ARDuP, a novel method for video-based policy learning that aligns generated visual plans with language instructions for effective control. The framework integrates active region (i.e., potential interaction areas) conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamics modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. Finally, we present TREND, which addresses robust preference-based reinforcement learning through a tri-teaching framework that filters noisy preference labels while incorporating few-shot expert demonstrations, demonstrating effective alignment between human preferences and robot behaviors even under high noise conditions. | en_US |
| dc.identifier | https://doi.org/10.13016/wb7d-k9fk | |
| dc.identifier.uri | http://hdl.handle.net/1903/34629 | |
| dc.language.iso | en | en_US |
| dc.subject.pqcontrolled | Computer science | en_US |
| dc.subject.pquncontrolled | Alignment | en_US |
| dc.subject.pquncontrolled | Computer Vision | en_US |
| dc.subject.pquncontrolled | Deep Learning | en_US |
| dc.subject.pquncontrolled | Generation | en_US |
| dc.subject.pquncontrolled | Recognition | en_US |
| dc.subject.pquncontrolled | Robotics | en_US |
| dc.title | LEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROL | en_US |
| dc.type | Dissertation | en_US |
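The abstract describes Trokens as capturing intra-trajectory dynamics with a Histogram of Oriented Displacements. For reference, here is a minimal NumPy sketch of such a descriptor for a single point track; the bin count, magnitude weighting, and normalization are illustrative assumptions, not the exact formulation used in the dissertation.

```python
import numpy as np

def histogram_of_oriented_displacements(track, num_bins=8, eps=1e-8):
    """Summarize the motion of one point track as an orientation histogram.

    track: (T, 2) array of (x, y) positions over T frames.
    Returns a (num_bins,) L1-normalized histogram where each frame-to-frame
    displacement votes into its orientation bin, weighted by its magnitude.
    (Illustrative sketch; Trokens defines its own variant.)
    """
    disp = np.diff(track, axis=0)                 # (T-1, 2) displacements
    mag = np.linalg.norm(disp, axis=1)            # displacement magnitudes
    ang = np.arctan2(disp[:, 1], disp[:, 0])      # orientations in [-pi, pi)
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins, mag)                    # magnitude-weighted votes
    return hist / (hist.sum() + eps)

# Example: a point drifting right then up puts mass in two orientation bins.
track = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]], dtype=float)
print(histogram_of_oriented_displacements(track))
```

Because each displacement votes with its magnitude, slow jitter contributes little while sustained motion dominates the histogram, which is what makes this family of descriptors useful for comparing trajectories across videos.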
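The abstract also states that ARDuP decodes actions directly from latent representations during inverse dynamics modeling. The sketch below shows one plausible shape for such a latent-space inverse-dynamics head in PyTorch; the MLP architecture and the `latent_dim` and `action_dim` values are placeholders, not the dissertation's implementation, and the latents would in practice come from the latent diffusion planner.

```python
import torch
import torch.nn as nn

class LatentInverseDynamics(nn.Module):
    """Minimal sketch of an inverse-dynamics head: predict the action that
    transitions between the latents of two consecutive planned frames.
    Dimensions and architecture are illustrative assumptions."""

    def __init__(self, latent_dim=256, action_dim=7, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t, z_next):
        # Concatenate current and next latent, regress the connecting action.
        return self.net(torch.cat([z_t, z_next], dim=-1))

# Decode an action sequence from a planned latent rollout (B, T, latent_dim).
model = LatentInverseDynamics()
plan = torch.randn(4, 10, 256)
actions = torch.stack(
    [model(plan[:, t], plan[:, t + 1]) for t in range(plan.shape[1] - 1)],
    dim=1,
)  # (B, T-1, action_dim)
```

Operating on planner latents rather than decoded pixels is the design point the abstract emphasizes: the policy never needs rendered frames at control time.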
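Finally, TREND is described as a tri-teaching framework that filters noisy preference labels with three peer models. Below is a hedged sketch of one agreement-based filtering rule in the spirit of co-teaching; the function name, interface, and keep criterion are assumptions for illustration, and TREND defines its own criteria and additionally uses few-shot expert demonstrations, which this sketch omits.

```python
import torch

def filter_preferences(reward_models, obs_a, obs_b, labels):
    """Tri-teaching-style filtering sketch: keep a preference label for
    training one reward model only when the other two peer models agree
    with that label, screening out likely-noisy annotations.

    reward_models: list of 3 nn.Modules mapping observations to scalar rewards.
    obs_a, obs_b:  (N, obs_dim) observation batches for each segment of a pair.
    labels:        (N,) long tensor, 1 if segment a is preferred, else 0.
    Returns one boolean keep-mask per model. (Hypothetical interface.)
    """
    with torch.no_grad():
        # Each model's predicted preference: does it score segment a higher?
        preds = [
            (m(obs_a).squeeze(-1) > m(obs_b).squeeze(-1)).long()
            for m in reward_models
        ]
    masks = []
    for i in range(3):
        peers = [preds[j] for j in range(3) if j != i]
        # Keep the sample for model i only if both peers agree with the label.
        masks.append((peers[0] == labels) & (peers[1] == labels))
    return masks
```

Peer-agreement filtering trades recall for precision on the label set: under heavy annotation noise, each model trains only on pairs its peers corroborate, which limits the propagation of corrupted preferences into the learned reward.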