LEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROL

dc.contributor.advisorShrivastava, Abhinaven_US
dc.contributor.authorHuang, Shuaiyien_US
dc.contributor.departmentComputer Scienceen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2025-09-15T05:33:11Z
dc.date.issued2025en_US
dc.description.abstractThe rapid growth of visual data and the increasing demand for intelligent robotic systems have created a pressing need for methods that can establish meaningful correspondences and relationships across diverse visual modalities and robotic tasks. This dissertation addresses the fundamental challenge of learning structured alignment, which involves establishing correspondences between different representations, temporal sequences, and task domains to enable more effective visual understanding and robot control.

In the first part of this thesis, we advance visual understanding through three key contributions that demonstrate the power of structured alignment in perception tasks. We begin by tackling semantic correspondence, where we propose a teacher-student learning paradigm that enriches supervision from sparse keypoint annotations, enabling dense correspondence learning through spatial priors and loss-driven dynamic label selection. We then address video instance segmentation through two complementary approaches: UVIS, which leverages foundation models (DINO and CLIP) for unsupervised segmentation without dense annotations, and PointVIS, which achieves competitive performance using only point-level supervision through class-agnostic proposal generation and spatio-temporal matching. Finally, we develop Trokens for few-shot action recognition, introducing semantic-aware point correspondence sampling and relational motion alignment that captures both intra-trajectory dynamics through Histograms of Oriented Displacements and inter-trajectory spatial relationships, effectively aligning appearance features with motion patterns through trajectory-based token alignment.

While the first part focuses on establishing correspondences within visual data, real-world applications require bridging the gap between visual understanding and robot control. In the second part of this thesis, we present two frameworks that demonstrate how structured alignment can be extended to robotic applications. We introduce ARDuP, a novel method for video-based policy learning that aligns generated visual plans with language instructions for effective control. This framework integrates active region (i.e., potential interaction area) conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamics modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. Finally, we present TREND, which addresses robust preference-based reinforcement learning through a tri-teaching framework that filters noisy preference labels while incorporating few-shot expert demonstrations, demonstrating effective alignment between human preferences and robot behaviors even under high noise conditions.en_US
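To make the trajectory-motion descriptor mentioned above concrete, the following is a minimal sketch of a Histogram of Oriented Displacements computed over a single point trajectory. The function name, bin count, and magnitude weighting are illustrative assumptions for exposition, not the dissertation's implementation.

```python
import numpy as np

def hist_oriented_displacements(trajectory, num_bins=8):
    """Histogram of Oriented Displacements (HOD) for one point trajectory.

    trajectory: (T, 2) array of (x, y) positions over T frames.
    Returns a normalized num_bins-dim histogram of displacement angles,
    weighted by displacement magnitude (illustrative choice).
    """
    disp = np.diff(trajectory, axis=0)                # (T-1, 2) frame-to-frame displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])       # orientation of each displacement
    magnitudes = np.linalg.norm(disp, axis=1)         # weight each step by how far the point moved
    hist, _ = np.histogram(angles, bins=num_bins,
                           range=(-np.pi, np.pi), weights=magnitudes)
    total = hist.sum()
    return hist / total if total > 0 else hist        # normalize to a motion descriptor

# Example: a mostly rightward-drifting trajectory puts most mass in the bins near angle 0.
traj = np.cumsum(np.random.randn(16, 2) * 0.1 + np.array([1.0, 0.0]), axis=0)
print(hist_oriented_displacements(traj))
```

Descriptors of this form summarize intra-trajectory dynamics independently of absolute position, which is why they pair naturally with a separate term for inter-trajectory spatial relationships.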
dc.identifierhttps://doi.org/10.13016/wb7d-k9fk
dc.identifier.urihttp://hdl.handle.net/1903/34629
dc.language.isoenen_US
dc.subject.pqcontrolledComputer scienceen_US
dc.subject.pquncontrolledAlignmenten_US
dc.subject.pquncontrolledComputer Visionen_US
dc.subject.pquncontrolledDeep Learningen_US
dc.subject.pquncontrolledGenerationen_US
dc.subject.pquncontrolledRecognitionen_US
dc.subject.pquncontrolledRoboticsen_US
dc.titleLEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROLen_US
dc.typeDissertationen_US

Files

Original bundle

Name: Huang_umd_0117E_25461.pdf
Size: 31.84 MB
Format: Adobe Portable Document Format