LEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROL
| dc.contributor.advisor | Shrivastava, Abhinav | en_US |
| dc.contributor.author | Huang, Shuaiyi | en_US |
| dc.contributor.department | Computer Science | en_US |
| dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
| dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
| dc.date.accessioned | 2025-09-15T05:33:11Z | |
| dc.date.issued | 2025 | en_US |
| dc.description.abstract | The rapid growth of visual data and the increasing demand for intelligent robotic systems have created a pressing need for methods that can establish meaningful correspondences and relationships across diverse visual modalities and robotic tasks. This dissertation addresses the fundamental challenge of learning structured alignment, which involves establishing correspondences between different representations, temporal sequences, and task domains to enable more effective visual understanding and robot control. In the first part of this thesis, we advance visual understanding through three key contributions that demonstrate the power of structured alignment in perception tasks. We begin by tackling semantic correspondence, where we propose a teacher-student learning paradigm that enriches supervision from sparse keypoint annotations, enabling dense correspondence learning through spatial priors and loss-driven dynamic label selection. We then address video instance segmentation through two complementary approaches: UVIS, which leverages foundation models (DINO and CLIP) for unsupervised segmentation without dense annotations, and PointVIS, which achieves competitive performance using only point-level supervision through class-agnostic proposal generation and spatio-temporal matching. Finally, we develop Trokens for few-shot action recognition, introducing semantic-aware point correspondence sampling and relational motion alignment that captures both intra-trajectory dynamics through Histogram of Oriented Displacements and inter-trajectory spatial relationships, effectively aligning appearance features with motion patterns through trajectory-based token alignment. While the first part focuses on establishing correspondences within visual data, real-world applications require bridging the gap between visual understanding and robot control. In the second part of this thesis, we present two frameworks that demonstrate how structured alignment can be extended to robotic applications. We introduce ARDuP, a novel method for video-based policy learning that aligns generated visual plans with language instructions for effective control. The framework integrates active region (i.e., potential interaction areas) conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamics modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. Finally, we present TREND, which addresses robust preference-based reinforcement learning through a tri-teaching framework that filters noisy preference labels while incorporating few-shot expert demonstrations, demonstrating effective alignment between human preferences and robot behaviors even under high noise conditions. | en_US |
| dc.identifier | https://doi.org/10.13016/wb7d-k9fk | |
| dc.identifier.uri | http://hdl.handle.net/1903/34629 | |
| dc.language.iso | en | en_US |
| dc.subject.pqcontrolled | Computer science | en_US |
| dc.subject.pquncontrolled | Alignment | en_US |
| dc.subject.pquncontrolled | Computer Vision | en_US |
| dc.subject.pquncontrolled | Deep Learning | en_US |
| dc.subject.pquncontrolled | Generation | en_US |
| dc.subject.pquncontrolled | Recognition | en_US |
| dc.subject.pquncontrolled | Robotics | en_US |
| dc.title | LEARNING STRUCTURED ALIGNMENT: FROM VISUAL UNDERSTANDING TO ROBOT CONTROL | en_US |
| dc.type | Dissertation | en_US |
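The abstract describes Trokens as capturing intra-trajectory dynamics with a Histogram of Oriented Displacements. For reference, here is a minimal NumPy sketch of such a descriptor for a single point track; the bin count, magnitude weighting, and normalization are illustrative assumptions, not the exact formulation used in the dissertation.

```python
import numpy as np

def histogram_of_oriented_displacements(track, num_bins=8, eps=1e-8):
    """Summarize the motion of one point track as an orientation histogram.

    track: (T, 2) array of (x, y) positions over T frames.
    Returns a (num_bins,) L1-normalized histogram where each frame-to-frame
    displacement votes into its orientation bin, weighted by its magnitude.
    (Illustrative sketch; Trokens defines its own variant.)
    """
    disp = np.diff(track, axis=0)                 # (T-1, 2) displacements
    mag = np.linalg.norm(disp, axis=1)            # displacement magnitudes
    ang = np.arctan2(disp[:, 1], disp[:, 0])      # orientations in [-pi, pi)
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins, mag)                    # magnitude-weighted votes
    return hist / (hist.sum() + eps)

# Example: a point drifting right then up puts mass in two orientation bins.
track = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]], dtype=float)
print(histogram_of_oriented_displacements(track))
```

Because each displacement votes with its magnitude, slow jitter contributes little while sustained motion dominates the histogram, which is what makes this family of descriptors useful for comparing trajectories across videos.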
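The abstract also states that ARDuP decodes actions directly from latent representations during inverse dynamics modeling. The sketch below shows one plausible shape for such a latent-space inverse-dynamics head in PyTorch; the MLP architecture and the `latent_dim` and `action_dim` values are placeholders, not the dissertation's implementation, and the latents would in practice come from the latent diffusion planner.

```python
import torch
import torch.nn as nn

class LatentInverseDynamics(nn.Module):
    """Minimal sketch of an inverse-dynamics head: predict the action that
    transitions between the latents of two consecutive planned frames.
    Dimensions and architecture are illustrative assumptions."""

    def __init__(self, latent_dim=256, action_dim=7, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t, z_next):
        # Concatenate current and next latent, regress the connecting action.
        return self.net(torch.cat([z_t, z_next], dim=-1))

# Decode an action sequence from a planned latent rollout (B, T, latent_dim).
model = LatentInverseDynamics()
plan = torch.randn(4, 10, 256)
actions = torch.stack(
    [model(plan[:, t], plan[:, t + 1]) for t in range(plan.shape[1] - 1)],
    dim=1,
)  # (B, T-1, action_dim)
```

Operating on planner latents rather than decoded pixels is the design point the abstract emphasizes: the policy never needs rendered frames at control time.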
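Finally, TREND is described as a tri-teaching framework that filters noisy preference labels with three peer models. Below is a hedged sketch of one agreement-based filtering rule in the spirit of co-teaching; the function name, interface, and keep criterion are assumptions for illustration, and TREND defines its own criteria and additionally uses few-shot expert demonstrations, which this sketch omits.

```python
import torch

def filter_preferences(reward_models, obs_a, obs_b, labels):
    """Tri-teaching-style filtering sketch: keep a preference label for
    training one reward model only when the other two peer models agree
    with that label, screening out likely-noisy annotations.

    reward_models: list of 3 nn.Modules mapping observations to scalar rewards.
    obs_a, obs_b:  (N, obs_dim) observation batches for each segment of a pair.
    labels:        (N,) long tensor, 1 if segment a is preferred, else 0.
    Returns one boolean keep-mask per model. (Hypothetical interface.)
    """
    with torch.no_grad():
        # Each model's predicted preference: does it score segment a higher?
        preds = [
            (m(obs_a).squeeze(-1) > m(obs_b).squeeze(-1)).long()
            for m in reward_models
        ]
    masks = []
    for i in range(3):
        peers = [preds[j] for j in range(3) if j != i]
        # Keep the sample for model i only if both peers agree with the label.
        masks.append((peers[0] == labels) & (peers[1] == labels))
    return masks
```

Peer-agreement filtering trades recall for precision on the label set: under heavy annotation noise, each model trains only on pairs its peers corroborate, which limits the propagation of corrupted preferences into the learned reward.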