Towards Generalist Embodied Agents via Representation Learning

Advisor

Huang, Furong
Daumé III, Hal

Abstract

In our dynamic and ever-evolving world, embodied agents for sequential decision-making (SDM) lie at the heart of intelligent behavior in machine learning systems. Just as large-scale pretrained foundation models have revolutionized natural language processing and computer vision, foundation models for SDM hold similar potential by capturing the structure and semantics of decision trajectories. This thesis addresses this challenge from the perspective of representation learning: specifically, how to learn compact yet expressive state and action abstractions that are well suited for downstream policy learning in embodied agents. To this end, it explores both state and action representations and further introduces a surprisingly simple yet effective approach that leverages explicit visual prompting to harness the grounding capabilities of modern vision-language-action (VLA) foundation models, bridging the gap between perception and action.

In the first part of the thesis, a temporal contrastive learning objective, TACO, is proposed for visual representation learning. This method enables the learned embeddings to encode control-relevant dynamics in a compact latent space, significantly improving data efficiency during policy learning. When used for pretraining, these representations allow embodied agents to generalize to novel tasks with minimal expert demonstrations. Building on this idea of future latent prediction, the approach is further scaled to recent large VLA models through FLARE, which augments the standard action prediction objective with a future latent alignment loss. This extension achieves state-of-the-art policy learning performance on multitask benchmarks and enables learning from action-free human video data.
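As a rough illustration of the future-latent objective behind TACO and FLARE, the sketch below shows an InfoNCE-style loss that pushes a predicted future embedding toward the true one while contrasting against in-batch negatives. The function name, tensor shapes, and the `project` head are illustrative assumptions, not the thesis's exact formulation.

import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t, a_seq, z_future, project, temperature=0.1):
    # z_t:      (B, D) current state embeddings
    # a_seq:    (B, A) flattened embedding of the K intervening actions
    # z_future: (B, D) embeddings of the state K steps ahead (positives)
    # project:  a small network mapping [z_t; a_seq] -> predicted future latent
    pred = project(torch.cat([z_t, a_seq], dim=-1))        # (B, D)
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(z_future, dim=-1)
    logits = pred @ target.t() / temperature               # (B, B) cosine similarities
    labels = torch.arange(z_t.size(0), device=z_t.device)  # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)                 # InfoNCE over in-batch negatives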

In addition to state representation, the second part of the thesis investigates how temporal action representations can be leveraged for more efficient policy learning through horizon reduction. Inspired by the recent success of large language models (LLMs), I develop PRISE, a simple yet effective framework for learning temporally abstracted action representations. By capturing higher-level temporal structure, this approach shortens the effective planning horizon, substantially improving the performance of multitask imitation learning algorithms and enhancing generalization to unseen tasks with limited demonstrations.
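To make the idea of temporally abstracted action tokens concrete, here is a simplified sketch in the spirit of PRISE: continuous actions are first discretized against a codebook, and a byte-pair-encoding (BPE) merge step then promotes frequent adjacent token pairs into longer "skill" tokens. The codebook, helper names, and the state-free quantization are assumptions made for illustration; the actual tokenizer in the thesis is learned jointly with state information.

import torch
from collections import Counter

def quantize_actions(actions, codebook):
    # actions:  (T, A) continuous action sequence
    # codebook: (K, A) action codes; the nearest code gives the token id
    return torch.cdist(actions, codebook).argmin(dim=-1).tolist()

def bpe_merge_once(sequences, next_token_id):
    # Count every adjacent token pair across all trajectories
    pairs = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
    if not pairs:
        return sequences
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_token_id)  # replace the pair with a new, longer token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

Repeating the merge step grows a vocabulary of variable-length action primitives, which is what shortens the effective decision horizon for the downstream policy.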

Finally, beyond learning-based state and action representations, the last part of the thesis explores how symbolic representations can further enhance the efficiency of policy learning in embodied agents. Pretrained on diverse internet-scale vision and text data, recent vision-language models (VLMs) possess strong visual and semantic understanding capabilities but struggle to ground their perception in executable 3D robot actions. I demonstrate that symbolic representations, such as visual traces, can help these large VLMs bridge perception and action generation. To this end, I introduce TraceVLA, an explicit visual prompting technique that encodes a robot’s execution history as a symbolic visual trace. This representation provides large VLA models with richer spatio-temporal context, leading to more robust generalization across embodiments and outperforming existing VLA baselines on various real-world robotic manipulation tasks.
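The visual-prompting idea can be pictured with a few lines of image drawing: the agent's recent 2D trajectory is rendered directly onto the current camera frame before the frame is fed to the VLA model. This sketch assumes the trace is already available as pixel coordinates (TraceVLA obtains such traces with point tracking); the function and its defaults are illustrative.

from PIL import Image, ImageDraw

def overlay_visual_trace(frame, trace_xy, color=(255, 0, 0), width=3):
    # frame:    PIL.Image with the current camera view
    # trace_xy: [(x, y), ...] pixel positions of a tracked point over past steps
    img = frame.copy()
    draw = ImageDraw.Draw(img)
    if len(trace_xy) >= 2:
        draw.line(trace_xy, fill=color, width=width)  # connect past positions
    for x, y in trace_xy:
        draw.ellipse((x - 2, y - 2, x + 2, y + 2), fill=color)  # mark each waypoint
    return img  # prompt image carrying spatio-temporal context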

Together, these works present a unified framework for world-model-based representation learning for embodied agents. By jointly advancing state, action, and symbolic abstractions, this thesis takes a step toward scalable foundation models for sequential decision-making, capable of reasoning, acting, and learning across diverse tasks and embodiments.
