Translating Natural Language to Visually Grounded Verifiable Plans

Advisor

Aloimonos, Yiannis

Abstract

To be useful in household environments, robots may need to understand natural language in order to parse and execute verbal commands from novice users. This is a challenging problem that requires mapping linguistic constituents to physical entities and at the same time orchestrating an action plan that utilizes these entities to complete a task. Planning problems that previously relied on querying manually crafted knowledge bases can now leverage Large Language Models (LLMs) as a source of commonsense reasoning to map high-level instructions to action plans. However, the produced plans often suffer from model hallucinations, ignore action preconditions, or omit essential intermediate actions under the assumption that users can infer them from context and prior experience. In this thesis, we present our work on translating natural language instructions to visually grounded verifiable plans.

First, we motivate the use of classical concepts such as Linear Temporal Logic (LTL) to verify LLM-generated action plans. Our key insight is that combining a source of cooking domain knowledge with a formalism that captures the temporal richness of cooking recipes could enable the extraction of unambiguous, robot-executable plans. Building on this insight, we present Cook2LTL, a system that receives a cooking recipe in natural language, reduces high-level cooking actions to robot-executable primitive actions through the use of LLMs, and produces unambiguous task specifications written as LTL formulae. By expressing action plans in a formal language that adheres to a set of rules and specifications, we can generate discrete robot controllers with provable performance guarantees.
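
As a rough illustration of the kind of sequential specification described above, the Python sketch below composes a nested "eventually" (F) formula from an ordered list of primitive cooking actions, so that each action must eventually occur after the previous one. The Action class, the composer, and the function-style atoms are hypothetical and do not reflect Cook2LTL's actual grammar or output format.

    # Illustrative sketch only: compose a sequential LTL specification
    # F(a1 & F(a2 & F(a3))) from an ordered list of primitive actions.
    from dataclasses import dataclass

    @dataclass
    class Action:
        name: str
        args: tuple

        def atom(self):
            return f"{self.name}({', '.join(self.args)})"

    def sequential_ltl(actions):
        """Nest the 'eventually' operator so each action follows the previous one."""
        formula = ""
        for action in reversed(actions):
            formula = f"F({action.atom()} & {formula})" if formula else f"F({action.atom()})"
        return formula

    recipe_steps = [
        Action("chop", ("tomato",)),
        Action("place", ("tomato", "pan")),
        Action("turn_on", ("stove",)),
    ]
    print(sequential_ltl(recipe_steps))
    # F(chop(tomato) & F(place(tomato, pan) & F(turn_on(stove))))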

Second, we focus on grounding linguistic instructions in visual sensory information, and we find that Vision-Language Models (VLMs) often struggle to identify non-visual attributes. Our key insight is that non-visual attribute detection can be achieved effectively through active perception guided by visual reasoning. To this end, we present a Perception-Action API consisting of perceptual and motoric functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify the queried attributes given an input image.
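
A rough sketch of this idea follows: a mocked-up Perception-Action API with one perceptual and two motoric primitives, together with the kind of program an LLM might generate to check a non-visual attribute such as softness. All function names, signatures, and sensor readings here are assumptions for illustration and do not correspond to the actual API.

    # Hypothetical Perception-Action API (all stubs) and an LLM-generated-style
    # program that uses it to check a non-visual attribute.
    def detect(image, query):
        """Perceptual primitive: return a (mock) bounding box for the queried object."""
        return {"label": query, "bbox": (120, 80, 200, 160)}

    def move_to(bbox):
        """Motoric primitive: move the end-effector above the detected region (stub)."""
        print(f"moving gripper above {bbox}")

    def poke_and_read_force():
        """Motoric/proprioceptive primitive: press lightly and return resistance (stub)."""
        return 2.3  # Newtons, mock reading

    def is_ripe(image):
        """Program of the kind an LLM might generate for 'is the peach ripe?'."""
        peach = detect(image, "peach")
        move_to(peach["bbox"])
        return poke_and_read_force() < 3.0  # soft objects yield under low force

    print(is_ripe(image=None))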

Third, we present NL2PDDL2Prog, a system that adopts the Planning Domain Definition Language (PDDL) as an action representation, combining the ability of LLMs to decompose a high-level task into a set of actions with the correctness of symbolic planning. Prior work has often relied on manually crafting PDDL domains, a difficult and tedious process, especially for non-experts. To circumvent this, we obtain visual observations before and after the execution of an admissible action in our environment. We pass these observations to a VLM to derive the action semantics, which are then sent to an LLM to infer the entire domain. Given the generated domain and an initial visual observation of the scene, the LLM produces a PDDL problem description that is solved by a symbolic planner and parsed into an executable Python program. By binding the perceptual functions to the action preconditions and effects explicitly modeled in the PDDL domain, we visually validate successful action execution at runtime, producing visually grounded verifiable action plans.
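
The sketch below illustrates, under assumed names, how a PDDL-style action schema can be bound to perceptual checks so that preconditions are verified before execution and effects afterwards; the schema, the predicates, and the mock perception functions are illustrative and not the actual NL2PDDL2Prog domain or interface.

    # Illustrative only: a PDDL-style action schema and runtime visual validation
    # of its precondition and effect via (mock) perceptual functions.
    PICK_ACTION = """
    (:action pick
      :parameters (?o - object)
      :precondition (and (on-counter ?o) (hand-empty))
      :effect (and (holding ?o) (not (on-counter ?o)) (not (hand-empty))))
    """

    def on_counter(obj):
        return True   # mock perceptual check (e.g., backed by a VLM query)

    def holding(obj):
        return True   # mock perceptual check after execution

    def execute_pick(obj):
        assert on_counter(obj), f"precondition failed: {obj} not on counter"
        print(f"executing pick({obj})")
        assert holding(obj), f"effect not observed: not holding {obj}"

    execute_pick("tomato")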

To demonstrate the applicability of our work in the real world, we design a ROS-powered robotic system capable of receiving natural language instructions and implementing simple cooking recipes on a kitchen counter. We begin by bootstrapping a proof-of-concept system in which each object carries an ArUco marker to facilitate tracking. At runtime, our system receives a natural language instruction, calls Cook2LTL or NL2PDDL2Prog, and passes the produced action plan to a Python Pick-and-Place API that we developed for recipe execution on a Sawyer robot. We include demonstrations of experiments conducted on a simple burger recipe using artificial food items in the laboratory.
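
As a usage sketch, the snippet below dispatches a generated plan to a minimal pick-and-place interface keyed on ArUco marker IDs. The PickAndPlace class, its methods, and the marker table are illustrative assumptions and not the actual API we developed for the Sawyer robot.

    # Hypothetical dispatch of an action plan to a pick-and-place interface.
    MARKER_IDS = {"bottom_bun": 3, "patty": 7, "top_bun": 9}  # assumed marker table

    class PickAndPlace:
        def pick(self, marker_id):
            print(f"picking object tagged with marker {marker_id}")

        def place(self, marker_id, target):
            print(f"placing marker {marker_id} on {target}")

    def execute_plan(plan, api):
        """Each plan step is (object, destination), e.g. from Cook2LTL or NL2PDDL2Prog."""
        for obj, dest in plan:
            api.pick(MARKER_IDS[obj])
            api.place(MARKER_IDS[obj], dest)

    burger_plan = [("bottom_bun", "plate"), ("patty", "bottom_bun"), ("top_bun", "patty")]
    execute_plan(burger_plan, PickAndPlace())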

To conclude, we discuss ongoing and future work on improving our existing systems. We plan to incorporate object affordances into the safeguarding formalisms we have used to verify LLM plans. This can be achieved by introducing a more fine-grained action representation that supports lower-level primitive actions and produces affordance-aware policies. We also focus on supporting contact-rich manipulation tasks, such as grasping delicate and deformable items, which are ubiquitous not only in the kitchen but also in other domains. By leveraging visual context, textual descriptions, and feedback from tactile sensors, we could learn a mapping from the visual and textual space to the amount of current required to compliantly grasp delicate objects. Finally, we are working on extending the tracking functionality of our robotic system by incorporating Deep Object Pose Estimation (DOPE) to track objects with known 3D models from the YCB dataset, without the need for markers.
