Translating Natural Language to Visually Grounded Verifiable Plans

dc.contributor.advisor: Aloimonos, Yiannis
dc.contributor.author: Mavrogiannis, Angelos
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2025-09-15T05:31:47Z
dc.date.issued: 2025
dc.description.abstract: To be useful in household environments, robots may need to understand natural language in order to parse and execute verbal commands from novice users. This is a challenging problem that requires mapping linguistic constituents to physical entities while, at the same time, orchestrating an action plan that utilizes these entities to complete a task. Planning problems that previously relied on querying manually crafted knowledge bases can now leverage Large Language Models (LLMs) as a source of commonsense reasoning to map high-level instructions to action plans. However, the produced plans often suffer from model hallucinations, ignore action preconditions, or omit essential intermediate actions under the assumption that users can infer them from context and prior experience. In this thesis, we present our work on translating natural language instructions to visually grounded verifiable plans.

First, we motivate the use of classical formalisms such as Linear Temporal Logic (LTL) to verify LLM-generated action plans. Our key insight is that combining a source of cooking domain knowledge with a formalism that captures the temporal richness of cooking recipes could enable the extraction of unambiguous, robot-executable plans. Building on this insight, we present Cook2LTL, a system that receives a cooking recipe in natural language form, reduces high-level cooking actions to robot-executable primitive actions through the use of LLMs, and produces unambiguous task specifications written as LTL formulae. By expressing action plans in a formal language with well-defined syntax and semantics, we can generate discrete robot controllers with provable performance guarantees.

Second, we focus on grounding linguistic instructions in visual sensory information, and we find that Vision Language Models (VLMs) often struggle to identify non-visual attributes. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a Perception-Action API that consists of perceptual and motoric functions. When prompted with this API and a natural language query, an LLM generates a program that actively identifies attributes given an input image.

Third, we present NL2PDDL2Prog, a system that incorporates the Planning Domain Definition Language (PDDL) as an action representation in order to combine the ability of LLMs to decompose a high-level task into a set of actions with the correctness of symbolic planning. Prior work has often relied on manually crafting PDDL domains, which can be a difficult and tedious process, especially for non-experts. To circumvent this, we obtain visual observations before and after the execution of an admissible action in our environment. We pass them to a VLM to derive the action semantics, which are then sent to an LLM to infer the entire domain. Given the generated domain and an initial visual observation of the scene, the LLM can produce a PDDL problem description that is then solved by a symbolic planner and parsed into an executable Python program. By binding the perceptual functions to action preconditions and effects explicitly modeled in the PDDL domain, we visually validate successful action execution at runtime, producing visually grounded verifiable action plans.
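As a minimal sketch of the precondition-and-effect binding described above, written under assumed names rather than the actual NL2PDDL2Prog implementation, an executable program derived from a PDDL plan could wrap each primitive action with perceptual checks roughly as follows (in Python; capture_image and sees are hypothetical placeholders for the perceptual functions):

# Illustrative sketch only: stand-in perceptual functions and a wrapper that
# gates each action on its PDDL precondition and verifies its effect.
from dataclasses import dataclass
from typing import Callable

def capture_image():
    # Placeholder for grabbing the current camera frame.
    raise NotImplementedError

def sees(image, predicate: str) -> bool:
    # Placeholder for a perceptual query, e.g. sees(img, "on(patty, bottom_bun)").
    raise NotImplementedError

@dataclass
class GroundedAction:
    name: str                     # e.g. "place(patty, bottom_bun)"
    precondition: str             # PDDL precondition, e.g. "holding(patty)"
    effect: str                   # PDDL effect, e.g. "on(patty, bottom_bun)"
    execute: Callable[[], None]   # motoric routine bound to this action

def run_step(action: GroundedAction) -> bool:
    """Executes the action only if its precondition is visually satisfied,
    then checks its effect, making the step verifiable at runtime."""
    if not sees(capture_image(), action.precondition):
        return False
    action.execute()
    return sees(capture_image(), action.effect)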
To demonstrate the applicability of our work in the real world, we design a ROS-powered robotic system capable of receiving natural language instructions and executing simple cooking recipes on a kitchen counter. We begin by bootstrapping a proof-of-concept system in which each object carries an ArUco marker to facilitate tracking. At runtime, our system receives a natural language instruction, calls Cook2LTL or NL2PDDL2Prog, and passes the produced action plan to a Python Pick-and-Place API that we developed for recipe execution on a Sawyer robot. We include demonstrations of experiments on a simple burger-making recipe using artificial food items in the laboratory.

To conclude, we discuss ongoing and future work on improving our existing systems. We plan to incorporate object affordances into the safeguarding formalisms we have used to verify LLM plans. This can be achieved by introducing a more fine-grained action representation that supports lower-level primitive actions and produces affordance-aware policies. We also aim to support contact-rich manipulation tasks, such as grasping delicate and deformable items, which are ubiquitous not only in the kitchen but in other domains as well. By leveraging visual context, textual descriptions, and feedback from tactile sensors, we could learn a mapping from the visual and textual space to the amount of current required to compliantly grasp delicate objects. Finally, we are working on extending the tracking functionality of our robotic system by incorporating Deep Object Pose Estimation (DOPE) to track objects with known 3D models from the YCB dataset, without the need for markers.
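As an illustration of how a plan produced by Cook2LTL or NL2PDDL2Prog could be handed to a pick-and-place layer, here is a self-contained Python sketch; the primitive names, plan format, and print placeholders are assumptions for exposition, not the thesis's actual Pick-and-Place API:

# Hypothetical dispatch loop: a plan is a list of (primitive, arguments) pairs
# mapped onto stub pick/place routines standing in for Sawyer motions.
from typing import Callable, Dict, List, Tuple

def pick(obj: str) -> None:
    print(f"pick {obj}")                 # placeholder for a Sawyer pick motion

def place(obj: str, target: str) -> None:
    print(f"place {obj} on {target}")    # placeholder for a Sawyer place motion

PRIMITIVES: Dict[str, Callable[..., None]] = {"pick": pick, "place": place}

def execute_plan(plan: List[Tuple[str, Tuple[str, ...]]]) -> None:
    """Runs each primitive action in order; unknown primitives raise a KeyError."""
    for action, args in plan:
        PRIMITIVES[action](*args)

# Example: a burger-assembly plan of the kind produced upstream.
execute_plan([
    ("pick", ("bottom_bun",)), ("place", ("bottom_bun", "plate")),
    ("pick", ("patty",)),      ("place", ("patty", "bottom_bun")),
    ("pick", ("top_bun",)),    ("place", ("top_bun", "patty")),
])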
dc.identifier: https://doi.org/10.13016/grfc-5iec
dc.identifier.uri: http://hdl.handle.net/1903/34621
dc.language.iso: en
dc.subject.pqcontrolled: Robotics
dc.subject.pqcontrolled: Artificial intelligence
dc.subject.pqcontrolled: Linguistics
dc.subject.pquncontrolled: Large Language Models
dc.subject.pquncontrolled: Linear Temporal Logic
dc.subject.pquncontrolled: Perception-Action Programs
dc.subject.pquncontrolled: Planning Domain Definition Language
dc.subject.pquncontrolled: Robotics
dc.subject.pquncontrolled: Vision Language Models
dc.title: Translating Natural Language to Visually Grounded Verifiable Plans
dc.type: Dissertation

Files

Original bundle

Name: Mavrogiannis_umd_0117E_25440.pdf
Size: 160.37 MB
Format: Adobe Portable Document Format