Advancing Object Understanding in Large Language Models
Abstract
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) with their ability to reason in a wide variety of real-world scenarios. LLMs also perform strongly at generating coherent natural language and communicating with humans. These abilities make an LLM the preferred tool for human-robot communication, where the LLM can take a natural language instruction and transform it into information the robot can use to plan a course of action. However, LLMs are black-box models whose decision-making processes are opaque. These models are also text-based, which limits their understanding of the physical world. In high-stakes scenarios where a robot might be deployed, such as natural disasters, relying on an LLM to direct the robot becomes a risky proposition.
A major component of a robot's ability to execute instructions is the LLM's ability to reason about objects and their affordances, or functions. The LLM must be able to understand its environment and how a person or robot can manipulate it to accomplish a goal. However, we have lacked both a framework for describing object affordances and an evaluation of how capably an LLM reasons about objects. We thus develop an Affordance Ontology that maps objects to the roles they play in events using PropBank, a semantic corpus in which words are annotated, by sense, with labels for their roles in a given event. Using the Affordance Ontology, we develop 800 sentences for 2 evaluation tasks in which an LLM fills in the blank with an object in a sentence describing an event. In task 1, the LLM chooses between objects with different affordances (PropBank labels), while in task 2 the LLM chooses between objects with the same affordances but different physical characteristics that make the incorrect choices unsuitable for the given event. We evaluate the Masked Language Model (MLM) DistilBERT and the decoder LM Ministral 8B and find that both models perform well on both tasks, but that performance degrades when the incorrect answer choices lie closer to the correct answer in embedding space.
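To make the evaluation setup concrete, the following is a minimal sketch of how a masked fill-in-the-blank item can be scored with DistilBERT, assuming each candidate object is a single token in the model's vocabulary; the sentence and candidate objects here are illustrative placeholders, not items from the actual benchmark.

```python
# Minimal sketch of scoring a fill-in-the-blank item with an MLM,
# assuming each candidate object is a single vocabulary token.
# The sentence and candidates below are illustrative, not benchmark items.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

sentence = f"She used the {tokenizer.mask_token} to cut the rope."
candidates = ["knife", "spoon", "pillow"]

inputs = tokenizer(sentence, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_index].squeeze(0)

# A higher logit means the model prefers that object for this event description.
scores = {c: logits[tokenizer.convert_tokens_to_ids(c)].item() for c in candidates}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The same comparison can be run with a decoder LM such as Ministral 8B by substituting each candidate into the blank and comparing the log-likelihoods of the completed sentences.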
Having established a strong baseline of LLM object-affordance reasoning, we tackle the problem of implementing object reasoning in LLMs for specific disasters. We first address the lack of data available for fine-tuning a model by developing PropBank Powered Data Creation (PPDC). PPDC is a pipeline for expanding our Affordance Ontology with objects pertinent to a given disaster and using the ontology to fill in templates, creating custom seed datasets. These seed datasets can then be used to few-shot prompt LLMs to generate larger, higher-quality synthetic datasets. With this data and our use case in hand, we address the need for our LLM to run locally on a robot (or another device with constrained computational resources). To do this, we fine-tune a series of smaller LLMs on our synthetic dataset, creating the Field Ready Instruction Decoding Agents (FRIDA) suite of models. We find that FRIDA models trained on the entire synthetic dataset outperform their base models. Furthermore, ablated FRIDA models trained only on the subsets of the data relating to general object reasoning outperform their counterparts trained on the complete dataset.
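The following is a minimal sketch of the template-filling idea behind a PPDC-style seed set, assuming a toy ontology; the verb senses, role names, ontology entries, and templates are hypothetical stand-ins for the actual PPDC resources.

```python
# Illustrative sketch of template filling from an affordance ontology:
# map objects to roles for each verb sense, then fill sentence templates.
# All entries below are hypothetical placeholders.
import itertools
import random

affordance_ontology = {
    "cut.01": {"instrument": ["axe", "utility knife"], "patient": ["fallen branch", "tarp"]},
    "carry.01": {"instrument": ["stretcher", "wheelbarrow"], "patient": ["sandbag", "supply crate"]},
}

templates = {
    "cut.01": "The responder used the {instrument} to cut the {patient}.",
    "carry.01": "The responder used the {instrument} to carry the {patient}.",
}

seed_examples = []
for sense, roles in affordance_ontology.items():
    for instrument, patient in itertools.product(roles["instrument"], roles["patient"]):
        seed_examples.append({
            "sense": sense,
            "text": templates[sense].format(instrument=instrument, patient=patient),
        })

random.shuffle(seed_examples)
# A handful of seed sentences can then anchor a few-shot prompt that asks an LLM
# to generate more varied, disaster-specific examples in the same format.
few_shot_prompt = "Write new sentences in the same style:\n" + "\n".join(
    ex["text"] for ex in seed_examples[:5]
)
print(few_shot_prompt)
```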
Our method for producing FRIDA models excels at general object reasoning but lags in reasoning about the specific objects required for a disaster-related task. We also still lack a comparison between the synthetic data produced by PPDC and human-authored data. Because of these issues, we turn to the proxy of fantasy role-playing to understand how different model architectures and data-insertion methods affect LLM reasoning about specific objects. We refine our synthetic data generation process and run a series of experiments examining the effects of fine-tuning and Retrieval-Augmented Generation (RAG), using a new fantasy synthetic dataset and dialogues from the LIGHT dataset \cite{LIGHT}. We find that the synthetic data is more effective for both fine-tuning and RAG, while the human-authored data can provide additional improvements to models fine-tuned for our task. Our best-performing models use a combination of RAG and fine-tuning with both data sources, and their reasoning about complex objects improves. An open challenge remains in LLM reasoning beyond standard object functionality.
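As an illustration of the retrieval step in such a RAG setup, the sketch below selects the most relevant object description for a query and prepends it to the prompt; the knowledge snippets and query are hypothetical, and TF-IDF stands in for whatever retriever the actual experiments used.

```python
# Minimal sketch of the retrieval step in a RAG pipeline: pull the most
# relevant object description into the prompt before generation.
# The knowledge snippets and query are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [
    "A grappling hook can anchor a rope to a ledge for climbing.",
    "A vial of oil makes a torch burn brighter but spills easily.",
    "A lute is used to perform songs and calm a tavern crowd.",
]
query = "Which item helps the party scale the castle wall?"

vectorizer = TfidfVectorizer().fit(knowledge + [query])
scores = cosine_similarity(
    vectorizer.transform([query]), vectorizer.transform(knowledge)
)[0]
top_passage = knowledge[scores.argmax()]

# The retrieved passage is prepended to the instruction before the model answers.
prompt = f"Context: {top_passage}\nQuestion: {query}\nAnswer:"
print(prompt)
```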