Creating and Evaluating Human-grounded AI Tools for Challenging and Trustworthy Text

Advisor

Hassan, Naeemul
Boyd-Graber, Jordan

Abstract

Natural language processing (NLP) enables computers to understand and interact with human language, and it is increasingly deployed in Artificial Intelligence (AI) applications such as chatbots and voice-operated GPS systems. Although NLP models often claim super-human performance in these services, they struggle to handle the complexity and variability of real-world data. They typically lack the flexibility users expect in handling vagueness and understanding different contexts, which limits their reliability in assisting or collaborating with humans in daily life. This unreliability often stems from models being evaluated primarily on narrow metrics and benchmarks that do not capture the complexities of real-world interactions. As a result, these systems may perform well under controlled conditions but fail in open-ended, ambiguous, or dynamically shifting situations. To address this limitation, it is essential to develop robust evaluation methods and prioritize AI trustworthiness before deployment. This dissertation therefore introduces a set of human-grounded frameworks that incorporate human input—user responses, annotations, or human-created artifacts—to evaluate and enhance the robustness and trustworthiness of AI systems in real-world deployment settings.

We begin by examining human-grounded approaches that enhance the evaluation methodologies of NLP systems. First, this grounding enriches benchmark datasets with human inputs, supporting the design of benchmark examples that reflect real user queries and thereby increasing realism. Second, it allows us to measure human baseline performance on a specific task—capturing human skill level—which enables direct comparison with model performance. Third, it incorporates subjective judgments from real-world users, capturing diverse interpretations of NLP tasks. These dimensions are often overlooked in automated evaluation. Moreover, human-grounded methods can integrate user standards and values directly into the model development process as a reference point for understanding a model's intended use from the user's perspective. For example, incorporating human-designed criteria that resonate with human standards helps to identify and correct erroneous model behaviors. When models fail, these methods can promote interpretability, allowing users to understand the cause of failures and offer tailored suggestions. Importantly, as models respond to users' suggestions, this dynamic interaction forms a feedback loop: users guide the model, and the model, in turn, refines its outputs based on that user input. Rooted in user norms and values, this loop not only improves model performance but also enhances transparency and controllability by facilitating continuous adjustment between human and machine.

Building on the premise that human-grounded evaluation is essential for user-centric NLP systems, the first two chapters of this dissertation introduce pipelines and metrics for generating challenging benchmark datasets that better reflect real-world complexity. The first chapter introduces a human-in-the-loop (HITL) metric that quantifies a benchmark's adversarial robustness, that is, how consistently its examples are more difficult for models than for humans. The proposed metric goes beyond standard accuracy or F1 scores by incorporating measures of human difficulty, example ambiguity, and response diversity, thereby capturing aspects of task realism and user perception that conventional benchmarks often overlook. In this measurement process, we account for varying skill levels across expert humans and models while ensuring that benchmark examples are well posed. The metric offers a practical way to track benchmark robustness over time. The second chapter introduces another pipeline for creating challenging artifacts that capture natural adversarialness, directly reflecting real-world tasks that are inherently difficult and subjective for models, and even for humans. We design an annotation scheme that effectively elicits real-world users' subjective judgments as labels for training and evaluation. Results from contemporary models reveal a critical gap between current model capabilities and real-world performance demands.
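
As a rough illustration of the intuition behind such a human-grounded robustness measure, the sketch below scores each benchmark example by the gap between human and model accuracy and averages that gap over well-posed items. It is a minimal toy, not the dissertation's metric: the data layout, function names, and the rule of skipping items that no human answers correctly are assumptions for illustration, and the actual HITL metric additionally models annotator skill, example ambiguity, and response diversity.

from statistics import mean

# Toy sketch (not the dissertation's metric): per-example "robustness" as the
# gap between human and model accuracy, averaged over well-posed examples.
def example_gap(human_correct, model_correct):
    # Positive values mean the example is harder for models than for humans.
    return mean(human_correct) - mean(model_correct)

def benchmark_robustness(examples):
    gaps = [
        example_gap(ex["human_correct"], ex["model_correct"])
        for ex in examples
        if any(ex["human_correct"])  # drop ill-posed items no human solves
    ]
    return mean(gaps) if gaps else 0.0

# Hypothetical correctness records for three annotators and three models per item.
benchmark = [
    {"human_correct": [True, True, False], "model_correct": [False, False, True]},
    {"human_correct": [True, True, True], "model_correct": [True, False, True]},
]
print(f"benchmark robustness: {benchmark_robustness(benchmark):+.2f}")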

Expanding beyond robustness evaluation via challenging benchmarks, the final two chapters turn to evaluating how trustworthy models are, focusing on model calibration and interpretability. The third chapter tackles human mistrust in AI models by evaluating how well NLP models are calibrated compared to humans, using confidence responses from both humans and models. We propose a HITL benchmark-creation pipeline and a metric designed to evaluate models' correctness and confidence while accounting for human performance. In contrast to humans, models are generally more overconfident when they are incorrect. Finally, the fourth chapter introduces a user-grounded evaluation framework for multi-agent systems. It enables a granular, user-informed assessment based on user-defined standards, thereby enhancing the transparency and diagnostic clarity of agent behaviors. The framework makes agent failures interpretable and provides actionable feedback to users, encouraging more controllable and trustworthy interaction.
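
To make the calibration comparison concrete, the following minimal sketch contrasts human and model overconfidence under the assumption that each record pairs a self-reported confidence in [0, 1] with a correctness flag. The record values, variable names, and the simple "mean confidence when wrong" proxy are illustrative assumptions; they do not reproduce the chapter's actual pipeline, metric, or data.

from statistics import mean

# Toy sketch: mean self-reported confidence restricted to incorrect answers,
# used here as a crude proxy for overconfidence (higher = more overconfident).
def confidence_when_wrong(records):
    wrong = [conf for conf, correct in records if not correct]
    return mean(wrong) if wrong else 0.0

# Hypothetical (confidence, correct) records; not real data from the study.
human_records = [(0.9, True), (0.4, False), (0.3, False), (0.8, True)]
model_records = [(0.95, True), (0.90, False), (0.85, False), (0.99, True)]

print(f"human confidence when wrong: {confidence_when_wrong(human_records):.2f}")
print(f"model confidence when wrong: {confidence_when_wrong(model_records):.2f}")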

In sum, these works take an integrated approach to the human-grounded evaluation and development of NLP systems: they center on creating challenging, naturally adversarial datasets and propose user-informed metrics and methods to measure model limitations. Together, they contribute to more robust, trustworthy, and interpretable language technologies, which can ultimately lead to better alignment with human needs.
