Bridging the Gulf of Evaluation in Human-AI Interaction for Knowledge Workers

Advisor

Liu, Zhicheng

Abstract

For over half a century, user interfaces have served as the primary medium through which humans interact with software systems. To describe this interaction, researchers~\cite{hutchins1985direct, norman1986user} introduced a seven-stage action model encompassing goal formation, intention, action specification, execution, perception of the system state, interpretation, and evaluation. Central to this model are two critical challenges: the Gulf of Execution and the Gulf of Evaluation. The Gulf of Execution is the gap between a user’s goal and the means the system provides to achieve it, while the Gulf of Evaluation is the gap between the state the system presents and the user’s ability to perceive and interpret that state in terms of their goals.

The emergence of AI-powered interfaces has reshaped this interaction landscape. Unlike traditional deterministic systems, AI-powered interfaces often exhibit dynamic and unpredictable behaviors~\cite{amershi2019guidelines}, prompting a re-examination of these Gulfs. On the one hand, the Gulf of Execution has narrowed: with applications powered by large language models (LLMs), users can articulate goals in natural language rather than manually navigating complex menus~\cite{jiang2022discovering, wu2022ai}. On the other hand, the Gulf of Evaluation has widened: AI-generated outputs can be inaccurate or untrustworthy. For example, object detectors may misclassify pedestrians on the road~\cite{hoiem2012diagnosing, simhambhatla2019self}, and LLMs produce both intrinsic hallucinations (contradicted by the source) and extrinsic hallucinations (not supported by the source)~\cite{ji2023survey, liu2023trustworthy}.

The burden of evaluation largely falls on human knowledge workers, who apply knowledge to non-routine problem solving and to developing products and services~\cite{janz1997knowledge}. Recent studies of knowledge workers show wide adoption of AI and a common practice of evaluating AI-generated results before use~\cite{woodruff2024knowledge, mckinseySurvey}. It is therefore important to develop human-centered AI systems that bridge the Gulf of Evaluation for knowledge workers. Addressing the two dimensions above, inaccuracies and lack of trust, requires re-imagining human-centered AI systems beyond chatbots and designing novel human-AI interactions. The overarching research question of this thesis is thus: \textit{How do we design human-centered AI systems to bridge the Gulf of Evaluation in human-AI interaction for knowledge workers?}

The thesis addresses this research question by introducing techniques that reduce inaccuracies and foster trust for representative knowledge workers. More specifically, to reduce inaccuracies, we developed \textsc{TutoAI}~\cite{chen2024tutoai}, a cross-domain framework for AI-assisted creation of mixed-media tutorials on physical tasks. We present an approach to identifying, assembling, and evaluating AI models for creating mixed-media tutorials from instructional videos, along with an interface for creators to refine the AI-generated components. To enhance trust, we developed \textsc{COALA}~\cite{chen2025comparing}, an AI-assisted visual analytics tool for a multilingual collaborative writing dataset. We contribute several interpretable techniques, including interactive clustering, textual pattern explanations, and dedicated data visualizations, to foster trust among communication researchers. Finally, to thoroughly evaluate machine learning models and build trust before they are deployed in high-risk applications, we developed \textsc{Safeguard AI} for safety experts: a visual analytics tool powered by AI agents that reveals model inaccuracies and ensures regulatory compliance. Collectively, these systems demonstrate how human-centered techniques can bridge the Gulf of Evaluation for diverse knowledge workers.
