Augmenting Communication in Extended Reality through Multimodal Systems for Humans and Agents



Advisor

Manocha, Dinesh

Abstract

Extended Reality (XR) and agentic systems are creating new paradigms for communication. However, these technologies introduce a critical socio-technical gap by disrupting the fundamental mechanisms humans use to establish and maintain common ground. This dissertation defines Attentional Friction as the cognitive and interactive effort required to bridge the "grounding gap" in mediated environments. This friction manifests as high cognitive load, costly attentional switching, and breakdowns in social coordination. This thesis argues that this friction can be systematically mitigated through multimodal systems that infer user attention and augment the environment with the necessary grounding cues.

We decompose Attentional Friction into three primary domains of interaction. First, we address Informational Friction, the effort required to ground with information. We demonstrate that by designing gaze-based reading interactions and a spatial, collaborative document decomposition interface, we can use gaze as a low-effort input and make task state passively visible, significantly reducing perceived cognitive load and enhancing collaborator awareness. Second, we address Interpersonal Friction, the effort of grounding with other humans when social cues are lost. We contribute systems that restore peripheral awareness for turn-taking using multimodal cues and that provide context-aware, adaptive transcriptions for re-engagement. Our findings show these systems measurably reduce the "costs of grounding," quantified through objective metrics including gaze-based social engagement, information recall, and behavioral response times. Finally, we address Agent-Mediated Friction, the effort of grounding with non-human agents. We present a framework for unobtrusive proactive agents that use multimodal context to infer the appropriate presentation of queries. We also contribute an agentic presentation system that uses speech and gesture to dynamically align a presenter's video with their content. Together, these works show that agents can be shifted from sources of interruption to attention-aware mediators, reducing perceived effort.
