Toward Entity-Centric Understanding of Long Documents

Advisor

Rudinger, Rachel

Abstract

Entities and events are the building blocks of language that give it its richness and expressiveness, be it in everyday conversations, news articles, biographies, legal contracts, or narratives. Documents such as legal contracts and narratives center on people and contain rich information about them (such as the rights and duties of the contracting parties, or the emotional states of characters in a book), the events they participate in, and their interactions with each other (such as relationships between characters in a book). This wealth of information makes it challenging for readers to comprehend these documents and find the specific information they need. Understanding such documents from an entity-centric perspective (i.e., who participates in an event, and what their attributes and relationships are), as opposed to an event-centric perspective (i.e., what happens), can improve comprehension and information extraction, enabling the development of practical applications that serve the information needs of readers. Such entity-related information can be conveyed both explicitly and implicitly, and may remain constant or change throughout the document. As a result, certain tasks can be uniquely defined over long contexts. However, limited progress has been made in long document comprehension due to the lack of annotated datasets and the challenges of handling extended contexts. This dissertation aims to improve the comprehension of long documents in the narrative and legal domains by focusing on people entities and addressing these challenges.

First, we systematically investigate the presence and accessibility of implicit script knowledge (i.e., structured commonsense knowledge that humans implicitly share in the form of prototypical sequences of events) in pretrained large language models from a protagonist’s perspective via a proposed event sequence description generation task. Based on our finding that these models have limited script knowledge, we propose a script induction framework that is shown to mitigate the main issues of omitted, irrelevant, repeated, or misordered events. Whereas scripts concern events that do happen, we next focus on a setting in which multiple entities are involved in an event that may not have happened but that must or may happen. Taking legal contracts as a test case, we collect a dataset and introduce tasks to identify contracting party-specific obligations, entitlements, prohibitions, and permissions (known as deontic modalities) in lease agreements. We show that transformer-based models trained on this dataset can accurately perform the task, demonstrating that the diverse ways of expressing such modalities in natural language are learnable from our dataset. Then, we introduce a task to generate a contracting party-specific extractive summary of the most important obligations, entitlements, and prohibitions in a contract, which can help monitor compliance and aid the contract review process. We collect a dataset of party-specific importance ordering (implicit information) among sentences belonging to various modalities in a contract, and we propose a pipeline-based summarization system to handle the data annotation and long-context modeling challenges associated with collecting contract-level summary annotations and generating such summaries.

Having designed tasks and entity-centric systems that can generate protagonist-oriented prototypical sequences of events in a scenario, and that can extract explicit and implicit static information about entities from unstructured text into a structured form in the legal domain, we then present several strategies to assess large language models’ ability to track the fine-grained, dynamic evolution of social relationships between characters in a book. We find that these models fall short in their social reasoning capabilities, as they tend to rely on surface-level cues and are sensitive to subtle changes in the context. Based on this finding, in our final work, we study the influence of character-related attributes such as gender and race on relationship predictions made from a conversation between characters, and find that these models are prone to heteronormativity biases.

Together, this thesis contributes to the growing field of long-context understanding by designing new tasks, collecting datasets, and proposing models that cover several aspects (explicit or implicit, static or evolving) of entity-related information to improve comprehension of long documents. By addressing the challenges of handling long contexts and dataset annotation, this thesis aims to foster the development of entity-centric long document understanding systems that serve the information needs of users.
