dc.contributor.advisor	Manocha, Dinesh	en_US
dc.contributor.author	Mathur, Puneet	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.description.abstract	Documents play an increasingly central role in human communication and workplace productivity. Every day, billions of documents are created, consumed, collaborated on, and edited. However, most such interactions remain manual or, at best, semi-automated by rules. Learning from semi-structured and unstructured documents is a crucial step in designing intelligent systems that can understand, interpret, and extract the information contained in digital PDFs, forms, receipts, contracts, infographics, etc. Our work addresses three major problems in information extraction from real-world multimodal (text + images + layout) documents: (1) multi-hop reasoning between concepts and entities spanning several paragraphs; (2) semi-structured layout extraction in documents consisting of thousands of text tokens and embedded images arranged in specific layouts; (3) hierarchical document representations and the need to transcend content lengths beyond a fixed window for effective semantic reasoning. Our research broadly binds together the semantic (document-level information extraction) and structural (document image analysis) aspects of document intelligence to advance user productivity. The first part of the research addresses information extraction from characteristically long documents that consist of multiple paragraphs and require long-range contextualization. We propose augmenting Transformer-based methods with graph neural networks to capture local context as well as long-range global information for document-level information extraction tasks. In this vein, we first solve document-level temporal relation extraction by leveraging rhetorical discourse features, temporal arguments, and syntactic features through a Gated Relational-GCN model that extends the Transformer architecture to discourse-level modeling.
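The abstract does not spell out the Gated Relational-GCN update, so the following is only a minimal sketch of the general idea under assumed details (per-relation weight matrices, row-normalized typed adjacencies, and a sigmoid gate blending the graph update with the input Transformer features); all function and variable names are illustrative, not the thesis's actual model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_rgcn_layer(H, adj_by_rel, W_rel, W_self, W_gate):
    """One gated relational-GCN layer (illustrative sketch).

    H          : (n, d) node features, e.g. Transformer token/entity embeddings
    adj_by_rel : dict rel -> (n, n) row-normalized adjacency for that relation
                 type (e.g. discourse, temporal-argument, syntactic edges)
    W_rel      : dict rel -> (d, d) weight matrix per relation type
    W_self     : (d, d) self-loop weight
    W_gate     : (d, d) gate weight
    """
    msg = H @ W_self                       # self-loop term
    for rel, A in adj_by_rel.items():
        msg = msg + A @ H @ W_rel[rel]     # aggregate typed-neighbor messages
    update = np.tanh(msg)                  # candidate graph-contextualized state
    gate = sigmoid(H @ W_gate)             # elementwise gate in (0, 1)
    return gate * update + (1.0 - gate) * H  # blend update with input features
```

The gate lets each dimension fall back to the original Transformer features when the graph signal is uninformative.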
Next, we propose DocTime, a novel temporal dependency graph parsing method that utilizes structural, syntactic, and semantic relations to learn dependency structures over time expressions and event entities in text documents, capturing long-range interdependencies. We also show how these temporal dependency graphs can be incorporated into the self-attention layer of Transformer models to improve the downstream tasks of temporal question answering and temporal NLI. Finally, we present DocInfer, a novel end-to-end document-level natural language inference model that builds a hierarchical document graph, performs paragraph pruning, and optimally selects evidence sentences to identify the most important context sentences for a given hypothesis. Our evidence selection mechanism allows DocInfer to transcend the input length limitation of modern BERT-like Transformer models while presenting the entire evidence together for inferential reasoning, helping it reason over large documents where the evidence may be fragmented and located arbitrarily far apart. The second part of the research covers novel approaches for understanding, manipulating, and applying spatial structures extracted from digital documents. We first propose LayerDoc, which extracts the hierarchical layout structure of visually rich documents by leveraging visual features, textual semantics, and spatial coordinates along with constraint inference in a bottom-up, layer-wise fashion. Next, we propose DocEditor, a Transformer-based, localization-aware multimodal (textual, spatial, and visual) model that performs the novel task of language-guided document editing based on user text prompts. Further, we investigate methods for building text-to-speech systems for semi-structured documents.
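The mechanism by which the temporal dependency graphs enter the self-attention layer is not specified in this abstract; one common way to inject a graph into a Transformer is an additive bias on the attention logits for graph-connected token pairs. The sketch below uses that generic approach with illustrative names and parameters (it is not DocTime itself):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_biased_attention(Q, K, V, adj, bias=2.0):
    """Scaled dot-product self-attention with an additive edge bias.

    Q, K, V : (n, d) query/key/value matrices for one attention head
    adj     : (n, n) 0/1 adjacency of the dependency graph
    bias    : logit boost for positions connected in the graph
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # standard attention logits
    scores = scores + bias * adj    # favor graph-connected pairs
    attn = softmax(scores, axis=-1) # each row is a distribution over keys
    return attn @ V, attn
```

With a large bias this degenerates into a hard graph mask; with a small one the graph only nudges the learned attention pattern.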
Finally, we explore two applications of long-context document-level reasoning: (i) user-personalized speech recognition systems for improved next-word prediction in specific domains, utilizing retrieval augmentation techniques for ASR language models; (ii) Transformer-based methods that utilize multimodal information from long-form financial conference calls (document-level transcripts, audio-visual recordings, and tabular information) for improved financial time series prediction.	en_US
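As a toy illustration of retrieval augmentation for next-word prediction (a generic sketch, not the personalized ASR system described above): retrieve the user documents most similar to the current context by word overlap, then linearly interpolate their bigram statistics with a background model. The retriever, the interpolation weight, and all names are assumptions for this example:

```python
from collections import Counter

def retrieve(context_words, user_docs, k=2):
    """Rank user documents by word overlap with the context (toy retriever)."""
    ctx = set(context_words)
    scored = sorted(user_docs, key=lambda d: len(ctx & set(d.split())), reverse=True)
    return scored[:k]

def next_word_probs(prev_word, background, retrieved_docs, lam=0.5):
    """Interpolate background bigram counts with counts from retrieved docs.

    background : Counter mapping (w1, w2) bigrams to counts
    lam        : weight on the retrieved (personalized) statistics
    """
    def bigram_dist(counts):
        follows = Counter({w2: c for (w1, w2), c in counts.items() if w1 == prev_word})
        total = sum(follows.values())
        return {w: c / total for w, c in follows.items()} if total else {}
    cache = Counter()
    for doc in retrieved_docs:
        toks = doc.split()
        cache.update(zip(toks, toks[1:]))     # bigrams from retrieved text
    bg, ch = bigram_dist(background), bigram_dist(cache)
    vocab = set(bg) | set(ch)
    return {w: lam * ch.get(w, 0.0) + (1 - lam) * bg.get(w, 0.0) for w in vocab}
```

A production ASR system would instead rescore hypotheses with a neural LM, but the interpolation-with-retrieved-context pattern is the same.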
dc.subject.pqcontrolled	Artificial intelligence	en_US
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pquncontrolled	document intelligence	en_US
dc.subject.pquncontrolled	information processing	en_US
dc.subject.pquncontrolled	multimodal AI	en_US
dc.subject.pquncontrolled	natural language processing	en_US
dc.subject.pquncontrolled	visual-linguistic deep learning	en_US

