ENHANCING ACCESS TO CULTURAL ARCHIVES THROUGH DATA SCIENCE, GENERATIVE AI, AND KNOWLEDGE GRAPHS
Abstract
Cultural archives hold invaluable historical records, yet outdated cataloging methods and access barriers limit their usability. My dissertation addresses these challenges by integrating data science, generative AI, and knowledge graphs to enhance engagement with Maryland’s Legacy of Slavery (LoS) project collections. My two-pronged strategy focuses on (1) developing computational tools that empower researchers and students to access archival datasets and analyze them in detail independently and (2) leveraging generative AI and knowledge graphs to improve accessibility and contextual analysis. This dissertation synthesizes five interconnected studies conducted between 2021 and the present, structured under three research objectives. Research Objective I supports the first prong, while Objectives II and III collectively support the second.
Research Objective I – Empowering Archival Practitioners through Data Science and Computational Thinking: In Study 1 (published, 2021), I used a mixed-methods exploratory case study approach to develop interactive Digital Computational Notebooks (iDCNs) as educational tools for archival studies, built around a step-by-step, data-science-based analysis of one of the LoS datasets. Designed around a well-established computational thinking (CT) framework, iDCNs integrate Python scripts, narrative explanations, and visualizations to guide students and archivists in cleaning, analyzing, and interpreting LoS datasets. An IRB-approved user survey of students and educators indicated that iDCNs enhance technical proficiency and critical thinking, making archival data more accessible for independent analysis.
Research Objective II – Designing and Evaluating Generative AI Solutions for Enhanced Access: Addressing the second prong of my strategy, in Study 2 (peer-reviewed, 2023; to be published, 2025), I employed design science and an exploratory case study approach to develop ChatLoS, a chatbot powered by Retrieval-Augmented Generation (RAG) and OpenAI’s GPT models that allows users to query one of the LoS datasets in natural language. By chunking text for retrieval, ChatLoS preserved contextual relevance and eliminated the need for users to understand data schemas. While it significantly improved accessibility, its reliance on RAG introduced limitations, including ethical concerns, bias, data privacy risks, and constraints that restricted the chatbot’s ability to handle complex, multi-step analytical tasks beyond targeted searches. To address these, in Study 3 (published, 2024), I used a comparative empirical design to explore a generative AI agent that enables dynamic, complex data analysis beyond targeted semantic search retrieval. Findings revealed that while RAG-based retrieval is optimal for targeted semantic search with explainability, an agentic AI approach better supports exploratory analysis, trend identification, and multi-step reasoning, also with explainability. However, both studies raised concerns about limited scalability to more extensive archival collections, inaccuracies, inconsistencies, and reliance on proprietary GPT models, including the associated ethical risks and data-control concerns.
Research Objective III – Evaluating and Optimizing Generative AI for Ethical and Scalable Archival Access: To address the issues identified in Objective II, the first study in this objective focused on completing the development of a Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) enhanced ChatLoS. This system integrates multiple LoS datasets (Certificates of Freedom, Domestic Traffic Advertisements, and Manumissions) into a unified knowledge graph that supports explainable, multi-hop reasoning grounded in archival provenance. This ChatLoS version was further enhanced to promote transparency by providing source links to scanned archival documents and surfacing internal query logs in a human-readable format. This design enables users to understand how answers are generated and to verify the archival trail. Following this, two systematic user evaluations were conducted using an evaluation rubric grounded in the Activity Theory framework. These evaluations compared the current LoS access systems to the three evolving versions of ChatLoS, revealing how each generative-AI-enhanced iteration progressively mediated user interaction and resolved long-standing usability and interpretability contradictions. This study demonstrated how a KG-RAG enhanced generative AI system can resolve key contradictions in existing access methods by aligning tool behavior with archival principles such as context, trust, and traceability. The evaluations also surfaced new contradictions, which point to opportunities for future work. The second part of this objective evaluated four leading LLMs (GPT-4o, Claude, Llama, and Gemini) across criteria such as security, accuracy, guardrail customization, and multi-user deployment capabilities. Findings identified enterprise-grade models such as Azure OpenAI GPT-4o as more appropriate for sensitive archival applications.
The dissertation concludes with a synthesis of findings that integrates the results of the three research objectives. It also highlights promising directions for future work, including participatory community studies and interface enhancements to ensure equitable, transparent, and culturally sensitive access to digital archives.
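As a rough illustration of the multi-hop, provenance-aware lookup described under Research Objective III, the following minimal sketch walks a tiny in-memory graph and records the source document behind each hop. It is not the ChatLoS implementation: the triple layout, the predicate names (`named_in`, `issued_in`), the function names, and every identifier in the data are hypothetical placeholders, not real LoS records or schema.

```python
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object, source_document).
# Every identifier here is a placeholder, not a real archival record.
TRIPLES = [
    ("person:A", "named_in", "record:cof-001", "scan:cof-001"),
    ("record:cof-001", "issued_in", "county:X", "scan:cof-001"),
    ("person:B", "named_in", "record:man-002", "scan:man-002"),
    ("record:man-002", "issued_in", "county:Y", "scan:man-002"),
]


def build_index(triples):
    """Index triples by (subject, predicate) for fast edge lookup."""
    index = defaultdict(list)
    for subject, predicate, obj, source in triples:
        index[(subject, predicate)].append((obj, source))
    return index


def follow_path(index, start, predicates):
    """Follow a sequence of predicates from `start`.

    Returns a list of (end_node, trail) pairs, where `trail` lists each
    hop together with the source document that supports it.
    """
    frontier = [(start, [])]
    for predicate in predicates:
        next_frontier = []
        for node, trail in frontier:
            for obj, source in index.get((node, predicate), []):
                step = {"from": node, "via": predicate, "to": obj, "source": source}
                next_frontier.append((obj, trail + [step]))
        frontier = next_frontier
    return frontier


INDEX = build_index(TRIPLES)

# Two-hop question: in which county was the record naming person:A issued?
results = follow_path(INDEX, "person:A", ["named_in", "issued_in"])
for answer, trail in results:
    print(answer)
    for step in trail:
        print(f"  {step['from']} -{step['via']}-> {step['to']} [{step['source']}]")
```

Returning the trail alongside the answer is the point of the sketch: each hop carries a pointer to a source document, which is the kind of information a system would need in order to show users how an answer was reached and let them verify it against the archival record.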