AUGMENTING THE ARCHIVIST: TAKING LIBRARIES AND ARCHIVES INTO THE FUTURE WITH ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Abstract
The digital tsunami is upon us: digitization projects are no longer small-scale boutique undertakings, and the rapid approach of primarily born-digital archiving necessitates a fundamental shift in the research questions, tools, and processes of LAM (Libraries, Archives, and Museums) professionals. Machine Learning (ML) and Artificial Intelligence (AI) will be a key component of the answer. The urgency of investigating this space is reflected in a number of recently launched research initiatives in academia, such as InterPARES Trust AI (an international research project, running 2021-26, that aims to design, develop, and leverage AI to support the ongoing availability and accessibility of trustworthy public records); in emerging research monographs such as Archives, Access and Artificial Intelligence (Lise Jaillant (Ed.), Bielefeld University Press, May 27, 2022); and in cultural institutions themselves, such as the National Archives and Records Administration (NARA), where research into self-describing records has been suggested in order to explore how to automatically produce descriptive metadata for public access with minimal archivist intervention, and where, for the first time, big cultural archival datasets have been shared with the public to enable AI/ML explorations.
This dissertation is an important contribution to these pressing emerging challenges and to the need to develop a new research agenda based on cutting-edge crossover work at the intersection of archival science and data science. We develop real-world case studies whose contexts typify a field on a precipice and that allow us to advance solutions in this new research space. The dissertation is organized around three published papers. The first study comprises a review of the literature in the Computational Archival Science research space and offers important research contributions by characterizing fundamental changes observed in the field as archives are only beginning to take deposits from the digital workforce. The analysis uses data science and Large Language Model (LLM) methods. The second study, based on a library collection from OCLC, explores the case of a massive consolidated catalog with challenges related to the cleaning of multilingual MARC records at the million-records scale. We develop and prototype two approaches: an innovative ML model that predicts the likelihood of errors in new ingests, and an algorithm that produces relevant statistical language probabilities to identify and repair transliteration errors. The effectiveness of these two methodologies is evaluated against established metrics. The third study, based on an archive collection from Spelman College, explores the case of a small university archive with a highly engaged alumni community pushing for greater access to and details about the archives’ photograph collections. We investigate the potential development of automated workflows to convert Dublin Core metadata to Archival Linked Data. We further assess the potential effectiveness of these types of automation pipelines and suggest new approaches for metadata extraction and linking of cultural resources.
These papers, each grounded in its specific context, together suggest a new research agenda that addresses the need for libraries and archives to adapt to the massive shifts in scale, speed, accuracy, diversity, connectivity, rapid engagement, and co-creation demanded by the challenges of the digital age.