Trust through Transparency: Towards Reliable AI for All
Abstract
Seemingly performant models can break down in unexpected and uneven ways, from image classifiers failing to recognize an otter out of water to LLMs being nearly three times worse at recalling facts about Somalia than about Sweden. In this dissertation, I’ll detail interpretability techniques that scalably illuminate and efficiently intervene on discovered model deficiencies. First, I’ll present evidence, through carefully constructed benchmarks, of vision models’ pervasive reliance on spurious correlations. Then, I’ll automate these approaches, demonstrating how auxiliary models can be leveraged to organize data more efficiently, uncovering and articulating the subsets where models struggle. Finally, I’ll show how these same techniques can be applied to mitigate real-world geographic disparities and even tackle sociotechnical challenges such as artistic copyright infringement. In general, it is difficult to trust what we do not fully understand, especially when unexpected failures arise. By scalably identifying failure modes before they cause harm, we enhance transparency around model abilities and limitations, better informing when models can be trusted to work reliably for all.