Interpreting Deep Learning Models and Unlocking New Applications With It
Abstract
In recent years, modern deep learning has made significant strides across various domains, including natural language processing, computer vision, and speech recognition. These advancements have been driven by innovations in scaling pre-training data, developing new model architectures, integrating distinct modalities (e.g., vision and language, audio and language), and employing modern engineering practices. However, despite these innovations in building better models, progress in understanding these models well enough to enhance their reliability has been relatively slow. In this thesis, we lay the groundwork for interpreting modern deep learning models (such as vision, text-to-image, and multimodal language models) by examining them through the perspectives of data and internal model components. We aim to unlock capabilities such as model editing and model steering that enhance their reliability.

First, we build on principles from robust statistics to interpret test-time predictions by identifying important training examples with higher-order influence functions. However, we find that influence functions can be fragile for large deep models, which limits their practical applicability. To address this, we develop optimization-based data selection strategies that automatically generate stress-testing sets from large vision datasets, probing the reliability of vision models in a few-shot learning framework. Overall, our investigations show that while analyzing models through the lens of data provides valuable insights for potential improvements, it does not offer a direct method for controlling and enhancing the reliability of these models.

To this end, we investigate deep models by focusing on their internal components. We develop causal mediation analysis methods to understand how knowledge is stored in text-to-image generative models such as Stable Diffusion. Based on these insights, we create novel model editing techniques that remove copyrighted styles and objects from text-to-image models with minimal weight updates, and we scale these methods to edit large open-source models such as SD-XL and DeepFloyd. As a follow-up, we introduce new causal mediation analysis methods and a richly annotated probe dataset for interpreting multimodal large language models such as LLaVa. Our approach allows us to understand how these models internally retrieve relevant knowledge for factual Visual Question Answering (VQA) tasks. Leveraging these insights, we develop a novel model editing method that can effectively introduce rare, long-tailed knowledge or correct specific failure modes in multimodal large language models.

Using similar principles, we study vision models (in particular, the ViT architecture), developing methods that interpret image representations in terms of internal components, such as attention heads, using text descriptions. We apply these interpretability insights to (i) mitigate spurious correlations, (ii) enable zero-shot segmentation, and (iii) facilitate text- or image-conditioned image retrieval. We also extend our mechanistic interpretability techniques to understand and control language models on real-world tasks such as context-augmented generation in question-answering systems (i.e., extractive QA). In particular, we find that insights from mechanistic circuits are useful for context-data attribution and for steering models toward improved context faithfulness.
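The causal mediation analyses mentioned above (for Stable Diffusion, LLaVa, and extractive QA circuits) all rest on some form of activation patching: caching an internal activation from one forward pass and substituting it into another to measure that component's indirect effect on the output. The following is a minimal, generic sketch of this intervention on a toy two-path network; the ToyNet class, its layer names, and the restoration metric are illustrative assumptions, not the models or methods used in the thesis.

```python
# Minimal activation-patching sketch (generic illustration of the causal
# intervention underlying causal mediation analysis; toy model only).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyNet(nn.Module):
    """Toy model with a direct path and a mediated path, standing in for a
    transformer component of interest (e.g., one attention head or MLP block)."""
    def __init__(self):
        super().__init__()
        self.mediator = nn.Linear(16, 16)   # component whose causal role we probe
        self.direct = nn.Linear(16, 2)      # direct path from input to output
        self.head = nn.Linear(16, 2)        # readout of the mediated path

    def forward(self, x, patch=None, cache=None):
        h = torch.relu(self.mediator(x))
        if cache is not None:
            cache["h"] = h.detach().clone()  # save the clean activation
        if patch is not None:
            h = patch                        # intervention: swap in another run's activation
        return self.head(h) + self.direct(x)

model = ToyNet()
clean_x, corrupted_x = torch.randn(1, 16), torch.randn(1, 16)

cache = {}
clean_out = model(clean_x, cache=cache)             # clean run, cache mediator activation
corrupt_out = model(corrupted_x)                    # corrupted baseline
patched_out = model(corrupted_x, patch=cache["h"])  # corrupted run + clean activation patched in

# Indirect effect: how much of the clean-vs-corrupted output gap is restored
# by patching only the mediator's activation.
restored = (patched_out - corrupt_out).norm() / (clean_out - corrupt_out).norm()
print(f"fraction of output gap restored by patching the mediator: {restored.item():.2f}")
```

In practice, the same caching-and-substitution step would be applied to a specific attention head or MLP activation inside a pre-trained model, typically via forward hooks, and aggregated over many prompt pairs to locate the components that mediate a given behavior.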
Finally, we leverage interpretability insights from multimodal models to enhance their compositionality in image-conditioned text retrieval and text-guided image generation. For vision-language models (VLMs) like CLIP, we propose a distillation method that transfers compositional knowledge from diffusion models to CLIP. For diffusion models, we introduce a lightweight fine-tuning approach that learns a linear layer on the conditioning text encoder, improving compositional generation for attribute binding. Overall, our thesis designs and adapts interpretable methods and leverages interpretable insights to uncover various capabilities in pre-trained models.
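As a minimal illustration of learning only a linear layer on top of a frozen conditioning text encoder, the PyTorch-style sketch below freezes the encoder and trains a single linear map on its output. The class name, the stand-in embedding encoder, the 768-dimensional width, and the near-identity initialization are illustrative assumptions rather than the thesis's implementation.

```python
# Minimal sketch: freeze a text encoder and train only one linear layer on its
# output (all names and dimensions here are assumptions for illustration).
import torch
import torch.nn as nn

class LinearlyAdaptedTextEncoder(nn.Module):
    """Wraps a frozen text encoder; only a single linear map on its output is trained."""
    def __init__(self, text_encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)              # keep the pre-trained encoder frozen
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.eye_(self.proj.weight)           # start at the identity ...
        nn.init.zeros_(self.proj.bias)           # ... so the original conditioning is preserved

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            h = self.text_encoder(token_ids)     # (batch, seq_len, hidden_dim)
        return self.proj(h)                      # adapted conditioning for the diffusion model

# Stand-in encoder so the sketch runs end-to-end; in practice this would be the
# diffusion model's own conditioning text encoder (e.g., a CLIP text tower).
dummy_encoder = nn.Embedding(1000, 768)
adapter = LinearlyAdaptedTextEncoder(dummy_encoder, hidden_dim=768)
tokens = torch.randint(0, 1000, (2, 77))         # (batch, seq_len) of token ids
conditioning = adapter(tokens)                   # shape: (2, 77, 768)
trainable = [n for n, p in adapter.named_parameters() if p.requires_grad]
print(trainable)                                 # only ['proj.weight', 'proj.bias'] are trained
```

Initializing the map near the identity means the diffusion model initially receives its original conditioning, so fine-tuning only has to learn a small correction to the text representation.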