The exploration of deep models has been a persistent focus of research. Model inversion and feature visualization stand out as significant methods for deciphering the workings of deep models: they play a vital role in unraveling the inner workings of neural architectures, understanding the knowledge they acquire, and elucidating their behaviors. This dissertation comprises three chapters dedicated to investigating deep models, particularly newly emerging ones such as Vision Transformers (ViTs) and CLIP, using model inversion and feature visualization techniques.

In the first chapter, we introduce Plug-In Inversion, a model inversion method that relies on a set of augmentations rather than regularizers, sidestepping the need for extensive hyperparameter tuning. We demonstrate the efficacy of our approach by applying it to invert Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Multi-Layer Perceptrons (MLPs). The contributions of this work are summarized as follows: (I) We provide a detailed analysis of various augmentations and how they affect the quality of images produced via class inversion. (II) We introduce Plug-In Inversion (PII), a new class inversion technique based on these augmentations, and compare it to existing techniques. (III) We apply PII to dozens of pre-trained models of varying architecture, justifying our claim that it can be ‘plugged in’ to most networks without modification. (IV) In particular, we show that PII succeeds in inverting ViTs and large MLP-based architectures, which, to our knowledge, has not previously been accomplished. (V) Finally, we explore the potential for combining PII with prior methods.

In the second chapter, building on PII and the techniques it introduces, we utilize model inversion to examine CLIP models. Unlike traditional classification models that are restricted to a predefined set of classes, CLIP models are free of that restriction, and we can apply model inversion to any choice of prompt. Our contributions in this chapter are as follows: (I) In recent years, generative models have shown the capability to blend concepts. We demonstrate that the same holds for CLIP models: the knowledge embedded inside CLIP models can blend concepts. (II) We demonstrate that, through inversion, seemingly harmless prompts, such as celebrity names, can produce NSFW images. This is particularly true for female celebrities, whom the CLIP model seems to strongly associate with sexual content. Certain identities, such as “Dakota Johnson,” are close to many NSFW words in the embedding space. This is problematic because CLIP embeddings are used in many text-to-image generative models, and addressing the issue requires more meticulous data curation during the training of large-scale models. (III) Through inversions of prompts related to professions and status, we demonstrate that CLIP models encode gender bias. (IV) We investigate the effect of training-data scale on the quality of the inversions and show that more training data leads to better inversions.
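As a rough illustration of how such prompt-conditioned inversion can be set up, the following is a minimal sketch assuming PyTorch and OpenAI's clip package; the prompt, step count, learning rate, and the simple roll-based jitter are illustrative placeholders rather than the exact recipe used in this chapter.

```python
# Minimal sketch of prompt-conditioned CLIP inversion (illustrative, not the
# dissertation's exact procedure): optimize pixels so the image embedding
# matches the text embedding of a chosen prompt, with a light augmentation.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()  # keep float32 so pixel gradients flow cleanly
for p in model.parameters():
    p.requires_grad_(False)

prompt = "a photo of a dog on the beach"  # any free-form prompt
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Optimize the image directly in pixel space.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(500):
    optimizer.zero_grad()
    # Light jitter augmentation: random spatial roll of the current image.
    dx, dy = torch.randint(-8, 9, (2,)).tolist()
    augmented = torch.roll(image.clamp(0, 1), shifts=(dx, dy), dims=(2, 3))
    img_emb = model.encode_image((augmented - mean) / std)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity with the prompt
    loss.backward()
    optimizer.step()
```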
In the third chapter, we delve into interpreting Vision Transformers using feature visualization. While feature visualizations have provided valuable insights into the workings of Convolutional Neural Networks (CNNs), these methods have struggled to interpret ViT representations due to their inherent complexity. Nevertheless, we demonstrate that, when applied to the appropriate representations, feature visualizations can succeed with ViTs. This newfound understanding enables us to delve visually into ViTs and the information they extract from images. The contributions of this chapter are as follows: (I) We observe that uninterpretable and adversarial behavior occurs when standard feature visualization methods are applied to the relatively low-dimensional components of transformer-based models, such as keys, queries, or values; applying these tools to the relatively high-dimensional features of the position-wise feed-forward layer instead yields successful and informative visualizations. We conduct large-scale visualizations on a wide range of transformer-based vision models, including ViTs, DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method. (II) We show that patch-wise image activation patterns for ViT features essentially behave like saliency maps, highlighting the regions of the image a given feature attends to. This behavior persists even in relatively deep layers, showing that the model preserves the positional relationship between patches instead of using them as global information stores. (III) We compare the behavior of ViTs and CNNs, finding that ViTs make better use of background information and rely less on high-frequency, textural attributes. Both types of networks build progressively more complex representations in deeper layers and eventually contain features responsible for detecting distinct objects. (IV) We investigate the effect of natural-language supervision with CLIP on the types of features extracted by ViTs. We find that CLIP-trained models include various features clearly geared toward detecting components of images corresponding to caption text, such as prepositions, adjectives, and conceptual categories.
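To illustrate the kind of optimization behind these feed-forward feature visualizations, the following is a minimal sketch assuming PyTorch and the timm library; the chosen block index, feature index, and jitter augmentation are hypothetical placeholders, not the settings used in this chapter.

```python
# Minimal sketch of feature visualization on a ViT feed-forward feature
# (illustrative, not the dissertation's exact procedure): maximize one hidden
# unit of a block's MLP, averaged over patch tokens, by gradient ascent.
import torch
import timm  # pip install timm

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_base_patch16_224", pretrained=True).to(device).eval()
for p in model.parameters():
    p.requires_grad_(False)

block_idx, feature_idx = 6, 100  # hypothetical layer / feature choices
activations = {}

def hook(module, inputs, output):
    # Output of mlp.fc1 has shape (batch, tokens, hidden_dim).
    activations["feat"] = output

handle = model.blocks[block_idx].mlp.fc1.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    # Light jitter augmentation: random spatial roll before the forward pass.
    dx, dy = torch.randint(-8, 9, (2,)).tolist()
    augmented = torch.roll(image.clamp(0, 1), shifts=(dx, dy), dims=(2, 3))
    model((augmented - 0.5) / 0.5)  # roughly match the model's expected normalization
    # Maximize the chosen hidden unit, averaged over patch tokens (skip the CLS token).
    loss = -activations["feat"][0, 1:, feature_idx].mean()
    loss.backward()
    optimizer.step()

handle.remove()
```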