Interpreting Visual Representations and Mitigating their Failures

dc.contributor.advisorFeizi, Soheilen_US
dc.contributor.authorKalibhat, Nehaen_US
dc.contributor.departmentComputer Scienceen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2025-01-25T06:52:09Z
dc.date.available2025-01-25T06:52:09Z
dc.date.issued2024en_US
dc.description.abstractDeep learning has become the cornerstone of artificial intelligence (AI), particularly in language and computer vision domains. The progression in this field is reflected in numerous applications accessible to the general public, such as information retrieval via virtual assistants, content generation, autonomous vehicles, drug discovery, and medical imaging. This unprecedented rate of AI adoption raises the critical need for research on the fundamental underpinnings of deep neural networks to understand what leads to their decisions and why they fail. This thesis concentrates on self-supervised representation learning, a prevalent unsupervised method employed by foundational models to extract patterns from extensive visual data. Specifically, our focus lies in examining the low-dimensional representations generated by these models and dissecting their failure modes. In our initial investigation, we discover that self-supervised representations lack robustness to domain shifts, as they are not explicitly trained to distinguish image content from its domain. We remedy this issue by proposing a module that can be plugged into existing self-supervised baselines to disentangle their representation spaces and promote domain invariance and generalization. Our subsequent analysis delves into the patterns within representations that influence downstream classification. We scrutinize the discriminative capacity of individual features and their activations. We then propose an unsupervised quality metric that can preemptively determine whether a given representation will be correctly or incorrectly classified, with high precision. In the next segment of this thesis, we leverage our findings to further demystify the representation space, by uncovering interpretable subspaces which have unique concepts associated with them. We design a novel explainability framework that uses a vision-language model (such as CLIP) to provide natural language explanations for neural features (or groups) of a given pre-trained model. We next investigate the role of augmentations and format transformations in learning generalizable visual representations. Drawing inspiration from advancements in audio and speech modalities, we examine how presenting visual data in multiple formats affects learning, separating this from the impact of augmentations. In the final segment, we reveal compositionality as a notable failure mode in current state-of-the-art representation methods. We critique the use of fixed-size patches in vision transformers and demonstrate the benefits of employing semantically meaningful patches based on visual priors. This design adjustment leads to significant improvements in image-text retrieval tasks and, more importantly, enhances performance on compositionality benchmarks.en_US
dc.identifier.urihttp://hdl.handle.net/1903/33641
dc.language.isoenen_US
dc.subject.pqcontrolledArtificial intelligenceen_US
dc.subject.pquncontrolledComputer Visionen_US
dc.subject.pquncontrolledInterpretabilityen_US
dc.subject.pquncontrolledMulti-Modal Modelsen_US
dc.subject.pquncontrolledRepresentation Learningen_US
dc.subject.pquncontrolledSelf-Supervised Learningen_US
dc.titleInterpreting Visual Representations and Mitigating their Failuresen_US
dc.typeDissertationen_US

Files

Original bundle

Now showing 1 - 1 of 1
No Thumbnail Available
Name:
Kalibhat_umd_0117E_24749.pdf
Size:
78.61 MB
Format:
Adobe Portable Document Format