Towards Trustworthy AI: Methods for Enhancing Robustness and Attribution

Advisor

Goldstein, Thomas
Jacobs, David

Abstract

Current deep learning systems demonstrate remarkable performance across diverse computer vision tasks, ranging from image classification to generative modeling. However, these models remain vulnerable to subtle adversarial manipulations, and their predictions are difficult to audit and interpret. In this dissertation, we explore two fundamental challenges for deploying trustworthy AI systems: robustness and attribution.

In the first half, we focus on robustness against adversarial perturbations: small, imperceptible changes to the input that significantly alter model behavior. In the first chapter, we investigate how the geometric properties of activation functions affect adversarial training, a standard defense against imperceptible perturbations. We find that low-curvature activations mitigate robust overfitting, improving generalization and alleviating the double descent phenomenon. In the second chapter, we study the influence of shift invariance, a fundamental property of convolutional neural networks, on adversarial vulnerability. We prove, for simple datasets, that invariance to circular shifts can increase sensitivity to adversarial attacks, and we verify this empirically on real datasets and realistic architectures, showing that shift invariance reduces adversarial robustness.
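To make the first chapter's setting concrete, here is a minimal sketch of PGD-based adversarial training in PyTorch, assuming a generic classifier and standard L-infinity hyperparameters. Swapping the model's ReLU activations for a smooth, low-curvature alternative such as `torch.nn.SiLU` is the kind of architectural change the chapter studies; all names and budgets below are illustrative, not the dissertation's exact setup.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Find an L-infinity-bounded perturbation of x that maximizes the loss."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # ascend the loss, then project back into the L-infinity ball
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

def adversarial_train_step(model, x, y, optimizer):
    """One adversarial-training step: fit the model on worst-case inputs."""
    delta = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```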
In the third chapter, we propose a new approach that makes datasets "unlearnable" by adding imperceptible noise to the training data. Our approach, named autoregressive perturbations, is a dataset-agnostic poisoning strategy that generates imperceptible yet potent data poisoning attacks resistant to adversarial training and strong data augmentations.
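A hypothetical sketch of the core idea, assuming class-conditional noise produced by a fixed autoregressive filter and scaled to an imperceptible L-infinity budget; the coefficients, filter order, and scan direction here are placeholders and do not reproduce the dissertation's exact generation procedure.

```python
import numpy as np

def ar_noise(shape, coeffs, rng, drive=0.1):
    """Fill an array with an autoregressive process along its last axis."""
    p = len(coeffs)
    out = rng.standard_normal(shape)
    for t in range(p, shape[-1]):
        # each value is a fixed linear combination of its predecessors plus noise
        out[..., t] = sum(c * out[..., t - k - 1] for k, c in enumerate(coeffs))
        out[..., t] += drive * rng.standard_normal(shape[:-1])
    return out

def poison(images, labels, eps=8/255, num_classes=10, seed=0):
    """Add class-conditional AR noise so the noise pattern encodes the label."""
    rng = np.random.default_rng(seed)
    class_coeffs = [rng.uniform(-0.5, 0.5, size=3) for _ in range(num_classes)]
    poisoned = images.astype(np.float32)
    for c in range(num_classes):
        mask = labels == c
        if not mask.any():
            continue
        noise = ar_noise(images[mask].shape, class_coeffs[c], rng)
        noise *= eps / np.abs(noise).max()  # keep the perturbation imperceptible
        poisoned[mask] = np.clip(poisoned[mask] + noise, 0.0, 1.0)
    return poisoned
```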

Complementing the study of robustness, the second half delves into attribution, i.e., understanding predictions through the lens of training data. In the fourth chapter, we propose a simple mechanism for understanding how sensitive a model's predictions are to its training data. We show that, in contrast to prior work that requires substantial computation and a large ensemble of models, a single self-supervised model can serve as a baseline for how the training data influences model predictions.
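A minimal sketch of such a single-model baseline, assuming `encoder` is any pretrained self-supervised feature extractor: training examples are ranked by cosine similarity to a test example in feature space, as a cheap proxy for training-data influence. The function name and the choice of similarity are ours, not the dissertation's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top_influencers(encoder, train_images, test_image, k=5):
    """Rank training examples by feature similarity to one test example."""
    feats = F.normalize(encoder(train_images), dim=1)             # (N, d) unit vectors
    query = F.normalize(encoder(test_image.unsqueeze(0)), dim=1)  # (1, d)
    sims = (feats @ query.T).squeeze(1)                           # cosine similarities
    scores, indices = sims.topk(k)
    return indices, scores  # most similar training points, as an influence proxy
```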

In the fifth chapter, we focus on memorization, a special case of attribution for generative models. We rigorously examine memorization in diffusion-based text-to-image architectures such as Stable Diffusion, quantifying substantial replication of training data in generated outputs. To counteract such memorization, we devise practical interventions that effectively reduce copying without compromising the generative quality of these models.
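One way to quantify replication, sketched under our own assumptions: embed generations and training images with a copy-detection encoder and flag any generation whose nearest training neighbor exceeds a similarity threshold. The encoder and threshold are placeholders; the dissertation's detector may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flag_copies(encoder, generated, train_images, threshold=0.5):
    """Flag generations whose nearest training image is suspiciously similar."""
    g = F.normalize(encoder(generated), dim=1)     # (G, d) generated features
    t = F.normalize(encoder(train_images), dim=1)  # (N, d) training features
    sims = g @ t.T                                 # pairwise cosine similarities
    best, nearest = sims.max(dim=1)                # closest training image per generation
    return best > threshold, nearest               # which generations look memorized
```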

Collectively, this thesis provides novel insights and practical methods that contribute to the development of more reliable and trustworthy AI systems.
