The Many Faces of Generalization: from Traditional ML to LLM Safety
Abstract
Ensuring the trustworthiness of machine learning systems requires strong generalization not only under distribution shifts but also under adversarial manipulations. This dissertation advances trustworthy machine learning from two perspectives: improving generalization through model invariance and equivariance, and improving safety alignment in large language models (LLMs).
The first part of the dissertation investigates how structural priors such as invariance can enhance generalization in the vision domain. We introduce the notion of sample cover induced by transformations, a data-dependent metric that theoretically quantifies the effectiveness of data augmentations and empirically guides their selection. To achieve robustness against unforeseen data variations, we propose an equivariant domain translation framework that leverages out-of-distribution data to learn such robustness. Lastly, we draw inspiration from human perception and propose PerceptionCLIP, a two-step method that improves zero-shot image classification by first inferring, and then conditioning on, contextual attributes that the model should be invariant to.
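As a rough illustration of the two-step idea, consider the following minimal sketch. It assumes the Hugging Face transformers CLIP API; the model checkpoint, prompt templates, attribute values, and class names are illustrative placeholders, not the dissertation's actual configuration.

```python
# Minimal sketch of a two-step, attribute-conditioned zero-shot classifier
# in the spirit of PerceptionCLIP. The checkpoint, templates, attributes,
# and classes are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "bird"]
attributes = ["in a photo", "in a sketch", "at night"]  # contextual attributes

def clip_logits(image, prompts):
    """Image-text similarity logits for a list of text prompts."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.squeeze(0)

def classify(image):
    # Step 1: infer the most likely contextual attribute of the image.
    attr_idx = clip_logits(image, [f"a picture {a}" for a in attributes]).argmax().item()
    attr = attributes[attr_idx]
    # Step 2: classify conditioned on the inferred attribute, so the class
    # decision is (approximately) invariant to that attribute.
    prompts = [f"a picture of a {c}, {attr}" for c in classes]
    return classes[clip_logits(image, prompts).argmax().item()]

print(classify(Image.open("example.jpg")))  # hypothetical input image
```

Conditioning the class prompts on the inferred attribute, rather than marginalizing over fixed templates, is what lets the second step ignore the nuisance context.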
The second part focuses on the safety alignment of LLMs. We present AutoDAN, a white-box attack that generates interpretable and transferable jailbreak prompts through gradient-guided text generation, revealing emergent attack strategies. We also use the same framework to automatically generate pseudo-harmful prompts for red-teaming false refusals in safety-aligned LLMs, uncovering trade-offs between safety and usability. Finally, we introduce AdvPrefix, a prefix-forcing objective that enables more nuanced and effective jailbreaks through automatic, model-dependent prefix selection.
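To make the prefix-forcing objective concrete, here is a minimal sketch of scoring and selecting target prefixes by their negative log-likelihood under the victim model. It assumes the Hugging Face transformers API; the model name, prompt, and candidate prefixes are placeholders, and the actual attack then optimizes the adversarial prompt against the selected prefix rather than merely scoring it.

```python
# Minimal sketch of a prefix-forcing objective with model-dependent prefix
# selection. Model name, prompt, and candidate prefixes are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def prefix_nll(prompt: str, prefix: str) -> float:
    """Per-token negative log-likelihood of `prefix` given `prompt`.
    Lower NLL means the model is more inclined to begin its response
    with this prefix, making it an easier target to force."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, prefix_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # Logits at position t predict token t+1, so the prefix tokens are
    # predicted by positions [len(prompt) - 1, len(ids) - 2].
    pred = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return -pred.gather(1, prefix_ids[0].unsqueeze(1)).mean().item()

# Select the prefix the model is already most willing to produce; the
# attack then optimizes the prompt to minimize this same objective.
prompt = "Explain how to pick a lock."  # placeholder request
candidates = ["Sure, here is how to", "Of course! Step 1:"]
print(min(candidates, key=lambda p: prefix_nll(prompt, p)))
```

Selecting prefixes per model matters because a prefix that one model produces readily may be far off-distribution for another, which is what makes the objective model-dependent.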
This dissertation demonstrates that generalization remains a central challenge across both traditional ML and modern LLMs. Together, these contributions provide practical tools and empirical findings for building machine learning systems that generalize better under distribution shifts and adversarial conditions.