ENHANCING TRUSTWORTHINESS AND SAFETY IN FOUNDATION MODELS
Abstract
The rapid progress of foundation models has driven breakthroughs in computer vision, language, and speech generation. Yet their widespread deployment also introduces critical challenges in trustworthiness, robustness, and safety. This dissertation advances theoretical foundations and practical techniques to enhance the reliability of foundation models across classification and multi-modal generation tasks.
In the first part, we focus on classification. We introduce RetrievalGuard, the first provably robust method for 1-nearest-neighbor image retrieval, certifying resistance to adversarial manipulation. We further propose adversarial weight perturbation to improve the generalization of graph neural networks under adversarial conditions, and develop a law of robustness beyond isoperimetry, establishing a new theoretical framework for understanding robustness guarantees.
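To make the flavor of such a certificate concrete, the sketch below gives a generic margin-based guarantee for 1-nearest-neighbor retrieval in embedding space: by the triangle inequality, no perturbation of the query embedding smaller than half the gap between the nearest and second-nearest distances can change the retrieved item. This is an illustration of the certification idea under simplified assumptions, not RetrievalGuard's actual construction, and the function name is hypothetical.

    import numpy as np

    def certified_radius_1nn(query_emb, gallery_embs):
        """Margin certificate for 1-NN retrieval in embedding space.

        If the query embedding moves by delta, every distance to a
        gallery point changes by at most ||delta||_2 (triangle
        inequality), so the nearest neighbor is unchanged whenever
        ||delta||_2 < (d2 - d1) / 2, where d1 and d2 are the nearest
        and second-nearest distances.
        """
        dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
        order = np.argsort(dists)
        d1, d2 = dists[order[0]], dists[order[1]]
        return int(order[0]), (d2 - d1) / 2.0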
The second part addresses trustworthiness in language-based generation. We design resilient watermarking techniques that preserve the output distribution of large language models while remaining accessible and detectable, including a distribution-preserving watermark and an unbiased watermark framework. We also study the vulnerabilities of these systems through De-mark, a systematic watermark-removal attack, highlighting critical risks and guiding the design of future defenses.
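As a concrete illustration of the distribution-preserving idea, the sketch below uses the well-known Gumbel-max trick with keyed pseudorandomness: each step draws an exact sample from the model's softmax distribution, yet the choice is reproducible by anyone holding the secret key. This is a hedged sketch of a standard construction, not necessarily the exact schemes developed in the dissertation; the helper names and the hashing details are illustrative.

    import hashlib
    import numpy as np

    def keyed_uniforms(prev_tokens, key, vocab_size):
        """Pseudorandom uniforms for the next-token step, derived from a
        secret key and the preceding tokens; sampler and detector share them."""
        material = f"{key}|{','.join(map(str, prev_tokens))}".encode()
        seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
        return np.random.default_rng(seed).random(vocab_size)

    def watermarked_sample(logits, prev_tokens, key):
        """Gumbel-max sampling: argmax(logits - log(-log(u))) is an exact
        draw from softmax(logits), so the output distribution is preserved
        while the chosen token is tied to the keyed randomness."""
        u = keyed_uniforms(prev_tokens, key, logits.shape[0])
        return int(np.argmax(logits - np.log(-np.log(u))))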
The third part extends trustworthiness to multi-modal generation. We propose watermarking schemes tailored to order-agnostic language models and auto-regressive speech generation models, bridging theoretical guarantees with practical imperceptibility. In particular, we demonstrate robust and distortion-free watermarks for speech generation, one of the first principled approaches to securing audio foundation models against misuse.
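Detection carries over naturally to any discrete-token stream, including codec tokens emitted by an auto-regressive speech model: recompute the keyed uniforms and test whether the observed tokens concentrate on high values. The sketch below reuses keyed_uniforms from the previous sketch and is again an illustrative test under the same assumptions, not the dissertation's exact detector.

    import math

    def watermark_score(tokens, key, vocab_size):
        """z-statistic for the Gumbel-max sketch above. On unwatermarked
        text the recomputed u-values of the observed tokens are i.i.d.
        Uniform(0,1), so -log(1 - u) is Exp(1) with mean 1 and variance 1;
        watermarked text favors high-u tokens and inflates the score."""
        scores = [
            -math.log(1.0 - keyed_uniforms(tokens[:i], key, vocab_size)[tokens[i]])
            for i in range(1, len(tokens))
        ]
        n = len(scores)
        return (sum(scores) - n) / math.sqrt(n)  # large z => watermarked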
Together, these contributions form a comprehensive agenda for enhancing the trustworthiness and safety of foundation models. By unifying robustness theory with practical watermarking, this dissertation provides both provable insights and deployable mechanisms, advancing the development of responsible and reliable AI systems.