ENHANCING TRUSTWORTHINESS AND SAFETY IN FOUNDATION MODELS
Abstract
The rapid progress of foundation models has driven breakthroughs in computer vision, language, and speech generation. Yet their widespread deployment also introduces critical challenges in trustworthiness, robustness, and safety. This dissertation advances theoretical foundations and practical techniques to enhance the reliability of foundation models across classification and multi-modal generation tasks.
In the first part, we focus on classification. We introduce RetrievalGuard, the first provably robust method for 1-nearest-neighbor image retrieval, certifying resistance to adversarial manipulation. We further propose adversarial weight perturbation to improve the generalization of graph neural networks under adversarial conditions, and develop a law of robustness beyond isoperimetry, establishing a new theoretical framework for understanding robustness guarantees.
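To make the flavor of such a certificate concrete, the sketch below gives a generic margin-based guarantee for 1-nearest-neighbor retrieval in embedding space: by the triangle inequality, no perturbation of the query embedding smaller than half the gap between the nearest and second-nearest distances can change the retrieved item. This is an illustration of the certification idea under simplified assumptions, not RetrievalGuard's actual construction, and the function name is hypothetical.

    import numpy as np

    def certified_radius_1nn(query_emb, gallery_embs):
        """Margin certificate for 1-NN retrieval in embedding space.

        If the query embedding moves by delta, every distance to a
        gallery point changes by at most ||delta||_2 (triangle
        inequality), so the nearest neighbor is unchanged whenever
        ||delta||_2 < (d2 - d1) / 2, where d1 and d2 are the nearest
        and second-nearest distances.
        """
        dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
        order = np.argsort(dists)
        d1, d2 = dists[order[0]], dists[order[1]]
        return int(order[0]), (d2 - d1) / 2.0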
The second part addresses trustworthiness in language-based generation. We design resilient watermarking techniques that preserve the output distribution of large language models while remaining accessible and detectable, including a distribution-preserving watermark and an unbiased watermark framework. We also study the vulnerabilities of these systems through De-mark, a systematic watermark-removal attack, highlighting critical risks and guiding the design of future defenses.
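As a concrete illustration of the distribution-preserving idea, the sketch below uses the well-known Gumbel-max trick with keyed pseudorandomness: each step draws an exact sample from the model's softmax distribution, yet the choice is reproducible by anyone holding the secret key. This is a hedged sketch of a standard construction, not necessarily the exact schemes developed in the dissertation; the helper names and the hashing details are illustrative.

    import hashlib
    import numpy as np

    def keyed_uniforms(prev_tokens, key, vocab_size):
        """Pseudorandom uniforms for the next-token step, derived from a
        secret key and the preceding tokens; sampler and detector share them."""
        material = f"{key}|{','.join(map(str, prev_tokens))}".encode()
        seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
        return np.random.default_rng(seed).random(vocab_size)

    def watermarked_sample(logits, prev_tokens, key):
        """Gumbel-max sampling: argmax(logits - log(-log(u))) is an exact
        draw from softmax(logits), so the output distribution is preserved
        while the chosen token is tied to the keyed randomness."""
        u = keyed_uniforms(prev_tokens, key, logits.shape[0])
        return int(np.argmax(logits - np.log(-np.log(u))))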
The third part extends trustworthiness to multi-modal generation. We propose watermarking schemes tailored to order-agnostic language models and auto-regressive speech generation models, bridging theoretical guarantees with practical imperceptibility. In particular, we demonstrate robust and distortion-free watermarks for speech generation, one of the first principled approaches to securing audio foundation models against misuse.
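Detection carries over naturally to any discrete-token stream, including codec tokens emitted by an auto-regressive speech model: recompute the keyed uniforms and test whether the observed tokens concentrate on high values. The sketch below reuses keyed_uniforms from the previous sketch and is again an illustrative test under the same assumptions, not the dissertation's exact detector.

    import math

    def watermark_score(tokens, key, vocab_size):
        """z-statistic for the Gumbel-max sketch above. On unwatermarked
        text the recomputed u-values of the observed tokens are i.i.d.
        Uniform(0,1), so -log(1 - u) is Exp(1) with mean 1 and variance 1;
        watermarked text favors high-u tokens and inflates the score."""
        scores = [
            -math.log(1.0 - keyed_uniforms(tokens[:i], key, vocab_size)[tokens[i]])
            for i in range(1, len(tokens))
        ]
        n = len(scores)
        return (sum(scores) - n) / math.sqrt(n)  # large z => watermarked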
Together, these contributions form a comprehensive agenda for enhancing the trustworthiness and safety of foundation models. By unifying robustness theory with practical watermarking, this dissertation provides both provable insights and deployable mechanisms, advancing the development of responsible and reliable AI systems.