Investigating and Tackling Bias and Hackings in AI Alignment
| dc.contributor.advisor | Duraiswami, Ramani | en_US |
| dc.contributor.author | Chen, Lichang | en_US |
| dc.contributor.department | Computer Science | en_US |
| dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
| dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
| dc.date.accessioned | 2026-01-27T06:31:41Z | |
| dc.date.issued | 2025 | en_US |
| dc.description.abstract | The rapid advancement of large language models has underscored the critical importance of AI alignment: ensuring that AI systems operate in accordance with human intentions and values. A central technique for alignment is Reinforcement Learning from Human Feedback (RLHF), which trains models by optimizing them against a reward signal derived from human preferences. While effective, this paradigm is susceptible to failure modes in which models learn to maximize their reward score without genuinely adhering to the desired principles. The first part of this thesis investigates critical vulnerabilities in current alignment methods, focusing on how models exploit unforeseen loopholes in evaluation and training frameworks. It demonstrates a pervasive issue in RLHF known as "reward hacking", revealing that prominent reward models and even human evaluators exhibit strong "format bias": an undue preference for superficial cues such as lists, bolded text, links, and emojis.
The second part extends this inquiry beyond unimodal text generation to the burgeoning field of Omni-modality Language Models (OLMs). To probe the alignment of these more complex systems, we introduce OmnixR, a novel evaluation suite designed to test reasoning across a diverse mix of modalities, including text, audio, images, and video. The evaluation reveals that even state-of-the-art OLMs such as GPT-4o and Gemini struggle significantly with tasks that require genuine cross-modal reasoning. These models exhibit unique biases and failure modes when forced to integrate information from multiple sources, indicating that alignment challenges are not only persistent but also evolve in complexity with model capabilities.
To address the vulnerabilities identified in RLHF, the third part of this thesis presents ODIN, a novel method designed to mitigate reward hacking. ODIN trains a two-head reward model that explicitly disentangles content quality from exploitable stylistic features, such as response length: one head is trained to correlate with these features and another to be decorrelated from them, isolating a purer signal for content quality. During the reinforcement learning phase, the policy is optimized using only the decorrelated, quality-focused reward signal (see the illustrative sketch below). Our experiments demonstrate that this approach effectively prevents the model from hacking the reward system through verbosity and other stylistic artifacts, resulting in better-aligned models that achieve high performance without resorting to superficial tricks.
The last part of the thesis introduces a data filter that removes low-quality examples from the supervised fine-tuning set. It proposes a simple and effective data selection strategy that automatically identifies and removes low-quality data using a strong LLM (e.g., ChatGPT); see the filtering sketch below. Building on this strategy, we introduce Alpagasus, which is fine-tuned on only 9k high-quality examples filtered from the 52k Alpaca data. Alpagasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and in a controlled human study, and its 13B variant matches the performance of its teacher LLM (i.e., Text-Davinci-003) on the test tasks. It also trains 5.7x faster, reducing the training time for the 7B variant. Our experiments further demonstrate that the method works not only on machine-generated datasets but also on human-written ones.
Overall, Alpagasus demonstrates a novel data-centric instruction fine-tuning (IFT) paradigm that can be applied broadly to instruction-tuning data, leading to faster training and better instruction-following models. Looking ahead, future work should focus on creating more challenging and dynamic benchmarks that co-evolve with model capabilities to prevent benchmark overfitting, paving the way for more reliable AI systems. | en_US |
| dc.identifier | https://doi.org/10.13016/r22o-9zyn | |
| dc.identifier.uri | http://hdl.handle.net/1903/35013 | |
| dc.language.iso | en | en_US |
| dc.subject.pqcontrolled | Computer science | en_US |
| dc.subject.pquncontrolled | AI Alignment | en_US |
| dc.subject.pquncontrolled | Language Modeling | en_US |
| dc.subject.pquncontrolled | Reinforcement Learning | en_US |
| dc.title | Investigating and Tackling Bias and Hackings in AI Alignment | en_US |
| dc.type | Dissertation | en_US |
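The abstract's description of ODIN can be made concrete with a minimal sketch. The code below is illustrative only, not the dissertation's implementation: it assumes a PyTorch-style encoder that returns a pooled hidden state, and the names `TwoHeadRewardModel`, `pearson_corr`, and `odin_style_loss`, as well as the exact preference loss, decorrelation objective, and loss weighting, are assumptions made for illustration. It shows the core idea of a shared backbone with a quality head and a style (length) head, where only the quality head would be used as the reward during RL.

```python
# Illustrative sketch of a two-head reward model in the spirit of ODIN.
# One head is trained to track an exploitable stylistic feature (response length);
# the other is pushed to be decorrelated from it, so RL can use the quality head alone.
import torch
import torch.nn as nn


class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                    # any encoder returning [batch, hidden]
        self.quality_head = nn.Linear(hidden_size, 1)
        self.style_head = nn.Linear(hidden_size, 1)

    def forward(self, inputs):
        h = self.backbone(inputs)                   # pooled representation, [batch, hidden]
        return self.quality_head(h).squeeze(-1), self.style_head(h).squeeze(-1)


def pearson_corr(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)


def odin_style_loss(model, chosen, rejected, chosen_len, rejected_len):
    q_c, s_c = model(chosen)
    q_r, s_r = model(rejected)
    # Standard Bradley-Terry preference loss on the combined reward.
    pref = -torch.nn.functional.logsigmoid((q_c + s_c) - (q_r + s_r)).mean()
    lengths = torch.cat([chosen_len, rejected_len]).float()
    style = torch.cat([s_c, s_r])
    quality = torch.cat([q_c, q_r])
    # Style head should correlate with length; quality head should not.
    # Loss weights are omitted here for brevity.
    corr_style = -pearson_corr(style, lengths)          # maximize correlation
    corr_quality = pearson_corr(quality, lengths).abs() # minimize |correlation|
    return pref + corr_style + corr_quality
```

At RL time, only the output of `quality_head` would be fed to the policy optimizer, so making a response longer or more heavily formatted no longer increases its reward.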
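The Alpagasus-style data filter described in the abstract can likewise be sketched as a scoring-and-thresholding loop. This is a hedged illustration rather than the released pipeline: `call_llm`, the rating prompt, and the 4.5 threshold are assumptions, while the field names (`instruction`, `input`, `output`) follow the public Alpaca data format.

```python
# Illustrative sketch: rate each instruction-tuning example with a strong LLM
# (e.g., ChatGPT) and keep only high-scoring ones.
import json
import re
from typing import Callable

RATING_PROMPT = (
    "You are grading the quality of a response to an instruction.\n"
    "Instruction: {instruction}\nInput: {input}\nResponse: {output}\n"
    "Rate the accuracy and helpfulness of the response on a scale of 1 to 5.\n"
    "Answer with a single number."
)


def rate_example(example: dict, call_llm: Callable[[str], str]) -> float:
    """Ask the grading LLM for a 1-5 score and parse the first number it returns."""
    reply = call_llm(RATING_PROMPT.format(**example))
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0


def filter_dataset(path: str, call_llm: Callable[[str], str], threshold: float = 4.5):
    """Load an Alpaca-style JSON list and keep examples scoring at or above threshold."""
    with open(path) as f:
        data = json.load(f)
    kept = [ex for ex in data if rate_example(ex, call_llm) >= threshold]
    print(f"kept {len(kept)} / {len(data)} examples")
    return kept
```

The surviving subset would then be used for supervised fine-tuning in place of the full 52k Alpaca set, which is what yields the faster training reported in the abstract.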
Files
Original bundle
- Name: Chen_umd_0117E_25588.pdf
- Size: 7.68 MB
- Format: Adobe Portable Document Format
(RESTRICTED ACCESS)