Uncovering, Understanding, and Mitigating Social Biases in Language Models
Abstract
Social science research has shown that stereotypes and discrimination based on race/ethnicity or gender, which are often inferred from first names, exacerbate social inequality. Contemporary natural language processing (NLP) systems, including large language models (LLMs), are trained on extensive but potentially biased corpora and may therefore inadvertently perpetuate these biases. This dissertation aims to uncover, understand, and mitigate social biases in NLP systems, particularly by analyzing biases related to first names.
The concept of counterfactual fairness serves as a guiding principle: model predictions should ideally remain consistent under name substitutions that preserve the original semantic meaning. We leverage name substitution to investigate biases in NLP systems; this approach offers several advantages. First, automatically generating diverse instances through name substitution streamlines bias detection without requiring manual data creation. Second, examining model behavior across many first names in an open-ended space can reveal biases that pre-defined diagnostic tests miss. Lastly, using first names for bias identification aligns with real-world applications concerned with individual fairness.
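To make the counterfactual-fairness check concrete, the minimal sketch below probes a model by swapping first names into otherwise identical sentences and measuring how much its scores shift between name groups. It is not taken from the dissertation: the name lists, templates, and the score_fn placeholder are illustrative assumptions standing in for the tasks and models actually studied.

# Illustrative sketch of name-substitution probing in the spirit of
# counterfactual fairness. Name groups, templates, and score_fn are
# hypothetical placeholders, not the dissertation's data or models.

NAMES = {
    "group_a": ["Emily", "Greg"],      # hypothetical name group A
    "group_b": ["Lakisha", "Jamal"],   # hypothetical name group B
}

TEMPLATES = [
    "{name} is applying for the software engineer position.",
    "{name} asked a thoughtful question during the meeting.",
]

def counterfactual_gap(score_fn, names_a, names_b, templates):
    """Average score gap between two name groups over shared templates.

    score_fn maps a sentence to a scalar (e.g., a positivity or hiring
    score) produced by the model under audit. Under counterfactual
    fairness, swapping names that leave the sentence's meaning unchanged
    should leave the score unchanged, so this gap should be near zero.
    """
    gaps = []
    for template in templates:
        avg_a = sum(score_fn(template.format(name=n)) for n in names_a) / len(names_a)
        avg_b = sum(score_fn(template.format(name=n)) for n in names_b) / len(names_b)
        gaps.append(avg_a - avg_b)
    return sum(gaps) / len(gaps)

A large absolute gap flags a potential name-based disparity worth closer inspection; the dissertation's actual probes cover social commonsense reasoning, hiring, and relationship-prediction tasks rather than these toy templates.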
Using name substitution as our primary technique, we study three types of biases in NLP systems: stereotypes about personal attributes, occupational biases, and biases about romantic relationships. Stereotypes about personal attributes emerge when a model infers someone's personality from inputs describing social interactions. Occupational biases encompass both hiring discrimination and gender-occupation stereotypes. Romantic relationship biases include heteronormative assumptions and prejudice against interracial couples.
To study stereotypes about personal attributes, we introduce a framework that uncovers model biases in social commonsense reasoning tasks and show that both demographic associations and tokenization artifacts contribute to observed disparities. For occupational biases, we demonstrate that LLMs can exhibit discriminatory patterns in simulated hiring tasks and stereotypically associate gendered names with gender-dominated professions. We further analyze these patterns by studying the contextualized embeddings and propose a consistency-guided finetuning method to mitigate such biases. Finally, in the domain of romantic relationship prediction from conversations, we find evidence of heteronormative bias and underprediction of romantic relationships for couples involving Asian names.
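The consistency-guided finetuning objective is not spelled out in this abstract; as one hedged illustration of the general idea, the sketch below adds a symmetric KL consistency penalty that ties a model's predictions on an input to its predictions on the same input with the first name substituted. The PyTorch/HuggingFace-style model interface, the swapped_input_ids field, and the lambda_consistency weight are assumptions made for illustration, not the dissertation's exact method.

import torch.nn.functional as F

def consistency_loss(logits_original, logits_swapped):
    """Symmetric KL divergence between predictions on an example and its
    name-swapped counterfactual; zero when the two distributions match."""
    log_p = F.log_softmax(logits_original, dim=-1)
    log_q = F.log_softmax(logits_swapped, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    kl_qp = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
    return 0.5 * (kl_pq + kl_qp)

def training_step(model, batch, lambda_consistency=1.0):
    """One finetuning step: standard task loss on the original input plus a
    penalty for diverging from the name-swapped version of the same input."""
    logits = model(batch["input_ids"]).logits                   # original example
    logits_swapped = model(batch["swapped_input_ids"]).logits   # name-substituted copy
    task_loss = F.cross_entropy(logits, batch["labels"])
    return task_loss + lambda_consistency * consistency_loss(logits, logits_swapped)

At inference time no counterfactual is needed; the penalty only shapes training so that semantically equivalent, name-swapped inputs receive matching predictions.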
In sum, these contributions offer a comprehensive examination of first-name-based biases in language models, provide insight into their underlying mechanisms, and present actionable mitigation strategies. This work takes a step toward developing fairer, more interpretable, and more inclusive language technologies.