In many settings, communicating in a language requires choosing among different possibilities: the issues to focus on, the aspects to highlight within any issue, the narratives to include, and more. These choices, deliberate or not, are socially structured. The ever-increasing availability of unstructured large-scale textual data, in part because the bulk of communication and information dissemination now happens in online or digital spaces, makes natural language processing (NLP) techniques a natural fit for understanding socially situated communicative choices from that data. Among NLP methods, unsupervised ones are often needed because large-scale digital text in the wild typically comes without accompanying labels, and any existing labels or categorization might not be appropriate for answering specific research questions.

This dissertation seeks to address the following question: how can we use unsupervised NLP methods to study texts authored by specific people or institutions in order to effectively explicate the communicative choices being made, as well as to investigate their potential motivations, context-based variation, and consequences?

Our first set of contributions centers on methodological innovation. We focus on topic modeling: a class of generally unsupervised NLP methods that can automatically discover authors’ communicative choices in the form of topics or categorical themes present in a collection of documents. We introduce a new neural topic model (NTM) that effectively incorporates contextualizing sequential knowledge. Next, we find critical gaps in the near-universal automated evaluation paradigm that compares different models in the topic modeling methods literature, which calls into question much of the recent work in NTM development claiming “state-of-the-art” and emphasizes the importance of validating the outputs of unsupervised NLP methods.

To use unsupervised NLP methods to investigate the potential motivations, context-based variation, and consequences of communicative choices, we link textual data with information about the authors, social contexts, and media involved in their production. These connected information sources help us conduct empirical research in the social sciences.

In our second set of contributions, we analyze a previously unexplored connection between politicians' donors and the communicative choices in their floor speeches to show how donations influence issue attention in the US Congress, enabling a new look at money in politics and providing an example of studying the motivations behind communicative choices.

Our third set of contributions uses text-based ideal points to better understand the role of institutional constraints and audience considerations in the varying expression and ideological positioning of politicians. The application of this tool for expanding knowledge of legislative politics is enabled by comprehensive annotations for modeling outputs provided by domain experts in order to establish the tool’s validity and reliability.

In our fourth set of contributions, we demonstrate the potential of both unsupervised NLP techniques and social network data and methods in better understanding the downstream consequences of communicative choices. We focus on misinformation narratives in mainstream media, viewing and highlighting misinformation as something that goes beyond just false claims published by certain bad actors or stories published by certain ‘fake news’ outlets. Our findings suggest a strategic repurposing of mainstream news by conveyors of misinformation as a way to enhance the reach and persuasiveness of misleading narratives.