Guided Probabilistic Topic Models for Agenda-setting and Framing

Thumbnail Image


Publication or External Link





Probabilistic topic models are powerful methods to uncover hidden thematic structures in text by projecting each document into a low dimensional space spanned by a set of topics. Given observed text data, topic models infer these hidden structures and use them for data summarization, exploratory analysis, and predictions, which have been applied to a broad range of disciplines.

Politics and political conflicts are often captured in text. Traditional approaches to analyze text in political science and other related fields often require close reading and manual labeling, which is labor-intensive and hinders the use of large-scale collections of text. Recent work, both in computer science and political science, has used automated content analysis methods, especially topic models to substantially reduce the cost of analyzing text at large scale. In this thesis, we follow this approach and develop a series of new probabilistic topic models, guided by additional information associated with the text, to discover and analyze agenda-setting (i.e., what topics people talk about) and framing (i.e., how people talk about those topics), a central research problem in political science, communication, public policy and other related fields.

We first focus on study agendas and agenda control behavior in political debates and other conversations. The model we introduce, Speaker Identity for Topic Segmentation (SITS), is able to discover what topics that are talked about during the debates, when these topics change, and a speaker-specific measure of agenda control. To make the analysis process more effective, we build Argviz, an interactive visualization which leverages SITS's outputs to allow users to quickly grasp the conversational topic dynamics, discover when the topic changes and by whom, and interactively visualize the conversation's details on demand. We then analyze policy agendas in a more general setting of political text. We present the Label to Hierarchy (L2H) model to learn a hierarchy of topics from multi-labeled data, in which each document is tagged with multiple labels. The model captures the dependencies

among labels using an interpretable tree-structured hierarchy, which helps provide insights about the political attentions that policymakers focus on, and how these policy issues relate to each other.

We then go beyond just agenda-setting and expand our focus to framing--the study of how agenda issues are talked about, which can be viewed as second-level agenda-setting. To capture this hierarchical views of agendas and frames, we introduce the Supervised Hierarchical Latent Dirichlet Allocation (SHLDA) model, which jointly captures a collection of documents, each is associated with a continuous response variable such as the ideological position of the document's author on a liberal-conservative spectrum. In the topic hierarchy discovered by SHLDA, higher-level nodes map to more general agenda issues while lower-level nodes map to issue-specific frames. Although qualitative analysis shows that the topic hierarchies learned by SHLDA indeed capture the hierarchical view of agenda-setting and framing motivating the work, interpreting the discovered hierarchy still incurs moderately high cost due to the complex and abstract nature of framing. Motivated by improving the hierarchy, we introduce Hierarchical Ideal Point Topic Model (HIPTM) which jointly models a collection of votes (e.g., congressional roll call votes) and both the text associated with the voters (e.g., members of Congress) and the items (e.g., congressional bills). Customized specifically for capturing the two-level view of agendas and frames, HIPTM learns a two-level hierarchy of topics, in which first-level nodes map to an interpretable policy issue and second-level nodes map to issue-specific frames. In addition, instead of using pre-computed response variable, HIPTM also jointly estimates the ideological positions of voters on multiple interpretable dimensions.