Expressive Knowledge Resources in Probabilistic Models
Understanding large collections of unstructured documents remains a persistent problem. Users need to understand the themes of a corpus and to explore documents of interest. Topic models are a useful and ubiquitous tool for discovering the main themes, or topics, of a corpus. They have been successfully applied in natural language processing, computer vision, information retrieval, cognitive science, and other fields. However, the discovered topics are not always meaningful: some topics conflate two or more themes; two different topics can be near duplicates; and some topics make no sense at all. Incorporating knowledge resources into topic models can improve the topics. However, how to encode knowledge into topic models and where to find these knowledge resources remain two scientific challenges. To address these problems, this thesis presents tree-based topic models to encode prior knowledge, a mechanism for incorporating knowledge from untrained users, a polylingual tree-based topic model that uses existing dictionaries as knowledge resources, an exploration of regularizing spectral methods to encode prior knowledge into topic models, and a model for automatically building hierarchies of prior knowledge for topic models.
To encode knowledge resources into topic models, we first present tree-based topic models, in which correlations between word types are modeled as a prior tree and applied to topic models. We also develop more efficient inference algorithms for tree-based topic models. Experiments on multiple corpora show that these algorithms greatly improve efficiency across different numbers of topics, numbers of correlations, and vocabulary sizes.
Because users decide whether the topics are useful, their feedback is necessary for effective topic modeling. We thus propose a mechanism that gives untrained users a voice in topic modeling by encoding their feedback as correlations between word types in tree-based topic models. This framework, interactive topic modeling (ITM), allows untrained users to encode their feedback easily and iteratively into the topic models. We validate the framework with both simulated and real users and discuss strategies for improving the user experience so that models adapt to what users need.
Existing knowledge resources such as dictionaries can also improve the model. We propose polylingual tree-based topic models based on bilingual dictionaries and apply this model to domain adaptation for statistical machine translation. We derive three different inference schemes and evaluate the efficacy of our model on a Chinese-to-English translation system, obtaining an improvement of up to 1.2 BLEU over the machine translation baseline.
This thesis further explores an alternative way to encode prior knowledge into topic models: regularizing spectral methods. Spectral methods offer scalable alternatives to Markov chain Monte Carlo and expectation maximization. However, these new methods lack the priors that are associated with probabilistic models. We examine Arora et al.'s anchor algorithm for topic models and encode prior knowledge by regularizing the anchor algorithm to improve the interpretability and generalizability of topic models.
Because existing knowledge resources are limited and because obtaining knowledge from users is expensive and time-consuming, automatic techniques for extracting knowledge from the corpus should also be considered. This thesis further presents a Bayesian hierarchical clustering technique with the Beta coalescent, which provides a possible way to build the prior tree automatically. Because inference in this model is computationally expensive, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render the inference practical and efficient.
This thesis explores sources of prior knowledge, presents different ways to encode these expressive knowledge resources into probabilistic topic models, and applies these models to domain adaptation for machine translation. We also discuss further extensions in the broader context of interactive machine learning techniques and domain adaptation for downstream tasks.