KNOWLEDGE DISCOVERY FROM GENE EXPRESSION DATA: NOVEL METHODS FOR SIMILARITY SEARCH, SIGNATURE DETECTION, AND CONFOUNDER CORRECTION
Gene expression microarray data is used to answer a variety of scientific questions. For example, it can be used to gain a better understanding of a drug, to segment a disease, and to predict an optimal therapeutic response. The amount of publicly available gene expression data is extremely large and continues to grow at an increasing rate. However, this rapid growth of gene expression data from laboratories across the world has not yet achieved its full potential impact on the scientific community. The shortfall arises because most of the data has been gathered under varying conditions, and there is no principled way to combine and fully utilize related data. Even within a closely controlled gene expression experiment, there are confounding factors that may mask the true signatures when the data is analyzed with current methods. Therefore, we are interested in three core tasks that we believe are important for improving the utilization of gene array data: similarity search, signature detection, and confounder correction. We have developed novel methods that address each of these tasks. In this work, we first address the similarity search problem. More specifically, we propose methods that overcome experimental barriers in pairwise gene expression similarity calculations. We introduce a method, <italic>indirect similarity</italic>, that, unlike previous approaches, uses all of the information in a database to better inform the similarity calculation for a pair of gene expression profiles. We demonstrate that our method is more robust and better able to cope with experimental barriers such as vehicle and batch effects. We evaluate how well our method retrieves compounds with similar therapeutic effects in two independent datasets, and show that it improves recall by 97.03% and 49.44%, respectively, over existing state-of-the-art approaches.
The second problem we focus on is signature detection. Gene expression experiments are performed to test a specific hypothesis. Generally, this hypothesis is that there is some genetic signature common to a group of samples. Current methods try to find the differentially expressed genes within a group of samples in a variety of ways; however, they are all parametric. We introduce a nonparametric approach to group profile creation, which we refer to as the <italic>Weighted Influence Model - Rank of Ranks</italic> method. For every probe on the microarray, the average rank is calculated across all members of a group. These average ranks are then re-ranked to form the group profile. We demonstrate that our group profile method yields a better understanding of a disease and of the underlying mechanism common to its treatments. Additionally, we demonstrate the predictive power of this group profile for detecting novel drugs that could treat a particular disease. This method leads to the detection of robust group signatures even in the presence of unknown confounding effects. The final problem that we address is the challenge of removing known (annotated) confounding effects from gene expression profiles. We propose an extension to our nonparametric gene expression profile method to correct for observed confounding effects. This correction is performed directly on ranked lists, and it provides a robust alternative to parametric batch profile correction methods. We evaluate our novel profile subtraction method on two real-world datasets, comparing it against several state-of-the-art parametric methods. We demonstrate an improvement in group signature detection when our method is used to remove confounding effects. Additionally, we show that in a dataset with the true group assignments removed and only the confounding effects labelled, our profile subtraction method allows for the discovery of the true groups.
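As a rough illustration of the rank-of-ranks group profile described above (rank each sample's probes, average those ranks per probe across the group, then re-rank the averages), the following is a minimal Python sketch. It is not the authors' implementation: the function names, the convention that rank 1 means highest expression, the unweighted mean (the weighting used in the Weighted Influence Model is not specified in this abstract), and the toy expression values are all assumptions made for illustration.

```python
import numpy as np


def rank_within_sample(values):
    """Rank probes within one sample; rank 1 = highest expression (assumed convention)."""
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def rank_of_ranks(expression):
    """Build a group profile from an (n_samples, n_probes) expression array.

    Hypothetical helper: uses a plain (unweighted) mean of per-sample ranks,
    which is an assumption, then re-ranks the per-probe averages.
    """
    expression = np.asarray(expression, dtype=float)
    per_sample_ranks = np.vstack([rank_within_sample(row) for row in expression])
    mean_ranks = per_sample_ranks.mean(axis=0)  # average rank of each probe
    # Re-rank the averages: the probe with the smallest mean rank gets rank 1.
    order = np.argsort(mean_ranks, kind="stable")
    profile = np.empty(mean_ranks.shape[0], dtype=int)
    profile[order] = np.arange(1, mean_ranks.shape[0] + 1)
    return profile


# Toy group: 3 samples x 4 probes (made-up values, for illustration only).
group = np.array([
    [9.1, 2.0, 5.5, 7.3],
    [8.7, 1.5, 6.9, 6.0],
    [9.9, 3.2, 4.1, 8.8],
])
profile = rank_of_ranks(group)
print(profile)
```

Because only the ordering of probes within each sample enters the computation, any monotone rescaling of a sample's intensities leaves the resulting profile unchanged, which is one way to see why a rank-based profile avoids parametric assumptions about the expression values.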
We evaluate the robustness of our methods using a gene expression profile generator that we developed.