Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

  • Fantastic Sources Of Tumor Heterogeneity And How To Characterize Them
    (2021) Patkar, Sushant A; Ruppin, Eytan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cancer constantly evolves to evade the host immune system and resist different treatments. As a consequence, we see a wide range of inter- and intra-tumor heterogeneity. In this PhD thesis, we present a collection of computational methods that characterize this heterogeneity from diverse perspectives. First, we developed computational frameworks for predicting functional re-wiring events in cancer and imputing the functional effects of protein-protein interactions given genome-wide transcriptomics and genetic perturbation data. Second, we developed a computational framework to characterize intra-tumor genetic heterogeneity in melanoma from bulk sequencing data and study its effects on the host immune response and patient survival independently of the overall mutation burden. Third, we analyzed publicly available genome-wide copy number, expression and methylation data of distinct cancer types and their normal tissues of origin to systematically uncover factors driving the acquisition of cancer type-specific chromosomal aneuploidies. Lastly, we developed a new computational tool, CODEFACS (COnfident Deconvolution For All Cell Subsets), to dissect the cellular heterogeneity of each patient’s tumor microenvironment (TME) from bulk RNA sequencing data, and LIRICS (LIgand Receptor Interactions between Cell Subsets), a supporting statistical framework to discover clinically relevant cellular immune crosstalk. Taken together, the methods presented in this thesis offer a way to study tumor heterogeneity in large patient cohorts using widely available bulk sequencing data and obtain new insights into tumor progression.
  • CHARACTERIZATION OF SURVIVAL ASSOCIATED GENE INTERACTIONS AND LYMPHOCYTE HETEROGENEITY IN CANCER
    (2019) Magen, Assaf; Hannenhalli, Sridhar; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cancer is the second leading cause of death globally. Tumors form intricate ecosystems in which malignant and immune cells interact to shape disease progression. Yet, the molecular underpinnings of tumorigenesis and immunological responses to tumors are poorly understood, limiting their manipulation to elicit favorable clinical outcomes. This thesis lays conceptual frameworks for investigating the molecular interactions taking place in tumors as well as the diversity of the immune response to cancer. At the molecular level of individual cancer cells, the phenotypic effect of perturbing a gene’s activity depends on the activity level of other genes, reflecting the notion that phenotypes are emergent properties of a network of functionally interacting genes. In the context of cancer, contemporary investigations have primarily focused on just one type of functional genetic interaction (GI) – synthetic lethality (SL). However, there may be additional types of GIs whose systematic identification would enrich the molecular and functional characterization of cancer. This thesis describes a novel data-driven approach called EnGIne that, when applied to large-scale cancer data, identifies 71,946 GIs spanning 12 distinct types, only a small minority of which are SLs. The detected GIs explain cancer driver genes’ tissue-specificity and differences in patients’ response to drugs, and stratify breast cancer tumors into refined subtypes. These results expand the scope of cancer GIs and lay a conceptual and computational basis for future studies of additional types of GIs and their translational applications. Furthermore, tumor growth is continuously shaped by the immune response. However, T cells typically adopt a dysfunctional phenotype that may be reversed using immunotherapy strategies. Most current tumor immunotherapies leverage cytotoxic CD8+ T cells to elicit an effective anti-tumor response. Despite evidence for clinical potential of CD4+ tumor-infiltrating lymphocytes (TILs), their functional diversity has limited our ability to harness their anti-tumor activity. To address this issue, we have used single-cell mRNA sequencing (scRNAseq) to analyze the response of CD4+ T cells specific for a defined recombinant tumor antigen, both in the tumor microenvironment and draining lymph nodes (dLN). New computational approaches to characterize subpopulations identified TIL transcriptomic patterns strikingly distinct from those elicited by responses to infection, and dominated by diversity among T-bet-expressing T helper type 1 (Th1)-like cells. In contrast, the dLN response includes Follicular helper (Tfh)-like cells but lacks Th1 cells. We identify an interferon-driven signature in Th1-like TILs, and show that it is found in human liver cancer and melanoma, in which it is negatively associated with response to checkpoint therapy. Our study unveils unsuspected differences between tumor and virus CD4+ T cell responses, and provides a proof-of-concept methodology to characterize tumor-control CD4+ T cell effector programs. Targeting these programs should help improve immunotherapy strategies.
  • GENOME ASSEMBLY AND VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES AND GRAPH BASED METHODS
    (2018) Ghurye, Jay; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The increased availability of genomic data and the increased ease and lower costs of DNA sequencing have revolutionized biomedical research. One of the critical steps in most bioinformatics analyses is the assembly of the genome sequence of an organism using the data generated from the sequencing machines. Despite the long length of sequences generated by third-generation sequencing technologies (tens of thousands of basepairs), the automated reconstruction of entire genomes continues to be a formidable computational task. Although long read technologies help in resolving highly repetitive regions, the contigs generated from long read assembly do not always span a complete chromosome or even an arm of the chromosome. Recently, new genomic technologies have been developed that can "bridge" across repeats or other genomic regions that are difficult to sequence or assemble and improve genome assemblies by "scaffolding" together large segments of the genome. The problem of scaffolding is vital in the context of both single genome assembly of large eukaryotic genomes and in metagenomics where the goal is to assemble multiple bacterial genomes in a sample simultaneously. First, we describe SALSA2, a method we developed to use interaction frequency between any two loci in the genome obtained using Hi-C technology to scaffold fragmented eukaryotic genome assemblies into chromosomes. SALSA2 can be used with either short or long read assembly to generate highly contiguous and accurate chromosome level assemblies. Hi-C data are known to introduce small inversion errors in the assembly, so we included the assembly graph in the scaffolding process and used the sequence overlap information to correct the orientation errors. Next, we present our contributions to metagenomics. We developed a scaffolding and variant detection method, MetaCarvel, for metagenomic datasets. Several factors such as the presence of inter-genomic repeats, coverage ambiguities, and polymorphic regions in the genomes complicate the task of scaffolding metagenomes. Variant detection is also tricky in metagenomes because the different genomes within these complex samples are not known beforehand. We showed that MetaCarvel was able to generate accurate scaffolds and find genome-wide variations de novo in metagenomic datasets. Finally, we present EDIT, a tool for clustering millions of DNA sequence fragments originating from the highly conserved 16S rRNA gene in bacteria. We extended the classical Four Russians speedup to banded sequence alignment and showed that our method clusters highly similar sequences efficiently. This method can also be used to remove duplicates or near duplicate sequences from a dataset. With the increasing volume of data being generated in different genomic and metagenomic studies using emerging sequencing technologies, our software tools and algorithms are well timed to meet the needs of the community.
  • Anti-Profiles for Anomaly Classification and Regression
    (2015) Dinalankara, Wikum; Bravo, Héctor C; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Anomaly detection is a classical problem in Statistical Learning with wide-reaching applications in security, networks, genomics and others. In this work, we formulate the anomaly classification problem as an extension to the detection problem: how to distinguish between samples from multiple heterogeneous classes that are anomalies relative to a well-defined, homogeneous, normal class. Our formulation of this learning setting arises from studies in cancer genomics, where this problem follows from prognosis and diagnosis applications. Standard binary and multi-class classification schemes are not well suited to the anomaly classification task since they attempt to directly model these highly unstable, heterogeneous classes. In this work, we show that robust classifiers can be obtained by modeling the degree of deviation from the normal class as a stable characteristic of each anomaly class. To do so, we formalize the anomaly classification problem, characterize it statistically and computationally via kernel methods and propose a class of robust learning methods, anti-profiles, specifically designed for this task. We focus on an open area of research in cancer genomics which motivates this project: the classification of tumors for prognosis and diagnosis. We provide experimental results obtained by applying the anti-profile method to gene expression data. In addition, we extend the anti-profile approach to use kernel functions, and develop a support-vector machine (SVM) based method for classification of anomalies based on their deviation from a stable normal class. We provide experimental results obtained by applying this method to genetic data to classify different stages of tumor progression, and show that this method provides much more stable classifiers than the application of regular classifiers. In addition, we show that this approach can be applied to anomaly classification problems in other application domains. We conclude by developing an SVM for censored survival information and demonstrate that the anti-profile method can produce stable classifiers for modeling the clinical outcome of clinical studies of cancer.
  • KNOWLEDGE DISCOVERY FROM GENE EXPRESSION DATA: NOVEL METHODS FOR SIMILARITY SEARCH, SIGNATURE DETECTION, AND CONFOUNDER CORRECTION
    (2012) Licamele, Louis; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Gene expression microarray data is used to answer a variety of scientific questions. For example, it can be used for gaining a better understanding of a drug, segmenting a disease, and predicting an optimal therapeutic response. The amount of gene expression data publicly available is extremely large and continues to grow at an increasing rate. However, this rapid growth of gene expression data from laboratories across the world has not fully achieved its potential impact on the scientific community. This shortcoming is due to the fact that the majority of the data has been gathered under varying conditions, and there is no principled way for combining and fully utilizing related data. Even within a closely controlled gene expression experiment, there are confounding factors that may mask the true signatures when analyzed with current methods. Therefore, we are interested in three core tasks that we believe are important for improving the utilization of gene array data: similarity search, signature detection, and confounder correction. We have developed novel methods that address each of these tasks. In this work, we first address the similarity search problem. More specifically, we propose methods that overcome experimental barriers in pairwise gene expression similarity calculations. We introduce a method, which we refer to as indirect similarity, which, unlike previous approaches, uses all of the information in a database to better inform the similarity calculation of a pair of gene expression profiles. We demonstrate that our method is more robust and better able to cope with experimental barriers such as vehicle and batch effects. We evaluate the ability of our method to retrieve compounds with similar therapeutic effects in two independent datasets. We evaluate the recall ability of our approach and show that our method results in an improvement of 97.03% and 49.44%, respectively, over existing state-of-the-art approaches. The second problem we focus on is signature detection. Gene expression experiments are performed to test a specific hypothesis. Generally, this hypothesis is that there is some genetic signature common in a group of samples. Current methods try to find the differentially expressed genes within a group of samples using a variety of techniques; however, they are all parametric. We introduce a nonparametric approach to group profile creation which we refer to as the Weighted Influence Model - Rank of Ranks method. For every probe on the microarray, the average rank is calculated across all members of a group. These average ranks are then re-ranked to form the group profile. We demonstrate the ability of our group profile method to better understand a disease and the underlying mechanism common to its treatments. Additionally, we demonstrate the predictive power of this group profile to detect novel drugs that could treat a particular disease. This method leads to the detection of robust group signatures even with unknown confounding effects. The final problem that we address is the challenge of removing known (annotated) confounding effects from gene expression profiles. We propose an extension to our nonparametric gene expression profile method to correct for observed confounding effects. This correction is performed on ranked lists directly, and it provides a robust alternative to parametric batch profile correction methods. We evaluate our novel profile subtraction method on two real-world datasets, comparing against several state-of-the-art parametric methods. We demonstrate an improvement in group signature detection using our method to remove confounding effects. Additionally, we show that in a dataset with the true group assignments removed and only the confounding effects labelled, our profile subtraction method allows for the discovery of the true groups. We evaluate the robustness of our methods using a gene expression profile generator that we developed.
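The rank-of-ranks step described in the last abstract (rank each probe within a sample, average those ranks across the group, then re-rank the averages) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names `rank_values` and `group_profile` are hypothetical, the ranking direction (1 = lowest value) and the tie handling (tied values share their average rank) are assumptions, and the sketch uses a plain unweighted mean because the abstract does not describe the weighting used in the Weighted Influence Model.

```python
from statistics import mean


def rank_values(values):
    """Return 1-based ranks (1 = smallest value); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the tied 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def group_profile(samples):
    """Build a rank-of-ranks profile from samples (each a list of probe values).

    All samples are assumed to list the same probes in the same order.
    """
    per_sample_ranks = [rank_values(sample) for sample in samples]
    # Average each probe's rank across all members of the group (unweighted).
    average_ranks = [mean(column) for column in zip(*per_sample_ranks)]
    # Re-rank the averages to obtain the group profile.
    return rank_values(average_ranks)


if __name__ == "__main__":
    # Three hypothetical samples measured on the same three probes.
    group = [
        [5.0, 1.0, 3.0],
        [9.0, 2.0, 4.0],
        [2.0, 1.0, 8.0],
    ]
    print(group_profile(group))  # prints [3.0, 1.0, 2.0]
```

Because only ranks enter the computation, the resulting profile does not change under any strictly increasing rescaling of an individual sample, which is one way to read the abstract's claim that the approach is nonparametric.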