Mathematics
Permanent URI for this community: http://hdl.handle.net/1903/2261
4 results
Search Results
Item
Bayesian Estimation of the Inbreeding Coefficient for Single Nucleotide Polymorphism Using Complex Survey Data (2015)
Xue, Zhenyi; Lahiri, Partha; Li, Yan; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In genome-wide association studies (GWAS), single nucleotide polymorphisms (SNPs) are often used as genetic markers to study gene-disease association. Some large-scale health sample surveys have recently started collecting genetic data, and there is now growing interest in developing statistical procedures that use genetic survey data. This calls for innovative statistical methods that incorporate both genetic and statistical sampling. Under simple random sampling, the traditional estimator of the inbreeding coefficient is given by 1 - (number of observed heterozygotes) / (number of expected heterozygotes). Genetic data quality control reports published by the National Health and Nutrition Examination Survey (NHANES) and the Health and Retirement Study (HRS) use this simple estimator, which serves as a reasonable quality control tool to identify problems such as genotyping error. There is, however, a need to improve on this estimator by considering different features of the complex survey design. The main goal of this dissertation is to fill this important research gap. First, a design-based estimator and its associated jackknife standard error estimator are proposed. Second, a hierarchical Bayesian methodology is developed using the effective sample size and genotype count. Lastly, a Bayesian pseudo-empirical likelihood estimator is proposed that uses the expected number of heterozygotes in the estimating equation as a constraint when maximizing the pseudo-empirical likelihood. One advantage of the proposed Bayesian methodology is that the prior distribution can be used to restrict the parameter space induced by the general inbreeding model.
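As a rough illustration of the traditional estimator quoted above, the following Python sketch computes 1 - (observed heterozygotes) / (expected heterozygotes) for a single biallelic SNP under simple random sampling. The function name, the genotype-count arguments, and the use of Hardy-Weinberg proportions to obtain the expected heterozygote count are assumptions made for this example; the sketch does not implement the design-based or Bayesian estimators proposed in the dissertation.

```python
def inbreeding_coefficient(n_AA, n_Aa, n_aa):
    """Simple-random-sampling estimate of the inbreeding coefficient.

    n_AA, n_Aa, n_aa are hypothetical genotype counts for one biallelic SNP.
    The expected heterozygote count is taken from Hardy-Weinberg
    proportions (an assumption of this sketch).
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes observed")
    # Sample frequency of allele A: each AA carries two copies, each Aa one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    # Expected heterozygotes under Hardy-Weinberg equilibrium: 2 p (1 - p) n.
    expected_het = 2 * p * (1 - p) * n
    if expected_het == 0:
        raise ValueError("SNP is monomorphic; the estimator is undefined")
    return 1 - n_Aa / expected_het


# Hardy-Weinberg proportions give an estimate of zero.
print(inbreeding_coefficient(25, 50, 25))
# A deficit of heterozygotes gives a positive estimate.
print(inbreeding_coefficient(40, 20, 40))
```

A markedly positive or negative value for a SNP is the kind of signal a quality control report might use to flag a possible genotyping problem.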
The proposed estimators are evaluated using Monte Carlo simulation studies. Moreover, the proposed estimates of the inbreeding coefficients of SNPs from the APOC1 and BDNF genes are compared using data from the 2006 Health and Retirement Study.

Item
MULTIVARIATE METHODS FOR HIGH-THROUGHPUT BIOLOGICAL DATA WITH APPLICATION TO COMPARATIVE GENOMICS (2015)
Hsiao, Chiao-wen; Corrada Bravo, Héctor; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Phenotypic variation in multi-cellular organisms arises as a result of complex gene regulation mechanisms. The development of modern high-throughput technology opens up the possibility of genome-wide interrogation of aspects of these mechanisms across molecular phenotypes. Multivariate statistical methods provide convenient frameworks for modeling and analyzing data obtained from high-throughput experiments probing these complex aspects. This dissertation presents multivariate statistical methods to analyze data arising from two specific high-throughput molecular assays: (1) ribosome footprint profiling experiments, and (2) flow cytometry data. Ribosome footprint profiling describes an in vivo translation profile in a living cell and offers insights into the process of post-transcriptional gene regulation. Translation efficiency (TE) is a measure that quantifies the rate at which active translation is occurring for each gene, defined as the ratio of ribosome protected fragment count to mRNA fragment count. We introduce pairedSeq, an empirical covariance shrinkage method for differential testing of translation efficiency from sequencing data. The method draws on variance decomposition techniques in mixed-effect modeling and analysis of variance.
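The definition of translation efficiency given above can be made concrete with a minimal Python sketch. The gene names and counts are invented for illustration, and the function is a hypothetical helper; it computes only the raw per-gene ratio and does not implement pairedSeq or any normalization a real analysis would likely need.

```python
def translation_efficiency(rpf_count, mrna_count):
    """Ratio of ribosome protected fragment count to mRNA fragment count."""
    if mrna_count <= 0:
        raise ValueError("mRNA fragment count must be positive")
    return rpf_count / mrna_count


# Hypothetical (RPF count, mRNA count) pairs for two genes in one sample.
counts = {"geneA": (200, 100), "geneB": (50, 400)}

te = {gene: translation_efficiency(rpf, mrna)
      for gene, (rpf, mrna) in counts.items()}
print(te)
```

Because TE is a ratio of two noisy counts, its variance depends on how the two counts co-vary across samples, which is the issue the abstract raises when it discusses co-variability between ribosome occupancy and transcript abundance.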
Benchmark tests against existing methods reveal that pairedSeq effectively detects signals in genes with high variation in expression measurements across samples due to high co-variability between ribosome occupancy and transcript abundance. In contrast, existing methods tend to mistake genes with negative co-variability for signals, because they underestimate the variance when negative co-variability is not accounted for. We then present a genome-wide survey of primate species divergence at the translational and post-translational layer of gene regulation. Flow cytometry (FCM) is routinely employed to measure cellular characteristics such as mRNA and protein expression at the single-cell level. While many computational methods have been developed that focus on identifying cell populations in individual FCM samples, very few have addressed how the identified cell populations can be matched across samples for comparative analysis. FlowMap-FR can be used to quantify the similarity between cell populations under scenarios of proportion differences and modest position shifts, and to identify situations in which inappropriate splitting or merging of cell populations has occurred during gating procedures. It has been implemented as a stand-alone R/Bioconductor package that is easily incorporated into current FCM data analysis workflows.

Item
Genome Assembly Techniques (2011)
Marcais, Guillaume; Yorke, James; Kingsford, Carl; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Since the publication of the human genome in 2001, the cost of DNA sequencing and the time it requires have dropped dramatically. The genomes of many more species have since been sequenced, and genome sequencing is an ever more important tool for biologists.
This trend will likely revolutionize biology and medicine in the near future, when the genome sequence of each individual person, instead of a model human genome, becomes readily accessible. Nevertheless, genome assembly remains a challenging computational problem, even more so with second-generation sequencing technologies, which generate a greater amount of data and make the assembly process more complex. Research to quickly, cheaply and accurately assemble the increasing amount of DNA sequenced is of great practical importance. In the first part of this thesis, we present two software tools developed to improve genome assemblies. First, Jellyfish is a fast k-mer counter, capable of handling large data sets. k-mer frequencies are central to many tasks in genome assembly (e.g. error correction, finding read overlaps) and to other studies of the genome (e.g. finding highly repeated sequences such as transposons). Second, Chromosome Builder is a scaffolding and contig placement tool. It aims at improving the accuracy of genome assembly. In the second part of this thesis we explore several problems dealing with graphs. Graph theory can be used to solve many computational problems. For example, the genome assembly problem can be represented as finding an Eulerian path in a de Bruijn graph. The physical interactions between proteins (PPI networks), or between transcription factors and genes (regulatory networks), are naturally expressed as graphs. First, we introduce the concept of "exactly 3-edge-connected" graphs. These graphs have only a remote biological motivation but are interesting in their own right.
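To make the k-mer counting task mentioned earlier in this abstract concrete, here is a naive Python sketch that tallies every length-k substring of a set of reads. It is only an illustration of what a k-mer counter computes; it is not how Jellyfish is implemented, and the function name and example reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        # A read of length L contains L - k + 1 overlapping k-mers;
        # reads shorter than k contribute none.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy reads; real data sets contain millions of reads.
kmer_counts = count_kmers(["ACGTAC", "CGT"], 3)
print(kmer_counts.most_common())
```

A dictionary-based counter like this holds every distinct k-mer in memory in a single thread, which is why dedicated tools are needed for large sequencing data sets.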
Second, we study the reconstruction of ancestral networks, which aims at inferring the state of ancestral species' biological networks based on the networks of current species.

Item
Mathematical modeling of drug resistance and cancer stem cells dynamics (2010)
Tomasetti, Cristian; Levy, Doron; Dolgopyat, Dmitry; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this dissertation we consider the dynamics of drug resistance in cancer and the related issue of the dynamics of cancer stem cells. Our focus is only on resistance that is caused by random genetic point mutations. A very simple system of ordinary differential equations allows us to obtain results that are comparable to those found in the literature, with one important difference. We show that the amount of resistance that is generated before the beginning of the treatment, and which is present at some given time afterward, always depends on the turnover rate, no matter how many drugs are used. Previous work in the literature indicated no dependence on the turnover rate in the single-drug case but a strong dependence in the multi-drug case. We develop a new methodology to estimate the probability of developing resistance to drugs by the time a tumor is diagnosed, and the expected number of drug-resistant cells found at detection if resistance is present at detection. Our modeling methodology may be seen as more general than previous approaches, in the sense that, at least for the wild-type population, we make assumptions only on its averaged behavior (no Markov property, for example). Importantly, the heterogeneity of the cancer population is taken into account.
Moreover, in the case of chronic myeloid leukemia (CML), which is a cancer of the white blood cells, we are able to infer the preferred mode of division of the hematopoietic cancer stem cells, predicting a large shift from asymmetric division to symmetric renewal. We extend our results by relaxing the assumption on the average growth of the tumor, thus going beyond the standard exponential case, and show that our results may also be a good approximation for much more general forms of tumor growth models. Finally, after reviewing the basic modeling assumptions and main results found in the mathematical modeling literature on CML, we formulate a new hypothesis on the effects that the drug Imatinib has on leukemic stem cells. Based on this hypothesis, we obtain new insights into the dynamics of the development of drug resistance in CML.