Computer Science Research Works

Recent Submissions

  • Item
    Perplexity: evaluating transcript abundance estimation in the absence of ground truth
    (Springer Nature, 2022-03-25) Fan, Jason; Chan, Skylar; Patro, Rob
    There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene expression-based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models. Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
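    As a rough illustration of the metric this abstract adapts, the sketch below computes perplexity in its generic language-model form: the exponential of the negative mean log-likelihood over held-out items. It is not the paper's RNA-seq-specific derivation. The function name `perplexity` and the per-fragment likelihood lists are hypothetical, and the corner cases the abstract mentions (for example fragments assigned zero probability) are simply rejected here rather than handled.

```python
import math


def perplexity(fragment_likelihoods):
    """Generic perplexity: exp(-(1/N) * sum(log p_i)).

    `fragment_likelihoods` is a hypothetical list holding the likelihood a
    model assigns to each held-out fragment. Lower perplexity means the
    model explains the held-out fragments better.
    """
    if not fragment_likelihoods:
        raise ValueError("need at least one held-out fragment")
    if any(p <= 0.0 for p in fragment_likelihoods):
        # Assumption of this sketch: zero-probability fragments are rejected.
        # The paper extends the metric to account for such corner cases.
        raise ValueError("likelihoods must be strictly positive")
    mean_log_likelihood = sum(math.log(p) for p in fragment_likelihoods) / len(
        fragment_likelihoods
    )
    return math.exp(-mean_log_likelihood)


# Hypothetical likelihoods from two candidate models on the same held-out set.
model_a = [0.25, 0.25, 0.25, 0.25]
model_b = [0.50, 0.50, 0.50, 0.50]

# Model selection in this sketch: prefer the candidate with lower perplexity.
best = min([("a", model_a), ("b", model_b)], key=lambda pair: perplexity(pair[1]))
```

    In this toy comparison the candidate that assigns higher likelihood to every held-out fragment is selected, which mirrors the idea of choosing among abundance estimates without ground truth.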
  • Item
    Few amino acid positions in rpoB are associated with most of the rifampin resistance in Mycobacterium tuberculosis
    (Springer Nature, 2004-09-28) Cummings, Michael P; Segal, Mark R
    Mutations in rpoB, the gene encoding the β subunit of DNA-dependent RNA polymerase, are associated with rifampin resistance in Mycobacterium tuberculosis. Several studies have been conducted where minimum inhibitory concentration (MIC, which is defined as the minimum concentration of the antibiotic in a given culture medium below which bacterial growth is not inhibited) of rifampin has been measured and partial DNA sequences have been determined for rpoB in different isolates of M. tuberculosis. However, no model has been constructed to predict rifampin resistance based on sequence information alone. Such a model might provide the basis for quantifying rifampin resistance status based exclusively on DNA sequence data and thus eliminate the requirements for time-consuming culturing and antibiotic testing of clinical isolates. Sequence data for amino acid positions 511–533 of rpoB and associated MIC of rifampin for different isolates of M. tuberculosis were taken from studies examining rifampin resistance in clinical samples from New York City and throughout Japan. We used tree-based statistical methods and random forests to generate models of the relationships between rpoB amino acid sequence and rifampin resistance. The proportion of variance explained by a relatively simple tree-based cross-validated regression model involving two amino acid positions (526 and 531) is 0.679. The first partition in the data, based on position 531, results in groups that differ one hundredfold in mean MIC (1.596 μg/ml and 159.676 μg/ml). The subsequent partition based on position 526, the most variable in this region, results in a > 354-fold difference in MIC. When considered as a classification problem (susceptible or resistant), a cross-validated tree-based model correctly classified most (0.884) of the observations and was very similar to the regression model.
    Random forest analysis of the MIC data as a continuous variable, a regression problem, produced a model that explained 0.861 of the variance. The random forest analysis of the MIC data as discrete classes produced a model that correctly classified 0.942 of the observations with sensitivity of 0.958 and specificity of 0.885. Highly accurate regression and classification models of rifampin resistance can be made based on this short sequence region. Models may be better with improved (and consistent) measurements of MIC and more sequence data.
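    The summary statistics quoted in this abstract follow from simple arithmetic, sketched below under stated assumptions. The two mean MIC values are taken from the abstract; the confusion-matrix counts are hypothetical placeholders (the abstract reports only the resulting rates), and the helper names are illustrative, not from the paper.

```python
def fold_difference(mean_a, mean_b):
    """Ratio of the larger group mean to the smaller one."""
    low, high = sorted((mean_a, mean_b))
    return high / low


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Mean MIC (ug/ml) of the two groups produced by the split on position 531,
# as quoted in the abstract.
fold_531 = fold_difference(1.596, 159.676)

# Hypothetical counts, chosen only to show the calculation; they are NOT
# the study's data.
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=8, fp=2)
```

    With the quoted means, `fold_531` comes out at roughly 100, consistent with the "one hundredfold" description.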
  • Item
    Genome re-annotation: a wiki solution?
    (Springer Nature, 2007-02-01) Salzberg, Steven L
    The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution.
  • Item
    A finite element model for protein transport in vivo
    (Springer Nature, 2007-06-28) Sadegh Zadeh, Kouroush; Elman, Howard C; Montas, Hubert J; Shirmohammadi, Adel
    Biological mass transport processes determine the behavior and function of cells, regulate interactions between synthetic agents and recipient targets, and are key elements in the design and use of biosensors. Accurately predicting the outcomes of such processes is crucial to enhancing our understanding of how these systems function, enabling the design of effective strategies to control their function, and verifying that engineered solutions perform according to plan. A Galerkin-based finite element model was developed and implemented to solve a system of two coupled partial differential equations governing biomolecule transport and reaction in live cells. The simulator was coupled, in the framework of an inverse modeling strategy, with an optimization algorithm and an experimental time series, obtained by the Fluorescence Recovery after Photobleaching (FRAP) technique, to estimate biomolecule mass transport and reaction rate parameters. In the inverse algorithm, an adaptive method was implemented to calculate the sensitivity matrix. A multi-criteria termination rule was developed to stop the inverse code at the solution. The applicability of the model was illustrated by simulating the mobility and binding of GFP-tagged glucocorticoid receptor in the nucleoplasm of mouse adenocarcinoma. The numerical simulator shows excellent agreement with the analytic solutions and experimental FRAP data. Detailed residual analysis indicates that residuals have zero mean and constant variance and are normally distributed and uncorrelated. Therefore, the necessary and sufficient criteria for least-squares parameter optimization, which was used in this study, were met. The developed strategy is an efficient approach to extract as much physiochemical information from the FRAP protocol as possible.
    Well-posedness analysis of the inverse problem, however, indicates that the FRAP protocol provides insufficient information for unique simultaneous estimation of diffusion coefficient and binding rate parameters. Care should be exercised in drawing inferences, from FRAP data, regarding concentrations of free and bound proteins, average binding and diffusion times, and protein mobility unless they are confirmed by long-range Markov chain Monte Carlo (MCMC) methods and experimental observations.
  • Item
    Comparative study of meningitis dynamics across nine African countries: a global perspective
    (Springer Nature, 2007-07-10) Broutin, Hélène; Philippon, Solenne; de Magny, Guillaume Constantin; Courel, Marie-Françoise; Sultan, Benjamin; Guégan, Jean-François
    Meningococcal meningitis (MM) represents an important public health problem, especially in the "meningitis belt" in Africa. Although the seasonality of epidemics is well known, with outbreaks usually starting in the dry season, pluri-annual cycles are less well understood and less studied. In this context, we aimed to study MM case time series across 9 Sahelo-Sudanian countries to detect pluri-annual periodicity and to determine whether the dynamics are synchronous. This global and comparative approach allows a better understanding of MM evolution in time and space in the long term. We used the wavelet method, the mathematical tool best suited to time series analysis. We showed that, despite a strong consensus on the existence of a global pluri-annual cycle of MM epidemics, no such cycle exists. Indeed, even though a clear cycle is detected in every country, these cycles are not as permanent and regular as has been generally accepted for many years. Moreover, no global synchrony was detected, although many countries seemed correlated. These results, from the first large-scale study of MM dynamics, highlight the strong interest in, and the necessity of, a global survey of MM in order to predict and prevent large epidemics through an adapted vaccination strategy. International cooperation in public health and cross-disciplinary studies are highly recommended if this infectious disease is to be controlled.