Computer Science Theses and Dissertations

Permanent URI for this collectionhttp://hdl.handle.net/1903/2756

Browse

Search Results

Now showing 1 - 4 of 4
  • Thumbnail Image
    Item
    Network Algorithms for Complex Systems with Applications to Non-linear Oscillators and Genome Assembly
    (2013) Schmitt, Karl Robert Bruce; Girvan, Michelle; Zimin, Aleksey; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Network and complex system models are useful for studying a wide range of phenomena, from disease spread to traffic flow. Because of the broad applicability of the framework it is important to develop effective simulations and algorithms for complex networks. This dissertation presents contributions to two applied problems in this area First, we study an electro-optical, nonlinear, and time-delayed feedback loop commonly used in applications that require a broad range of chaotic behavior. For this system we detail a discrete-time simulation model, exploring the model's synchronization behavior under specific coupling conditions. Expanding upon already published results that investigated changes in feedback strength, we explore how both time-delay and nonlinear sensitivity impact synchronization. We also relax the requirement of strictly identical systems components to study how synchronization regions are affected when coupled systems have non-identical components (parameters). Last, we allow wider variance in coupling strengths, including unique strengths to each system, to identify a rich synchronization region not previously seen. In our second application, we take a complex networks approach to improving genome assembly algorithms. One key part of sequencing a genome is solving the orientation problem. The orientation problem is finding the relative orientations for each data fragment generated during sequencing. By viewing the genomic data as a network we can apply standard analysis techniques for community finding and utilize the significantly modular structure of the data. This structure informs development and application of two new heuristics based on (A) genetic algorithms and (B) hierarchical clustering for solving the orientation problem. Genetic algorithms allow us to preserve some internal structure while quickly exploring a large solution space. We present studies using a multi-scale genetic algorithm to solve the orientation problem. We show that this approach can be used in conjunction with currently used methods to identify a better solution to the orientation problem. Our hierarchical algorithm further utilizes the modular structure of the data. By progressively solving and merging sub-problems together we pick optimal `local' solutions while allowing more global corrections to occur later. Our results show significant improvements over current techniques for both generated data and real assembly data.
  • Thumbnail Image
    Item
    Computational methods to improve genome assembly and gene prediction
    (2011) Kelley, David Roy; Salzberg, Steven L; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    DNA sequencing is used to read the nucleotides composing the genetic material that forms individual organisms. As 2nd generation sequencing technologies offering high throughput at a feasible cost have matured, sequencing has permeated nearly all areas of biological research. By a combination of large-scale projects led by consortiums and smaller endeavors led by individual labs, the flood of sequencing data will continue, which should provide major insights into how genomes produce physical characteristics, including disease, and evolve. To realize this potential, computer science is required to develop the bioinformatics pipelines to efficiently and accurately process and analyze the data from large and noisy datasets. Here, I focus on two crucial bioinformatics applications: the assembly of a genome from sequencing reads and protein-coding gene prediction. In genome assembly, we form large contiguous genomic sequences from the short sequence fragments generated by current machines. Starting from the raw sequences, we developed software called Quake that corrects sequencing errors more accurately than previous programs by using coverage of k-mers and probabilistic modeling of sequencing errors. My experiments show correcting errors with Quake improves genome assembly and leads to the detection of more polymorphisms in re-sequencing studies. For post-assembly analysis, we designed a method to detect a particular type of mis-assembly where the two copies of each chromosome in diploid genomes diverge. We found thousands of examples in each of the chimpanzee, cow, and chicken public genome assemblies that created false segmental duplications. Shotgun sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to both discover unknown microbes and explore complex environments. We developed software called Scimm that clusters metagenomic sequences based on composition in an unsupervised fashion more accurately than previous approaches. Finally, we extended an approach for predicting protein-coding genes on whole genomes to metagenomic sequences by adding new discriminative features and augmenting the task with taxonomic classification and clustering of the sequences. The program, called Glimmer-MG, predicts genes more accurately than all previous methods. By adding a model for sequencing errors that also allows the program to predict insertions and deletions, accuracy significantly improves on error-prone sequences.
  • Thumbnail Image
    Item
    Highly Scalable Short Read Alignment with the Burrows-Wheeler Transform and Cloud Computing
    (2009) Langmead, Benjamin Thomas; Salzberg, Steven L; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Improvements in DNA sequencing have both broadened its utility and dramatically increased the size of sequencing datasets. Sequencing instruments are now used regularly as sources of high-resolution evidence for genotyping, methylation profiling, DNA-protein interaction mapping, and characterizing gene expression in the human genome and in other species. With existing methods, the computational cost of aligning short reads from the Illumina instrument to a mammalian genome can be very large: on the order of many CPU months for one human genotyping project. This thesis presents a novel application of the Burrows-Wheeler Transform that enables the alignment of short DNA sequences to mammalian genomes at a rate much faster than existing hashtable-based methods. The thesis also presents an extension of the technique that exploits the scalability of Cloud Computing to perform the equivalent of one human genotyping project in hours.
  • Thumbnail Image
    Item
    FEATURE GENERATION AND ANALYSIS APPLIED TO SEQUENCE CLASSIFICATION FOR SPLICE-SITE PREDICTION
    (2007-11-27) Islamaj, Rezarta; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Sequence classification is an important problem in many real-world applications. Sequence data often contain no explicit "signals," or features, to enable the construction of classification algorithms. Extracting and interpreting the most useful features is challenging, and hand construction of good features is the basis of many classification algorithms. In this thesis, I address this problem by developing a feature-generation algorithm (FGA). FGA is a scalable method for automatic feature generation for sequences; it identifies sequence components and uses domain knowledge, systematically constructs features, explores the space of possible features, and identifies the most useful ones. In the domain of biological sequences, splice-sites are locations in DNA sequences that signal the boundaries between genetic information and intervening non-coding regions. Only when splice-sites are identified with nucleotide precision can the genetic information be translated to produce functional proteins. In this thesis, I address this fundamental process by developing a highly accurate splice-site prediction model that employs our sequence feature-generation framework. The FGA model shows statistically significant improvements over state-of-the-art splice-site prediction methods. So that biologists can understand and interpret the features FGA constructs, I developed SplicePort, a web-based tool for splice-site prediction and analysis. With SplicePort the user can explore the relevant features for splicing, and can obtain splice-site predictions for the sequences based on these features. For an experimental biologist trying to identify the critical sequence elements of splicing, SplicePort offers flexibility and a rich motif exploration functionality, which may help to significantly reduce the amount of experimentation needed. In this thesis, I present examples of the observed feature groups and describe efforts to detect biological signals that may be important for the splicing process. Naturally, FGA can be generalized to other biologically inspired classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, as well as other sequence classification problems, provided we have sufficient knowledge of the new domain.