Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 8 of 8
  • DATA-DRIVEN ALGORITHMS FOR CHARACTERIZING STRUCTURAL VARIATION IN METAGENOMIC DATA
    (2024) Muralidharan, Harihara Subrahmaniam; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on single nucleotide polymorphisms or relying on reference-based analyses that miss complex structural variants. In this thesis, we describe data-driven methods to characterize variation in metagenomic data. In the first part of the thesis, I present our analysis of the structural variants identified from metagenomic whole-genome shotgun sequencing data. I begin by describing the power of assembly graph analysis to detect over 9 million structural variants, such as insertions/deletions, repeat copy-number changes, and mobile elements, in nearly 1,000 metagenomes generated as part of the Human Microbiome Project. Next, I describe Binnacle, a structural-variant-aware binning algorithm. To mitigate the fragmented nature of assemblies, metagenomic binning clusters contigs that are likely to have originated from the same genome. We show that binning "graph-based" scaffolds, rather than contigs, improves the quality of the bins and captures a broader set of the genes of the genomes being reconstructed. Finally, we present a case study of the microbial mats of Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ∼50°C to ∼65°C and plays a key role in fixing carbon and nitrogen. Previous studies have isolated and generated good-quality reference sequences for the two major Synechococcus spp., OS-A and OS-B', which share very high genomic content. Despite the high abundance of these Synechococcus spp., their metagenomic assembly is challenging due to the large number of rearrangements between them. We explore the genomic diversity of the Synechococcus spp. using reference-genome-reliant assembly and scaffolding, and we highlight that the variants we detect can be used to fingerprint the local biogeography of the hot spring.
    In the second part of the thesis, I present our analysis of amplicon sequencing data, specifically 16S rRNA gene sequences. I begin by describing SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate), a fast iterative algorithm for clustering large 16S rRNA gene datasets. We show that SCRAPT produces operational taxonomic units that are less fragmented than those of popular tools (UCLUST, CD-HIT, and DNACLUST) while running orders of magnitude faster than existing methods. Finally, we study the impact of transitive annotation on taxonomic classifiers. Taxonomic labels are assigned using machine learning algorithms trained to recognize individual taxonomic groups based on training sequences with known taxonomic labels. Ideally, the training data should rely on experimentally verified, formal taxonomic labels; in practice, however, the labels associated with sequences in biological databases are most commonly the result of computational predictions, i.e., "transitive annotation." We demonstrate that even a few computationally generated training data points can significantly skew the output of a classifier, to the point where entire regions of the taxonomic space can be disturbed. We also discuss key factors that affect the resilience of classifiers to transitively annotated training data and propose best practices to avoid the artifacts described in this thesis.
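    The sample-cluster-recruit-iterate loop behind SCRAPT can be sketched in a few lines. The sketch below is only a toy illustration under assumed details: it uses a naive per-position identity measure and greedy center picking where a real tool would use proper sequence identity and a dense-clustering step, and the names and thresholds are hypothetical rather than the thesis implementation.

```python
import random

def similarity(a: str, b: str) -> float:
    """Fraction of matching positions between equal-length sequences (a toy
    stand-in for the identity measure a real 16S clustering tool would use)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def scrapt_like_cluster(reads, sample_frac=0.1, threshold=0.97, rounds=10, seed=0):
    """Iteratively: sample unclustered reads, pick cluster centers from the
    sample, recruit all unclustered reads near a center, repeat on the rest."""
    rng = random.Random(seed)
    unclustered = list(reads)
    clusters = {}  # center sequence -> list of member sequences
    for _ in range(rounds):
        if not unclustered:
            break
        sample = rng.sample(unclustered, max(1, int(sample_frac * len(unclustered))))
        # Greedy centers within the sample (toy stand-in for dense clustering).
        centers = []
        for s in sample:
            if all(similarity(s, c) < threshold for c in centers):
                centers.append(s)
        # Recruit every unclustered read to the first sufficiently similar center.
        remaining = []
        for r in unclustered:
            for c in centers:
                if similarity(r, c) >= threshold:
                    clusters.setdefault(c, []).append(r)
                    break
            else:
                remaining.append(r)
        unclustered = remaining
    # Whatever is left becomes singleton clusters.
    for r in unclustered:
        clusters.setdefault(r, []).append(r)
    return clusters
```

    On each round, only a small random sample is clustered; the rest of the data is recruited cheaply to the resulting centers, which is what makes the iterate-on-the-remainder scheme fast on large datasets.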
  • Algorithms and Data Structures for Faster Nearest-Neighbor Classification
    (2022) Flores Velazco, Alejandro; Mount, David; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Given a set P of n labeled points in a metric space (X,d), the nearest-neighbor rule classifies an unlabeled query point q ∈ X with the class of q's closest point in P. Despite the advent of more sophisticated techniques, nearest-neighbor classification is still fundamental for many machine-learning applications. Over the years, this has motivated a large body of research aiming to reduce its high dependency on the size and dimensionality of the data. This dissertation presents various approaches to reduce the dependency of the nearest-neighbor rule from n to some smaller parameter k that describes the intrinsic complexity of the class boundaries of P. This is of particular significance, as it is usually assumed that k ≪ n on real-world training sets. One natural way to achieve this reduction is to reduce the training set itself, selecting a subset R ⊆ P to be used by the nearest-neighbor rule to answer incoming queries instead of P. Evidently, this approach reduces the dependencies of the nearest-neighbor rule from n, the size of P, to the size of R. This dissertation explores different techniques to select subsets whose sizes are proportional to k and that provide varying degrees of correct-classification guarantees. Another alternative bypasses training-set reduction and instead builds data structures designed to answer classification queries directly. To this end, this dissertation proposes the Chromatic AVD, a quadtree-based data structure designed to answer ε-approximate nearest-neighbor classification queries. The query time and space complexities of this data structure depend on k_ε, a generalization of k that describes the intrinsic complexity of the ε-approximate class boundaries of P.
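    For concreteness, here is the nearest-neighbor rule together with one classical training-set-reduction heuristic, Hart's condensed nearest neighbor. This is background for the subset-selection idea only, not one of the dissertation's algorithms (which come with size and correctness guarantees tied to k).

```python
import math

def nn_classify(P, q):
    """Nearest-neighbor rule: label q with the class of its closest point in P.
    P is a list of ((x, y), label) pairs; distance is Euclidean."""
    _, label = min(P, key=lambda p: math.dist(p[0], q))
    return label

def condense(P):
    """Hart's condensed nearest-neighbor heuristic: keep a subset R that still
    classifies every training point correctly. A classical baseline, not one
    of the dissertation's boundary-sensitive selection algorithms."""
    R = [P[0]]
    changed = True
    while changed:
        changed = False
        for point, label in P:
            if nn_classify(R, point) != label:
                R.append((point, label))
                changed = True
    return R
```

    Hart's heuristic only guarantees consistency on the training set; the selection algorithms in the dissertation instead target subsets whose sizes scale with the boundary complexity k.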
  • EVALUATING CLUSTERING ALGORITHMS TO IDENTIFY SUBPROBLEMS IN DESIGN PROCESSES
    (2017) Morency, Michael John; Herrmann, Jeffrey W; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Design problems are inherently intricate and require multiple dependent decisions. Because of these characteristics, design teams generally choose to decompose the main problem into manageable subproblems. This thesis describes the results of a study designed to (a) explore clustering algorithms as a new and repeatable way to identify subproblems in recorded design-team discussions, (b) assess the quality of the identified subproblems, and (c) examine any relationships between the subproblems and the final design or team experience level. We observed five teams of public health professionals and four teams of undergraduate students and applied four clustering algorithms to identify each team's subproblems and achieve the aforementioned research goals. The use of clustering algorithms to identify subproblems has not been documented before, and clustering presents a repeatable and objective method for determining a team's subproblems. The results from these algorithms, as well as metrics noting each result's quality, were captured for all teams. We learned that each clustering algorithm has strengths and weaknesses depending on how the team discussed the problem, but the algorithms always accurately identify at least some of the discussed subproblems. Studying these identified subproblems reveals a team's design process and provides insight into their final design choices.
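    One plausible realization of the idea: treat each utterance in a discussion transcript as a document, embed it with TF-IDF, and cluster with k-means so that each cluster is a candidate subproblem. The thesis does not specify this pipeline; the features, algorithm choice, and cluster count below are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_utterances(utterances, n_subproblems=4, seed=0):
    """Group transcript utterances into candidate subproblems by lexical
    similarity (TF-IDF features + k-means). Assumes len(utterances) is at
    least n_subproblems."""
    X = TfidfVectorizer(stop_words="english").fit_transform(utterances)
    labels = KMeans(n_clusters=n_subproblems, random_state=seed,
                    n_init=10).fit_predict(X)
    groups = {}
    for utterance, label in zip(utterances, labels):
        groups.setdefault(label, []).append(utterance)
    return groups
```

    In practice the interesting part is exactly what the study measures: how well such automatically recovered clusters line up with the subproblems a human coder would identify in the same transcript.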
  • Promised streaming algorithms and finding pseudo-repetitions
    (2013) Tiseanu, Catalin Stefan; Hajiaghayi, MohammadTaghi; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    As the size of data available for processing increases, new models of computation are needed. This motivates the study of data streams, which are sequences of information in which each element can be read only after the previous one. In this work we study two particular streaming variants: promised graph streaming algorithms and combinatorial queries on large words. We give an ω(n) lower bound on working memory, where n is the number of vertices of the graph, for a variety of problems in which the input graphs are promised to be forests. The crux of the proofs is reductions from the field of communication complexity. Finally, we give an upper bound for two problems related to finding pseudo-repetitions in words via anti-/morphisms, for which we also propose streaming versions.
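    To make the second topic concrete: a word w is a pseudo-repetition if it factors into blocks, each equal to some word u or to its image under a morphism or antimorphism f. Below is a toy offline checker (brute force over the block length, assuming the first block is u itself), using the DNA reverse complement as the antimorphism; the thesis's streaming versions must answer such queries without storing all of w.

```python
def is_pseudo_repetition(w, f):
    """Check whether w factors as t1 t2 ... tk (k >= 2) with every ti in
    {u, f(u)} for some word u. f is a morphism/antimorphism given as a string
    function. Brute force over the length of u, assuming the factorization
    starts with u itself; a streaming algorithm would avoid storing w."""
    n = len(w)
    for m in range(1, n // 2 + 1):
        if n % m:
            continue
        u = w[:m]
        fu = f(u)
        blocks = [w[i:i + m] for i in range(0, n, m)]
        if all(b in (u, fu) for b in blocks):
            return True
    return False

def reverse_complement(u):
    """DNA reverse complement: a standard example of an antimorphism."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[c] for c in reversed(u))

# "ACGCGT" = ACG + f(ACG), since f("ACG") == "CGT".
assert is_pseudo_repetition("ACGCGT", reverse_complement)
```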
  • Network Algorithms for Complex Systems with Applications to Non-linear Oscillators and Genome Assembly
    (2013) Schmitt, Karl Robert Bruce; Girvan, Michelle; Zimin, Aleksey; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Network and complex-system models are useful for studying a wide range of phenomena, from disease spread to traffic flow. Because of the broad applicability of the framework, it is important to develop effective simulations and algorithms for complex networks. This dissertation presents contributions to two applied problems in this area. First, we study an electro-optical, nonlinear, time-delayed feedback loop commonly used in applications that require a broad range of chaotic behavior. For this system we detail a discrete-time simulation model, exploring the model's synchronization behavior under specific coupling conditions. Expanding upon already published results that investigated changes in feedback strength, we explore how both time delay and nonlinear sensitivity impact synchronization. We also relax the requirement of strictly identical system components to study how synchronization regions are affected when coupled systems have non-identical components (parameters). Last, we allow wider variance in coupling strengths, including strengths unique to each system, to identify a rich synchronization region not previously seen. In our second application, we take a complex-networks approach to improving genome assembly algorithms. One key part of sequencing a genome is solving the orientation problem: finding the relative orientations of each data fragment generated during sequencing. By viewing the genomic data as a network, we can apply standard analysis techniques for community finding and utilize the significantly modular structure of the data. This structure informs the development and application of two new heuristics, based on (A) genetic algorithms and (B) hierarchical clustering, for solving the orientation problem. Genetic algorithms allow us to preserve some internal structure while quickly exploring a large solution space. We present studies using a multi-scale genetic algorithm to solve the orientation problem and show that this approach can be used in conjunction with currently used methods to identify a better solution. Our hierarchical algorithm further utilizes the modular structure of the data: by progressively solving and merging subproblems, we pick optimal 'local' solutions while allowing more global corrections to occur later. Our results show significant improvements over current techniques for both generated data and real assembly data.
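    The orientation problem can be phrased as a constraint-satisfaction task: assign each fragment an orientation so that as many pairwise same/opposite-orientation constraints (derived from overlaps) as possible are satisfied. The sketch below uses plain single-flip local search as a stand-in for the genetic and hierarchical heuristics above; the data encoding is an assumption.

```python
def orient_fragments(n, constraints, max_passes=50):
    """Assign an orientation (0 or 1) to each of n fragments, trying to satisfy
    as many constraints as possible. Each constraint (i, j, same) says
    fragments i and j should have the same (True) or opposite (False)
    orientation. Greedy single-flip local search; a toy stand-in for the
    multi-scale genetic and hierarchical heuristics described above."""
    orient = [0] * n

    def gain(i):
        # Net number of constraints involving i that flipping i would
        # newly satisfy (satisfied ones become unsatisfied, and vice versa).
        g = 0
        for a, b, same in constraints:
            if i not in (a, b):
                continue
            satisfied = (orient[a] == orient[b]) == same
            g += -1 if satisfied else 1
        return g

    for _ in range(max_passes):
        improved = False
        for i in range(n):
            if gain(i) > 0:
                orient[i] ^= 1
                improved = True
        if not improved:  # local optimum reached
            break
    return orient
```

    Local search like this gets stuck in local optima, which is precisely the weakness that population-based methods such as genetic algorithms, and the divide-and-merge structure of a hierarchical approach, are meant to mitigate.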
  • Computationally Comparing Biological Networks and Reconstructing Their Evolution
    (2012) Patro, Robert; Kingsford, Carleton L; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Biological networks, such as protein-protein interaction, regulatory, or metabolic networks, provide information about biological function beyond what can be gleaned from sequence alone. Unfortunately, most computational problems associated with these networks are NP-hard. In this dissertation, we develop algorithms to tackle numerous fundamental problems in the study of biological networks. First, we present a system for classifying the binding affinity of peptides to a diverse array of immunoglobulin antibodies. Computational approaches to this problem are integral to virtual screening and modern drug discovery. Our system is based on an ensemble of support vector machines and exhibits state-of-the-art performance, placing first in the 2010 DREAM5 competition. Second, we investigate the problem of biological network alignment. Aligning the biological networks of different species allows for the discovery of shared structures and conserved pathways. We introduce an original procedure for network alignment based on a novel topological node signature. The pairwise global alignments of biological networks produced by our procedure, when evaluated under multiple metrics, are both more accurate and more robust to noise than those of previous work. Next, we explore the problem of ancestral network reconstruction. Knowing the state of ancestral networks allows us to examine how biological pathways have evolved, and how pathways in extant species have diverged from those of their common ancestor. We describe a novel framework for representing the evolutionary histories of biological networks and present efficient algorithms for reconstructing either a single parsimonious evolutionary history or an ensemble of near-optimal histories. Under multiple models of network evolution, our approaches are effective at inferring the ancestral network interactions. Additionally, the ensemble approach is robust to noisy input, and can be used to impute missing interactions in experimental data. Finally, we introduce a framework, GrowCode, for learning network growth models. While previous work focuses on developing growth models manually, or on procedures for learning parameters for existing models, GrowCode learns fundamentally new growth models that match target networks in a flexible and user-defined way. We show that models learned by GrowCode produce networks whose target properties match those of real-world networks more closely than existing models.
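    The alignment idea can be illustrated with a deliberately crude topological signature, a node's degree plus its neighbors' sorted degrees, matched greedily across two networks. The dissertation's signature and alignment procedure are considerably more refined; everything below is a toy assumption meant only to convey matching-by-local-topology.

```python
import networkx as nx

def signature(G, v):
    """Toy topological signature: a node's degree plus the sorted degrees of
    its neighbors. Far cruder than the dissertation's node signature."""
    return (G.degree(v), tuple(sorted(G.degree(u) for u in G[v])))

def greedy_align(G1, G2):
    """Greedily pair nodes across G1 and G2, highest-degree nodes first,
    matching each to the unused G2 node with the closest degree."""
    sigs2 = {v: signature(G2, v) for v in G2}
    mapping, used = {}, set()
    for v in sorted(G1, key=lambda v: -G1.degree(v)):
        s1 = signature(G1, v)
        best = min((u for u in G2 if u not in used),
                   key=lambda u: abs(s1[0] - sigs2[u][0]), default=None)
        if best is not None:
            mapping[v] = best
            used.add(best)
    return mapping
```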
  • COMPUTING APPROXIMATE CUSTOMIZED RANKING
    (2009) Wu, Yao; Raschid, Louiqa; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    As the amount of information grows and users become more sophisticated, ranking techniques become important building blocks for meeting user needs when answering queries. PageRank is one of the most successful link-based ranking methods; it iteratively computes importance scores for web pages based on the importance scores of incoming pages. Due to its success, PageRank has been applied in a number of applications that require customization. We address the scalability challenges of two types of customized ranking. The first challenge is computing the ranking of a subgraph. Various Web applications, such as focused crawlers and localized search engines, focus on identifying a subgraph. The second challenge is computing online personalized ranking. Personalized search improves the quality of search results for each user; the user's needs are represented by a personalized set of pages or by personalized link importance in an entity-relationship graph, which requires an efficient online computation. To solve the subgraph ranking problem efficiently, we estimate the ranking scores for a subgraph. We propose a framework with an exact solution (IdealRank) and an approximate solution (ApproxRank) for computing ranking on a subgraph. Both IdealRank and ApproxRank represent the set of external pages with an external node Λ and modify the PageRank-style transition matrix with respect to Λ. The IdealRank algorithm assumes that the scores of external pages are known. We prove that the IdealRank scores for pages in the subgraph converge to the true PageRank scores. Since the PageRank-style scores of external pages are typically unavailable, we propose the ApproxRank algorithm to estimate scores for the subgraph. We analyze the L1 distance between the IdealRank and ApproxRank scores of the subgraph and show that it is within a constant factor of the L1 distance of the external pages. We demonstrate with real and synthetic data that ApproxRank provides a good approximation to PageRank for a variety of subgraphs. We then consider online personalization using ObjectRank, an authority-flow-based ranking for entity-relationship graphs. We formalize the concept of an aggregate surfer on a data graph, whose behavior is controlled by multiple personalized rankings, and we prove a linearity theorem over these rankings that can be used as a tool to scale this type of personalization. DataApprox uses a repository of rankings precomputed for a given set of link-weight assignments. We define DataApprox as an optimization problem: it selects a subset of the precomputed rankings from the repository and produces a weighted combination of them. We analyze the L1 distance between the DataApprox scores and the true authority-flow ranking scores and show that DataApprox admits a theoretical bound. Our experiments on the DBLP data graph show that DataApprox performs well in practice and allows fast and accurate personalized authority-flow ranking.
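    The Λ construction described above lends itself to a short power-iteration sketch: collapse all external pages into one aggregate node, modify the transition matrix accordingly, and iterate to convergence. How Λ's outgoing probability mass is split back into the subgraph is an assumption here (uniform), not the paper's exact construction.

```python
import numpy as np

def approx_subgraph_rank(A_sub, out_to_external, damping=0.85, tol=1e-10):
    """PageRank-style power iteration on a subgraph whose external pages are
    collapsed into one aggregate node Lambda, in the spirit of ApproxRank.
    A_sub: (n, n) numpy 0/1 array, A_sub[i][j] = 1 if page i links to page j
    within the subgraph. out_to_external[i]: number of links from page i to
    pages outside the subgraph."""
    n = len(A_sub)
    m = n + 1  # index n is the aggregate external node Lambda
    T = np.zeros((m, m))
    for i in range(n):
        deg = A_sub[i].sum() + out_to_external[i]
        if deg == 0:
            T[i, :] = 1.0 / m  # dangling page: jump uniformly
            continue
        T[i, :n] = A_sub[i] / deg
        T[i, n] = out_to_external[i] / deg
    T[n, :n] = 1.0 / n  # assumed: Lambda links back uniformly into the subgraph
    r = np.full(m, 1.0 / m)
    while True:
        r_new = (1 - damping) / m + damping * (r @ T)
        if np.abs(r_new - r).sum() < tol:  # L1 convergence test
            return r_new[:n]               # scores for subgraph pages only
        r = r_new
```

    The L1 convergence test mirrors the L1-distance analysis in the abstract; the quality of the result hinges entirely on how well Lambda's assumed behavior approximates the true external pages.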
  • MODELING AND ANALYSIS OF MASSIVE SOCIAL NETWORKS
    (2005-06-16) Wang, Nan; Srinivasan, Aravind; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Traditional epidemiological research has focused on rate-based differential-equation models with completely mixing populations. Although successful in explaining certain phenomena of disease spread, the traditional approach is unable to deal with disease spread in realistic massive social networks, where most people mix locally with only a few other people. We have developed an approach based on a combination of network theory and discrete-event simulation to study epidemics in large urban areas without assuming completely mixing populations. Our results include (1) detailed structural and temporal analyses of the social contact networks produced by TRANSIMS, a simulator for detailed transportation/traffic studies; (2) realistic simulation of contagious diseases (e.g., smallpox) on the social contact networks through EpiSims, a simulation-based analytical tool to study the spread of infectious diseases in an urban environment; (3) identification of a number of new measures that are significant for understanding epidemics and for developing new strategies in policy planning; (4) introduction of random-graph models for theoretical analysis of the structural and algorithmic aspects of the social networks; and (5) combinatorial formulations and approximation algorithms for performing quarantine, vaccination, and sensor placement, as aids to decision-making. The social network that we have mostly dealt with is for the city of Portland, Oregon, USA, developed as part of the TRANSIMS/EpiSims project at Los Alamos National Laboratory. The most expressive social contact network is a bipartite graph representing people and locations; edges represent people visiting locations on a typical day. We also build random-graph models that generate a family of social networks by taking as input some basic parameters of the Portland social network, and we analyze the social networks generated by these models.
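    The bipartite people-locations representation suggests a simple pipeline: project co-located people into a person-person contact network, then run a discrete-time epidemic model over it. The sketch below is a crude stand-in for EpiSims, with an assumed one-day infectious period and a made-up transmission probability.

```python
import random

def project_contacts(visits):
    """visits: list of (person, location) pairs for a typical day (the
    bipartite people-locations graph). Two people are in contact if they
    visit a common location. Returns person -> set of contacts."""
    at_location, contacts = {}, {}
    for person, loc in visits:
        at_location.setdefault(loc, set()).add(person)
    for people in at_location.values():
        for p in people:
            contacts.setdefault(p, set()).update(people - {p})
    return contacts

def sir_simulation(contacts, seeds, p_transmit=0.05, days=100, seed=0):
    """Discrete-time SIR on the contact network: each day, every infected
    person infects each susceptible contact with probability p_transmit,
    then recovers (assumed one-day infectious period). A toy model, not
    EpiSims."""
    rng = random.Random(seed)
    susceptible = set(contacts) - set(seeds)
    infected, recovered = set(seeds), set()
    history = []
    for _ in range(days):
        if not infected:
            break
        newly = {q for p in infected for q in contacts.get(p, ())
                 if q in susceptible and rng.random() < p_transmit}
        susceptible -= newly
        recovered |= infected
        infected = newly
        history.append((len(susceptible), len(infected), len(recovered)))
    return history
```

    Interventions such as quarantine, vaccination, or sensor placement correspond to removing nodes or edges from the contact network before (or during) the simulation, which is exactly the combinatorial view taken in result (5) above.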