Computer Science Research Works
Permanent URI for this collectionhttp://hdl.handle.net/1903/1593
Browse
6 results
Search Results
Item Coral: an integrated suite of visualizations for comparing clusterings(Springer Nature, 2012-10-29) Filippova, Darya; Gadani, Aashish; Kingsford, CarlClustering has become a standard analysis for many types of biological data (e.g interaction networks, gene expression, metagenomic abundance). In practice, it is possible to obtain a large number of contradictory clusterings by varying which clustering algorithm is used, which data attributes are considered, how algorithmic parameters are set, and which near-optimal clusterings are chosen. It is a difficult task to sift though such a large collection of varied clusterings to determine which clustering features are affected by parameter settings or are artifacts of particular algorithms and which represent meaningful patterns. Knowing which items are often clustered together helps to improve our understanding of the underlying data and to increase our confidence about generated modules. We present Coral, an application for interactive exploration of large ensembles of clusterings. Coral makes all-to-all clustering comparison easy, supports exploration of individual clusterings, allows tracking modules across clusterings, and supports identification of core and peripheral items in modules. We discuss how each visual component in Coral tackles a specific question related to clustering comparison and provide examples of their use. We also show how Coral could be used to visually and quantitatively compare clusterings with a ground truth clustering. As a case study, we compare clusterings of a recently published protein interaction network of Arabidopsis thaliana. We use several popular algorithms to generate the network’s clusterings. We find that the clusterings vary significantly and that few proteins are consistently co-clustered in all clusterings. This is evidence that several clusterings should typically be considered when evaluating modules of genes, proteins, or sequences, and Coral can be used to perform a comprehensive analysis of these clustering ensembles.Item Parsimonious reconstruction of network evolution(2012-09-19) Patro, Rob; Sefer, Emre; Malin, Justin; Marcais, Guillaume; Navlakha, Saket; Kingsford, CarlBackground: Understanding the evolution of biological networks can provide insight into how their modular structure arises and how they are affected by environmental changes. One approach to studying the evolution of these networks is to reconstruct plausible common ancestors of present-day networks, allowing us to analyze how the topological properties change over time and to posit mechanisms that drive the networks’ evolution. Further, putative ancestral networks can be used to help solve other difficult problems in computational biology, such as network alignment. Results: We introduce a combinatorial framework for encoding network histories, and we give a fast procedure that, given a set of gene duplication histories, in practice finds network histories with close to the minimum number of interaction gain or loss events to explain the observed present-day networks. In contrast to previous studies, our method does not require knowing the relative ordering of unrelated duplication events. Results on simulated histories and real biological networks both suggest that common ancestral networks can be accurately reconstructed using this parsimony approach. A software package implementing our method is available under the Apache 2.0 license at http://cbcb.umd.edu/kingsford-group/parana. Conclusions: Our parsimony-based approach to ancestral network reconstruction is both efficient and accurate. We show that considering a larger set of potential ancestral interactions by not assuming a relative ordering of unrelated duplication events can lead to improved ancestral network inference.Item Assembly complexity of prokaryotic genomes using short reads(2010-01-12) Kingsford, Carl; Schatz, Michael C; Pop, MihaiBackground: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.Item Alignment and clustering of phylogenetic markers - implications for microbial diversity studies(2010-03-24) White, James R; Navlakha, Saket; Nagarajan, Niranjan; Ghodsi, Mohammad-Reza; Kingsford, Carl; Pop, MihaiBackground: Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, there exists significant variability in analysis methods employed in these studies. Results: Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters. Conclusions: Our results highlight the need for systematic and open evaluation of data analysis methodologies,especially as targeted 16S rRNA diversity studies are increasingly relying on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website http://fames.jgi-psf. org/.Item Global Network Alignment Using Multiscale Spectral Signatures(2011) Patro, Rob; Kingsford, CarlMotivation: Protein interaction networks provide an important system-level view of biological processes. One of the fundamental problems in biological network analysis is the global alignment of a pair of networks, which puts the proteins of one network into correspondence with the proteins of another network in a manner that conserves their interactions while respecting other evidence of their homology. By providing a mapping between the networks of different species, alignments can be used to inform hypotheses about the functions of unannotated proteins, the existence of unobserved interactions, the evolutionary divergence between the two species and the evolution of complexes and pathways. Results: We introduce GHOST, a global pairwise network aligner that uses a novel spectral signature to measure topological similarity across disparate networks. It exhibits state-of-the-art performance on several network alignment tasks. We show that the spectral signature used by GHOST is highly discriminative, while the alignments it produces are also robust to experimental noise. When compared with other recent approaches, we find that GHOST is able to recover larger and biologically-significant, shared subnetworks between species. Availability: An efficient and parallelized implementation of GHOST, released under the Apache 2.0 license, is available at http:// cbcb.umd.edu/kingsford-group/ghostItem Graph rigidity reveals non-deformable collections of chromosome conformation constraints(2011-12-14) Duggal, Geet; Kingsford, CarlMotivation: The physical structure of chromatin is associated with a variety of biological phenomena including long-range regulation, chromosome rearrangements, and somatic copy number alterations. Chromosome conformation capture is a recent experimental technique that results in pairwise proximity measurements between chromosome locations in a genome. This information can be used to construct three-dimensional models of portions of chromosomes or entire genomes using a variety of recently proposed algorithms. However, it is possible that these distance measurements do not provide the proper constraints to uniquely specify such an embedding. It is therefore necessary to separate regions of the chromatin structure that are sufficiently constrained from regions with measurements that suggest a more pliable structure. This separation will allow studies of correlations betweeen chromatin organization and genome function to be targeted to the sufficiently constrained, high-confidence substructures within an embedding. Results: Using rigidity theory, we introduce a novel, fast algorithm for identifying high-confidence (rigid) substructures within graphs that result from chromosome conformation capture experiments. We apply the method to four recent chromosome conformation capture data sets and find that for even stringently filtered experimental constraints, a large rigid region spans most of the genome. We find that the organization of rigid components depends crucially on short-range interactions within the genome. We also find that rigid component boundaries appear at regions associated with areas of low nucleosome density and that properties of rigid, subtelomeric regions are consistent with light microscopy data. Availability: The software for identifying rigid components is GPL-Licensed and available for download at http://www.cbcb.umd.edu/kingsford-group/starfish. Contact: carlk@cs.umd.edu