Cell Biology & Molecular Genetics Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2750

Search Results

Now showing 1 - 3 of 3
  • DEVELOPMENT AND OPTIMIZATION OF TOOLS FOR CO-EXPRESSION NETWORK ANALYSES OF HOST-PATHOGEN SYSTEMS
    (2017) Hughitt, Vincent Keith; El-Sayed, Najib M; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    High-throughput transcriptomics has provided a powerful new approach for studying host-pathogen interactions. While popular techniques such as differential expression and gene set enrichment analysis can yield informative results, they do not always make full use of information available in multi-condition experiments. Co-expression networks provide a novel way of analyzing these datasets that can lead to new discoveries that are not readily detectable using the more popular approaches. While significant work has been done in recent years on the construction of co-expression networks, less is known about how to measure the quality of such networks. Here, I describe an approach for evaluating the quality of a co-expression network, based on enrichment of biological function across the network. The approach is used to measure the influence of various data transformations and algorithmic parameters on the resulting network quality, leading to several unexpected findings regarding commonly used techniques, as well as to the development of a novel similarity metric used to assess the degree of co-expression between two genes. Next, I describe a simple approach for aggregating information across multiple network parameterizations, in order to arrive at a robust “consensus” co-expression network. This approach is used to generate independent host and parasite networks for two host-trypanosomatid transcriptomics datasets, resulting in the detection of both previously known disease pathways and novel gene networks potentially related to infection. Finally, a differential network analysis approach is developed and used to explore the impact of infection on the host co-expression network, and to elucidate shared transcriptional signatures of infection by different intracellular pathogens. The approaches developed in this work provide a powerful set of tools and techniques for the rigorous generation and evaluation of co-expression networks, and have significant implications for co-expression network-based research. The application of these approaches to several host-pathogen systems demonstrates their utility for host-pathogen transcriptomics research, and has resulted in the creation of a number of valuable resources for understanding systems-level processes that occur during infection.
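The idea of aggregating several network parameterizations into a "consensus" network can be illustrated with a minimal sketch. The abstract does not state how the aggregation is done, so everything below is an assumption made for illustration: the adjacency measure (absolute Pearson correlation raised to a power), the set of powers, the element-wise mean used to combine networks, and the function names `correlation_adjacency` and `consensus_network`.

```python
import numpy as np


def correlation_adjacency(expr, power):
    """Build one network: |Pearson correlation| between gene rows, raised to `power`.

    Assumption: this adjacency measure is illustrative, not the thesis's metric.
    """
    corr = np.corrcoef(expr)            # genes x genes correlation matrix
    adjacency = np.abs(corr) ** power   # larger powers suppress weak edges
    np.fill_diagonal(adjacency, 0.0)    # ignore self-edges
    return adjacency


def consensus_network(expr, powers=(1, 2, 4, 6)):
    """Average the adjacency matrices built under several parameterizations.

    Assumption: a plain element-wise mean stands in for the aggregation step.
    """
    networks = [correlation_adjacency(expr, p) for p in powers]
    return np.mean(networks, axis=0)


# Toy expression matrix: rows are genes, columns are samples.
# Genes 0 and 1 follow the same trend; gene 2 does not.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.1, 2.9, 4.2, 4.8, 6.1],
    [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
])

consensus = consensus_network(expr)
```

In this toy case the edge between genes 0 and 1 stays strong under every parameterization, so it survives averaging, while the weak edge to gene 2 is pushed further toward zero.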
  • Dinoflagellate Genomic Organization and Phylogenetic Marker Discovery Utilizing Deep Sequencing Data
    (2016) Mendez, Gregory Scott; Delwiche, Charles F; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Dinoflagellates possess large genomes in which most genes are present in many copies. This has made studies of their genomic organization and phylogenetics challenging. Recent advances in sequencing technology have made deep sequencing of dinoflagellate transcriptomes feasible. This dissertation investigates the genomic organization of dinoflagellates to better understand the challenges of assembling dinoflagellate transcriptomic and genomic data from short read sequencing methods, and develops new techniques that utilize deep sequencing data to identify orthologous genes across a diverse set of taxa. To better understand the genomic organization of dinoflagellates, a genomic cosmid clone of the tandemly repeated gene Alcohol Dehydrogenase (AHD) was sequenced and analyzed. The organization of this clone was found to be counter to prevailing hypotheses of genomic organization in dinoflagellates. Further, a new non-canonical splicing motif was described that could greatly improve the automated modeling and annotation of genomic data. A custom phylogenetic marker discovery pipeline, incorporating methods that leverage the statistical power of large data sets, was written. A case study on Stramenopiles was undertaken to test the pipeline's utility in resolving relationships between known groups as well as the phylogenetic affinity of seven unknown taxa. The pipeline generated a set of 373 genes useful as phylogenetic markers that successfully resolved relationships among the major groups of Stramenopiles, and placed all unknown taxa on the tree with strong bootstrap support. This pipeline was then used to discover 668 genes useful as phylogenetic markers in dinoflagellates. Phylogenetic analysis of 58 dinoflagellates, using this set of markers, produced a phylogeny with good support for all branches. The Suessiales were found to be sister to the Peridinales. The Prorocentrales formed a monophyletic group with the Dinophysiales that was sister to the Gonyaulacales. The Gymnodinales was found to be paraphyletic, forming three monophyletic groups. While this pipeline was used to find phylogenetic markers, it will likely also be useful for finding orthologs of interest for other purposes, for the discovery of horizontally transferred genes, and for the separation of sequences in metagenomic data sets.
  • COMPUTATIONAL ANALYSES OF MICROBIAL GENOMES - OPERONS, PROTEIN FAMILIES AND LATERAL GENE TRANSFER
    (2005-05-15) Yan, Yongpan; Moult, John; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    As a result of recent successes in genome-scale studies, especially genome sequencing, large amounts of new biological data are now available. This naturally challenges the computational world to develop more powerful and precise analysis tools. In this work, three computational studies have been conducted, utilizing complete microbial genome sequences: the detection of operons, the composition of protein families, and the detection of lateral gene transfer events. In the first study, two computational methods, termed the Gene Neighbor Method (GNM) and the Gene Gap Method (GGM), were developed for the detection of operons in microbial genomes. GNM utilizes the relatively high conservation of order of genes in operons, compared with genes in general. GGM makes use of the relatively short gap between genes in operons compared with that otherwise found between adjacent genes. The two methods were benchmarked using biological pathway data and documented operon data. Operons were predicted for 42 microbial genomes. The predictions are used to infer possible functions for some hypothetical genes in prokaryotic genomes and have proven a useful adjunct to structure information in deriving protein function in our structural genomics project. In the second study, we have developed an automated clustering procedure to classify protein sequences in a set of microbial genomes into protein families. Benchmarking shows the clustering method is sensitive at detecting remote family members, and has a low level of false positives. The aim of constructing this comprehensive protein family set is to address several questions key to structural genomics. First, our study indicates that approximately 20% of known families with three or more members currently have a representative structure. Second, the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes are sequenced. However, the vast majority of these families will be small. Third, it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families. The third study is the detection of lateral gene transfer events in microbial genomes. Two new high-throughput methods have been developed, and applied to a set of 66 fully sequenced genomes. Both make use of a protein family framework. In the High Apparent Gene Loss (HAGL) method, the number and nature of gene loss events implied by classical evolutionary descent are analyzed. The higher the number of apparent losses, and the smaller the evolutionary distance over which they must have occurred, the more likely that one or more genes have been transferred into the family. The Evolutionary Rate Anomaly (ERA) method associates transfer events with proteins that appear to have an anomalously low rate of sequence change compared with the rest of that protein family. The methods are complementary in that the HAGL method works best with small families and the ERA method best with larger ones. The methods have been parameterized against each other, such that they have high specificity (less than 10% false positives) and can detect about half of the test events. Application to the full set of genomes shows widely varying amounts of lateral gene transfer.
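The intuition behind the Gene Gap Method described above (genes in an operon tend to be separated by short gaps) can be sketched as a simple grouping rule. This is not the thesis's implementation: the `Gene` record, the `predict_operons` function, the 50 bp `max_gap` threshold, and the same-strand requirement are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes whose intergenic gap is <= max_gap.

    Assumption: max_gap=50 is a hypothetical threshold, not a value from the thesis.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    operons = []
    current = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # bases between the two genes
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)         # short gap: extend the operon
                continue
            operons.append(current)          # long gap or strand switch: close it
        current = [gene]
    if current:
        operons.append(current)
    return [[g.name for g in operon] for operon in operons]


# Toy genome: a and b are close together; c is far away; d is on the other strand.
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 420, 900, "+"),
    Gene("c", 1500, 2000, "+"),
    Gene("d", 2010, 2500, "-"),
]
operons = predict_operons(genes)
```

Here `a` and `b` are grouped because only 19 bases separate them, `c` starts a new group after a 599-base gap, and `d` is kept apart despite its short gap because it lies on the opposite strand.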