Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 9 of 9
  • ANALYTICAL APPROACHES FOR COMPLEX MULTI-BATCH -OMICS DATASETS AND THEIR APPLICATION TO NEURONAL DEVELOPMENT
    (2023) Alexander, Theresa Ann; Speer, Colenso M; El-Sayed, Najib M; Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    High-throughput sequencing methods are extremely powerful tools to quantify gene expression in bulk tissue and individual cells. Experimental designs often aim to quantify expression shifts to characterize developmental trajectories, disease states, or cellular drug responses. Experimental and genetic methods are also rapidly evolving to capture specific aspects of gene expression, such as targeting individual cell types, regulatory stages, and spatially resolved cell subcompartments. These studies frequently involve a variety of experimental conditions that require many samples to guarantee sufficient statistical power for subsequent analyses, and they are frequently processed in multiple batches due to limitations on the number of samples that can be collected, processed, and sequenced at once. To eliminate erroneous results in subsequent analyses, it is necessary to deconvolve non-biological variation (batch effects) from biological signal. Here, we explored variational contributions in multi-batch high-throughput sequencing experiments by developing new methods, evaluating contributors to heterogeneity in an axon-TRAP-RiboTag protocol case study, and highlighting biological results from this protocol. First, we describe iDA, a novel dimensionality reduction method for high-throughput sequencing data. High-dimensional data in complex, multi-batch experiments often result in discrete clustering of samples or cells. Existing unsupervised linear dimensionality reduction methods like PCA often do not resolve this discreteness simply with projections of maximum variance. We show that iDA can produce better projections for separating discrete clusters that correlate with known experimental biological and batch factors. Second, we provide a case study of special considerations for a complex, multi-batch high-throughput experiment. We investigated the multi-faceted contributions to heterogeneity in a study using the axon-TRAP-RiboTag translatomic isolation protocol in a neuronal cell type. We show that popular batch-correction methods may reduce signal arising from true biological heterogeneity in addition to technical noise. We offer metrics to help identify batch differences driven by biological signal. Lastly, we employ our understanding of variational contributions in the intrinsically photosensitive retinal ganglion cell (ipRGC) -omics case study to explore coordination between the transcriptome and translatome. Our analysis revealed that ipRGCs participate in subcompartment-specific local protein translation. Genetic perturbations of photopigment-driven neuronal activity led to global tissue transcriptomic shifts in both the retina and brain targets, but the ipRGC axonal-specific translatome was unaltered.
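    The contrast drawn above between variance-maximizing projections and projections that resolve discrete clusters can be illustrated with off-the-shelf tools. The sketch below is not iDA itself; scikit-learn's PCA and LinearDiscriminantAnalysis stand in to show why the axis of maximum variance need not be the axis that separates known groups.
    ```python
    # Minimal sketch: variance-maximizing vs. discriminability-oriented projections.
    # iDA is not reproduced here; PCA and LDA from scikit-learn are stand-ins used
    # to illustrate why maximum-variance axes need not separate discrete clusters.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Two simulated groups: large spread on axis 0 (uninformative), small offset
    # on axis 1 (the axis that actually distinguishes the groups).
    n = 200
    group_a = np.column_stack([rng.normal(0, 10, n), rng.normal(0.0, 1, n)])
    group_b = np.column_stack([rng.normal(0, 10, n), rng.normal(4.0, 1, n)])
    X = np.vstack([group_a, group_b])
    labels = np.array([0] * n + [1] * n)

    pca_proj = PCA(n_components=1).fit_transform(X).ravel()
    lda_proj = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, labels).ravel()

    def mixing(proj):
        """Fraction of group-A points falling inside group-B's projected range."""
        lo, hi = proj[labels == 1].min(), proj[labels == 1].max()
        return float(np.mean((proj[labels == 0] >= lo) & (proj[labels == 0] <= hi)))

    print("PCA mixing:", round(mixing(pca_proj), 2))   # high: clusters overlap
    print("LDA mixing:", round(mixing(lda_proj), 2))   # near zero: clusters resolved
    ```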
  • Diet and Stomach Microbiota of Gulf Menhaden, a key forage filter feeding fish species
    (2020) Hanif, Ammar Wali; Jagus, Rosemary; Marine-Estuarine-Environmental Sciences; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Menhaden represent a family of important filter-feeding forage fish that serve as a trophic link between plankton and piscivorous predators in the marine environment. Dietary analysis is difficult because diet items are small and >80% of the stomach content is amorphous material. DNA metabarcoding combines mass-amplification of short DNA sequences (barcodes) with high-throughput sequencing. This approach allows the identification of many taxa within the same environmental sample, as well as the analysis of many samples simultaneously, providing a comprehensive assessment of diet items and gut microbiota. Here we present a methodological approach using DNA metabarcoding suitable for a small filter-feeding fish to identify the stomach contents of juvenile Gulf menhaden (Brevoortia patronus) collected within Apalachicola Bay, Florida. I describe the optimization of DNA extraction, comparison of two primers and sequencing protocols, estimation of menhaden DNA contamination, quality filtering of sequences, post-sequence processing, and taxonomic identification of recovered sequences. I characterized the prokaryotic community by sequencing the V3-V4 hypervariable regions of the 16S ribosomal RNA (rRNA) gene, using two sequencing protocols employing different "universal" 16S rRNA gene sequencing primers. Although no difference in overall operational taxonomic units (OTUs) was found, the two sequencing protocols gave differences in the relative abundances of several bacterial classes. The dominant OTUs resulting from 16S rRNA gene sequencing at the phylum level were assigned to Proteobacteria, Acidobacteria, Actinobacteria, and Chloroflexi and included oil-eating bacteria consistent with the Gulf of Mexico location. Stomach microbiota and diet were compared in juvenile Gulf menhaden caught at two locations, Two Mile Channel and St. Vincent Sound, in Apalachicola Bay, FL in May and July of 2013. The stomach microbiota of samples from both locations showed a predominance of Proteobacteria, Chloroflexi, Bacteroidetes, Acidobacteria, and Actinobacteria, although significant differences in composition at the class level were seen. The stomach microbiota from fish from Two Mile Channel showed a higher level of taxonomic richness, and there was a strong association between the microbiota and sampling location, correlating with differences in salinity. Approximately 1050 diet items were identified, although significant differences in the species represented were found in samples from the two locations. Members of the Stramenopile/Alveolate/Rhizaria (SAR) clade accounted for 66% representation in samples from Two Mile Channel, dominated by the diatoms Cyclotella and Skeletonema, as well as the ciliate Oligotrichia. In contrast, Metazoa (zooplankton), mainly Acartia copepods, dominated in samples from St. Vincent Sound, accounting for over 80% of the reads. Since ciliates are considered microzooplankton, this means there is just over 60% representation of phytoplankton in samples from Two Mile Channel and over 90% representation of zooplankton in samples from St. Vincent Sound. Overall, I demonstrate the diversity of juvenile menhaden stomach contents, which supports a characterization of menhaden as environmental samplers.
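    The class-level comparisons described above reduce to collapsing an OTU count table by taxonomy and normalizing per sample. The sketch below illustrates only that bookkeeping step with made-up counts and placeholder class labels (pandas assumed); it is not the study's actual processing pipeline.
    ```python
    # Minimal sketch: collapse OTU counts to class-level relative abundances per
    # sample. Counts and taxonomy labels are illustrative placeholders only.
    import pandas as pd

    # OTU counts per sample (rows: OTUs, columns: samples from two locations).
    counts = pd.DataFrame(
        {
            "TwoMile_1": [120, 45, 30, 5],
            "TwoMile_2": [100, 50, 25, 10],
            "StVincent_1": [20, 10, 90, 60],
            "StVincent_2": [25, 15, 85, 70],
        },
        index=["OTU_1", "OTU_2", "OTU_3", "OTU_4"],
    )

    # Class-level taxonomic assignment of each OTU (placeholder labels).
    taxonomy = pd.Series(
        {
            "OTU_1": "Gammaproteobacteria",
            "OTU_2": "Alphaproteobacteria",
            "OTU_3": "Flavobacteriia",
            "OTU_4": "Actinobacteria",
        },
        name="class",
    )

    # Sum counts within each class, then convert each sample's counts to proportions.
    class_counts = counts.groupby(taxonomy).sum()
    relative_abundance = class_counts.div(class_counts.sum(axis=0), axis=1)
    print(relative_abundance.round(3))
    ```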
  • DEVELOPMENT AND OPTIMIZATION OF TOOLS FOR CO-EXPRESSION NETWORK ANALYSES OF HOST-PATHOGEN SYSTEMS
    (2017) Hughitt, Vincent Keith; El-Sayed, Najib M; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    High-throughput transcriptomics has provided a powerful new approach for studying host-pathogen interactions. While popular techniques such as differential expression and gene set enrichment analysis can yield informative results, they do not always make full use of the information available in multi-condition experiments. Co-expression networks provide a novel way of analyzing these datasets that can lead to discoveries not readily detectable using the more popular approaches. While significant work has been done in recent years on the construction of co-expression networks, less is known about how to measure the quality of such networks. Here, I describe an approach for evaluating the quality of a co-expression network based on enrichment of biological function across the network. The approach is used to measure the influence of various data transformations and algorithmic parameters on the resulting network quality, leading to several unexpected findings regarding commonly used techniques, as well as to the development of a novel similarity metric used to assess the degree of co-expression between two genes. Next, I describe a simple approach for aggregating information across multiple network parameterizations in order to arrive at a robust "consensus" co-expression network. This approach is used to generate independent host and parasite networks for two host-trypanosomatid transcriptomics datasets, resulting in the detection of both previously known disease pathways and novel gene networks potentially related to infection. Finally, a differential network analysis approach is developed and used to explore the impact of infection on the host co-expression network, and to elucidate shared transcriptional signatures of infection by different intracellular pathogens. The approaches developed in this work provide a powerful set of tools and techniques for the rigorous generation and evaluation of co-expression networks, and have significant implications for co-expression network-based research. The application of these approaches to several host-pathogen systems demonstrates their utility for host-pathogen transcriptomics research and has resulted in the creation of a number of valuable resources for understanding systems-level processes that occur during infection.
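    The "consensus" idea described above can be sketched simply: build a thresholded co-expression network under several parameterizations and keep only the edges that recur in most of them. The expression matrix, correlation thresholds, and majority rule below are illustrative assumptions, not the dissertation's actual parameters.
    ```python
    # Minimal sketch of a consensus co-expression network: an edge survives only
    # if it appears under a majority of correlation-threshold parameterizations.
    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n_genes, n_samples = 30, 12
    expression = rng.normal(size=(n_genes, n_samples))
    expression[:5] += rng.normal(size=n_samples)       # a small co-expressed module

    corr = np.corrcoef(expression)                     # gene-by-gene Pearson correlation

    def edges_at_threshold(corr_matrix, threshold):
        """Gene pairs whose absolute correlation meets or exceeds the threshold."""
        return {
            (i, j)
            for i, j in itertools.combinations(range(corr_matrix.shape[0]), 2)
            if abs(corr_matrix[i, j]) >= threshold
        }

    thresholds = [0.6, 0.7, 0.8]                       # several parameterizations
    edge_sets = [edges_at_threshold(corr, t) for t in thresholds]

    # Consensus: keep edges present in a majority of the parameterizations.
    candidates = set().union(*edge_sets)
    consensus = {e for e in candidates if sum(e in s for s in edge_sets) >= 2}
    print(f"{len(consensus)} consensus edges out of {len(candidates)} candidates")
    ```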
  • Dinoflagellate Genomic Organization and Phylogenetic Marker Discovery Utilizing Deep Sequencing Data
    (2016) Mendez, Gregory Scott; Delwiche, Charles F; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Dinoflagellates possess large genomes in which most genes are present in many copies. This has made studies of their genomic organization and phylogenetics challenging. Recent advances in sequencing technology have made deep sequencing of dinoflagellate transcriptomes feasible. This dissertation investigates the genomic organization of dinoflagellates to better understand the challenges of assembling dinoflagellate transcriptomic and genomic data from short-read sequencing methods, and develops new techniques that utilize deep sequencing data to identify orthologous genes across a diverse set of taxa. To better understand the genomic organization of dinoflagellates, a genomic cosmid clone of the tandemly repeated gene Alcohol Dehydrogenase (AHD) was sequenced and analyzed. The organization of this clone was found to be counter to prevailing hypotheses of genomic organization in dinoflagellates. Further, a new non-canonical splicing motif was described that could greatly improve the automated modeling and annotation of genomic data. A custom phylogenetic marker discovery pipeline, incorporating methods that leverage the statistical power of large data sets, was written. A case study on Stramenopiles was undertaken to test its utility in resolving relationships between known groups as well as the phylogenetic affinity of seven unknown taxa. The pipeline generated a set of 373 genes useful as phylogenetic markers that successfully resolved relationships among the major groups of Stramenopiles and placed all unknown taxa on the tree with strong bootstrap support. This pipeline was then used to discover 668 genes useful as phylogenetic markers in dinoflagellates. Phylogenetic analysis of 58 dinoflagellates, using this set of markers, produced a phylogeny with good support for all branches. The Suessiales were found to be sister to the Peridinales. The Prorocentrales formed a monophyletic group with the Dinophysiales that was sister to the Gonyaulacales. The Gymnodinales was found to be paraphyletic, forming three monophyletic groups. While this pipeline was used to find phylogenetic markers, it will likely also be useful for finding orthologs of interest for other purposes, for the discovery of horizontally transferred genes, and for the separation of sequences in metagenomic data sets.
  • Network Algorithms for Complex Systems with Applications to Non-linear Oscillators and Genome Assembly
    (2013) Schmitt, Karl Robert Bruce; Girvan, Michelle; Zimin, Aleksey; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Network and complex system models are useful for studying a wide range of phenomena, from disease spread to traffic flow. Because of the broad applicability of the framework, it is important to develop effective simulations and algorithms for complex networks. This dissertation presents contributions to two applied problems in this area. First, we study an electro-optical, nonlinear, and time-delayed feedback loop commonly used in applications that require a broad range of chaotic behavior. For this system we detail a discrete-time simulation model, exploring the model's synchronization behavior under specific coupling conditions. Expanding upon already published results that investigated changes in feedback strength, we explore how both time-delay and nonlinear sensitivity impact synchronization. We also relax the requirement of strictly identical system components to study how synchronization regions are affected when coupled systems have non-identical components (parameters). Last, we allow wider variance in coupling strengths, including strengths unique to each system, to identify a rich synchronization region not previously seen. In our second application, we take a complex-networks approach to improving genome assembly algorithms. One key part of sequencing a genome is solving the orientation problem: finding the relative orientations of the data fragments generated during sequencing. By viewing the genomic data as a network, we can apply standard analysis techniques for community finding and utilize the significantly modular structure of the data. This structure informs the development and application of two new heuristics, based on (A) genetic algorithms and (B) hierarchical clustering, for solving the orientation problem. Genetic algorithms allow us to preserve some internal structure while quickly exploring a large solution space. We present studies using a multi-scale genetic algorithm to solve the orientation problem, and show that this approach can be used in conjunction with currently used methods to identify a better solution. Our hierarchical algorithm further utilizes the modular structure of the data. By progressively solving and merging sub-problems, we pick optimal local solutions while allowing more global corrections to occur later. Our results show significant improvements over current techniques for both generated data and real assembly data.
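    The orientation problem lends itself to a signed-graph formulation: each fragment is a node, and each pairwise constraint says whether two fragments should share an orientation or be flipped relative to one another. The sketch below is a plain greedy flip heuristic under that formulation; it is a simplification for illustration, not the multi-scale genetic algorithm or hierarchical method developed in the dissertation.
    ```python
    # Minimal sketch of the orientation problem on a signed graph: an edge sign of
    # +1 means "same orientation", -1 means "opposite orientation". A greedy
    # single-node flip heuristic is shown; the thesis develops genetic-algorithm
    # and hierarchical approaches instead.
    def satisfied(orientations, edges):
        """Count constraints satisfied by a {node: +1/-1} orientation assignment."""
        return sum(1 for u, v, sign in edges if orientations[u] * orientations[v] == sign)

    def greedy_orient(n_nodes, edges, max_passes=10):
        """Flip single nodes whenever a flip increases the number of satisfied edges."""
        orientations = {node: 1 for node in range(n_nodes)}
        for _ in range(max_passes):
            improved = False
            for node in range(n_nodes):
                before = satisfied(orientations, edges)
                orientations[node] *= -1
                if satisfied(orientations, edges) <= before:
                    orientations[node] *= -1    # no gain: undo the flip
                else:
                    improved = True
            if not improved:
                break
        return orientations

    # Toy instance: fragments 1 and 3 should be flipped relative to fragments 0 and 2.
    edges = [(0, 1, -1), (1, 2, -1), (0, 2, 1), (2, 3, -1)]
    result = greedy_orient(4, edges)
    print(result, "satisfies", satisfied(result, edges), "of", len(edges), "constraints")
    ```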
  • Computational methods to improve genome assembly and gene prediction
    (2011) Kelley, David Roy; Salzberg, Steven L; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    DNA sequencing is used to read the nucleotides composing the genetic material that forms individual organisms. As 2nd generation sequencing technologies offering high throughput at a feasible cost have matured, sequencing has permeated nearly all areas of biological research. By a combination of large-scale projects led by consortiums and smaller endeavors led by individual labs, the flood of sequencing data will continue, which should provide major insights into how genomes produce physical characteristics, including disease, and evolve. To realize this potential, computer science is required to develop the bioinformatics pipelines to efficiently and accurately process and analyze the data from large and noisy datasets. Here, I focus on two crucial bioinformatics applications: the assembly of a genome from sequencing reads and protein-coding gene prediction. In genome assembly, we form large contiguous genomic sequences from the short sequence fragments generated by current machines. Starting from the raw sequences, we developed software called Quake that corrects sequencing errors more accurately than previous programs by using coverage of k-mers and probabilistic modeling of sequencing errors. My experiments show correcting errors with Quake improves genome assembly and leads to the detection of more polymorphisms in re-sequencing studies. For post-assembly analysis, we designed a method to detect a particular type of mis-assembly where the two copies of each chromosome in diploid genomes diverge. We found thousands of examples in each of the chimpanzee, cow, and chicken public genome assemblies that created false segmental duplications. Shotgun sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to both discover unknown microbes and explore complex environments. We developed software called Scimm that clusters metagenomic sequences based on composition in an unsupervised fashion more accurately than previous approaches. Finally, we extended an approach for predicting protein-coding genes on whole genomes to metagenomic sequences by adding new discriminative features and augmenting the task with taxonomic classification and clustering of the sequences. The program, called Glimmer-MG, predicts genes more accurately than all previous methods. By adding a model for sequencing errors that also allows the program to predict insertions and deletions, accuracy significantly improves on error-prone sequences.
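    The error-correction idea in Quake rests on k-mer coverage: k-mers observed many times across reads are trusted, while rare k-mers likely contain a sequencing error. The sketch below shows only this counting-and-thresholding step with a fixed toy cutoff; Quake itself fits a probabilistic model to choose the cutoff and then searches for the most likely base corrections.
    ```python
    # Minimal sketch of the k-mer coverage idea behind Quake-style error correction:
    # count k-mers across all reads, call those above a coverage cutoff "trusted",
    # and flag reads containing untrusted k-mers as correction candidates.
    from collections import Counter

    def kmers(read, k):
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    def trusted_kmers(reads, k, cutoff):
        """k-mers whose coverage across all reads meets the (toy) cutoff."""
        counts = Counter(km for read in reads for km in kmers(read, k))
        return {km for km, c in counts.items() if c >= cutoff}

    def flag_suspect_reads(reads, k, cutoff):
        """Reads containing at least one untrusted k-mer (a likely sequencing error)."""
        trusted = trusted_kmers(reads, k, cutoff)
        return [read for read in reads if any(km not in trusted for km in kmers(read, k))]

    # Toy data: the same region sampled several times, plus one read with an error.
    reads = ["ACGTACGTAC", "ACGTACGTAC", "CGTACGTACG", "ACGTACCTAC"]
    print(flag_suspect_reads(reads, k=4, cutoff=2))    # flags the read with the error
    ```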
  • Highly Scalable Short Read Alignment with the Burrows-Wheeler Transform and Cloud Computing
    (2009) Langmead, Benjamin Thomas; Salzberg, Steven L; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Improvements in DNA sequencing have both broadened its utility and dramatically increased the size of sequencing datasets. Sequencing instruments are now used regularly as sources of high-resolution evidence for genotyping, methylation profiling, DNA-protein interaction mapping, and characterizing gene expression in the human genome and in other species. With existing methods, the computational cost of aligning short reads from the Illumina instrument to a mammalian genome can be very large: on the order of many CPU months for one human genotyping project. This thesis presents a novel application of the Burrows-Wheeler Transform that enables the alignment of short DNA sequences to mammalian genomes at a rate much faster than existing hashtable-based methods. The thesis also presents an extension of the technique that exploits the scalability of Cloud Computing to perform the equivalent of one human genotyping project in hours.
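    The core technique, exact matching against a Burrows-Wheeler-transformed reference with FM-index backward search, can be sketched compactly. The code below builds the BWT by naively sorting all rotations, which is fine for a toy string but not how a genome-scale index is constructed, and then counts exact occurrences of a query.
    ```python
    # Minimal sketch of BWT-based exact matching (the FM-index idea used by
    # Burrows-Wheeler aligners). The transform is built naively by sorting all
    # rotations, which is only practical for toy-sized inputs.
    def bwt(text):
        text += "$"                                    # unique, smallest sentinel
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def fm_index(bwt_string):
        """Precompute first-column offsets (C) and cumulative rank tables (Occ)."""
        c = {ch: sum(1 for x in bwt_string if x < ch) for ch in set(bwt_string)}
        occ = {ch: [0] * (len(bwt_string) + 1) for ch in set(bwt_string)}
        for i, ch in enumerate(bwt_string):
            for key in occ:
                occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
        return c, occ

    def count_occurrences(bwt_string, query):
        """Backward search: number of exact occurrences of query in the text."""
        c, occ = fm_index(bwt_string)
        lo, hi = 0, len(bwt_string)                    # current suffix-array interval
        for ch in reversed(query):
            if ch not in c:
                return 0
            lo = c[ch] + occ[ch][lo]
            hi = c[ch] + occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo

    genome = "GATTACAGATTACA"
    print(count_occurrences(bwt(genome), "GATTA"))     # 2 exact occurrences
    ```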
  • FEATURE GENERATION AND ANALYSIS APPLIED TO SEQUENCE CLASSIFICATION FOR SPLICE-SITE PREDICTION
    (2007-11-27) Islamaj, Rezarta; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Sequence classification is an important problem in many real-world applications. Sequence data often contain no explicit "signals," or features, to enable the construction of classification algorithms. Extracting and interpreting the most useful features is challenging, and hand construction of good features is the basis of many classification algorithms. In this thesis, I address this problem by developing a feature-generation algorithm (FGA). FGA is a scalable method for automatic feature generation for sequences; it identifies sequence components and uses domain knowledge, systematically constructs features, explores the space of possible features, and identifies the most useful ones. In the domain of biological sequences, splice-sites are locations in DNA sequences that signal the boundaries between genetic information and intervening non-coding regions. Only when splice-sites are identified with nucleotide precision can the genetic information be translated to produce functional proteins. In this thesis, I address this fundamental process by developing a highly accurate splice-site prediction model that employs our sequence feature-generation framework. The FGA model shows statistically significant improvements over state-of-the-art splice-site prediction methods. So that biologists can understand and interpret the features FGA constructs, I developed SplicePort, a web-based tool for splice-site prediction and analysis. With SplicePort the user can explore the relevant features for splicing, and can obtain splice-site predictions for the sequences based on these features. For an experimental biologist trying to identify the critical sequence elements of splicing, SplicePort offers flexibility and a rich motif exploration functionality, which may help to significantly reduce the amount of experimentation needed. In this thesis, I present examples of the observed feature groups and describe efforts to detect biological signals that may be important for the splicing process. Naturally, FGA can be generalized to other biologically inspired classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, as well as other sequence classification problems, provided we have sufficient knowledge of the new domain.
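    As a toy illustration of generating many candidate sequence features and keeping the most useful ones, the sketch below enumerates position-specific k-mer features from fixed-length windows around a candidate site and ranks them by a simple class-frequency difference. The scoring rule and the toy sequences are stand-ins; FGA's feature construction, exploration, and selection are considerably richer.
    ```python
    # Minimal sketch: generate position-specific k-mer features from sequence
    # windows and rank them by how differently they occur in positive vs. negative
    # examples. The frequency-difference score is a stand-in for FGA's selection.
    from collections import Counter

    def positional_kmer_features(sequence, k):
        """(position, k-mer) features for one fixed-length sequence window."""
        return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

    def rank_features(positives, negatives, k, top=5):
        """Rank features by |frequency in positives - frequency in negatives|."""
        pos = Counter(f for s in positives for f in positional_kmer_features(s, k))
        neg = Counter(f for s in negatives for f in positional_kmer_features(s, k))
        features = set(pos) | set(neg)
        return sorted(
            features,
            key=lambda f: abs(pos[f] / len(positives) - neg[f] / len(negatives)),
            reverse=True,
        )[:top]

    # Toy windows around a candidate donor splice site (real data would be used).
    positives = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGC"]   # contain the canonical GT
    negatives = ["AAGCTACGT", "CAGATGAGT", "TAGCAAAGC"]
    print(rank_features(positives, negatives, k=2))       # highest-scoring features
    ```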
  • COMPUTATIONAL ANALYSES OF MICROBIAL GENOMES - OPERONS, PROTEIN FAMILIES AND LATERAL GENE TRANSFER
    (2005-05-15) Yan, Yongpan; Moult, John; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    As a result of recent successes in genome-scale studies, especially genome sequencing, large amounts of new biological data are now available. This naturally challenges the computational world to develop more powerful and precise analysis tools. In this work, three computational studies have been conducted utilizing complete microbial genome sequences: the detection of operons, the composition of protein families, and the detection of lateral gene transfer events. In the first study, two computational methods, termed the Gene Neighbor Method (GNM) and the Gene Gap Method (GGM), were developed for the detection of operons in microbial genomes. GNM utilizes the relatively high conservation of the order of genes in operons, compared with genes in general. GGM makes use of the relatively short gap between genes in operons compared with that otherwise found between adjacent genes. The two methods were benchmarked using biological pathway data and documented operon data. Operons were predicted for 42 microbial genomes. The predictions are used to infer possible functions for some hypothetical genes in prokaryotic genomes and have proven a useful adjunct to structure information in deriving protein function in our structural genomics project. In the second study, we developed an automated clustering procedure to classify protein sequences in a set of microbial genomes into protein families. Benchmarking shows the clustering method is sensitive at detecting remote family members and has a low level of false positives. The aim of constructing this comprehensive protein family set is to address several questions key to structural genomics. First, our study indicates that approximately 20% of known families with three or more members currently have a representative structure. Second, the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes are sequenced. However, the vast majority of these families will be small. Third, it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families. The third study is the detection of lateral gene transfer events in microbial genomes. Two new high-throughput methods have been developed and applied to a set of 66 fully sequenced genomes. Both make use of a protein family framework. In the High Apparent Gene Loss (HAGL) method, the number and nature of gene loss events implied by classical evolutionary descent is analyzed. The higher the number of apparent losses, and the smaller the evolutionary distance over which they must have occurred, the more likely it is that one or more genes have been transferred into the family. The Evolutionary Rate Anomaly (ERA) method associates transfer events with proteins that appear to have an anomalously low rate of sequence change compared with the rest of that protein family. The methods are complementary in that the HAGL method works best with small families and the ERA method best with larger ones. The methods have been parameterized against each other such that they have high specificity (less than 10% false positives) and can detect about half of the test events. Application to the full set of genomes shows widely varying amounts of lateral gene transfer.
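    Of the two operon detectors, the Gene Gap Method is the simpler to illustrate: consecutive genes on the same strand separated by a sufficiently short intergenic gap are merged into one predicted operon. The sketch below implements just that rule with a made-up 50 bp cutoff; the thesis benchmarks the cutoff against pathway and documented-operon data, and the Gene Neighbor Method (conserved gene order across genomes) is not shown.
    ```python
    # Minimal sketch of the Gene Gap Method idea: consecutive same-strand genes
    # with a short intergenic gap are merged into one predicted operon. The 50 bp
    # cutoff and gene coordinates below are illustrative placeholders.
    def predict_operons(genes, max_gap=50):
        """genes: list of (name, start, end, strand) tuples sorted by start."""
        operons, current = [], [genes[0]]
        for gene in genes[1:]:
            prev = current[-1]
            gap = gene[1] - prev[2]
            if gene[3] == prev[3] and gap <= max_gap:
                current.append(gene)           # short gap, same strand: extend operon
            else:
                operons.append(current)        # otherwise close the current operon
                current = [gene]
        operons.append(current)
        return [[g[0] for g in op] for op in operons]

    # Toy gene coordinates (name, start, end, strand), sorted by start position.
    genes = [
        ("geneA", 100, 400, "+"),
        ("geneB", 430, 900, "+"),     # 30 bp gap, same strand -> joins geneA's operon
        ("geneC", 1200, 1600, "+"),   # 300 bp gap -> starts a new transcription unit
        ("geneD", 1620, 2000, "-"),   # short gap but opposite strand -> new unit
    ]
    print(predict_operons(genes))     # [['geneA', 'geneB'], ['geneC'], ['geneD']]
    ```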