Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
3 results
Search Results
Item Applications of Graph Segmentation Algorithms For Quantitative Genomic Analyses (2020) Gunady, Mohamed; Bravo, Hector C; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
There is a growing interest in utilizing graph formulations and graph-based algorithms in different subproblems of genomic analysis. Since graphs provide a natural and efficient representation of sequences of data where some structural relationships are observed within the data, we study some graph applications in quantitative analysis of typical RNA-seq and Whole Genome Sequencing pipelines. Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts; moreover, the annotated transcripts are often a small subset of the possible transcripts of a gene. This work describes Yanagi, a tool for segmenting transcriptomes to create a library of maximal L-disjoint segments from a complete transcriptome annotation. The segment library preserves transcriptome substrings and structural relationships between transcripts while eliminating unnecessary sequence duplications. First, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries. The resulting segment sequences can be used with pseudo-alignment tools to quantify gene expression and alternative splicing at the segment level and provide gene-level visualization of the segments for more interpretability. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses. Furthermore, we show how transcriptome quantification can be performed from segment-level statistics.
We present an EM algorithm that uses segment counts as features to estimate transcripts' relative abundances in a way that maximizes the likelihood of the observed sequenced data. Then we tackle the problem of quantification in an incomplete annotation setting. We propose an assembly-free correction procedure that reduces bias in the estimated abundances of the annotated transcripts caused by the presence of unannotated transcripts in an RNA-seq sample, while avoiding the need to assemble the missing transcripts first. Another use case of our graph segmentation approach is representing population reference genome graphs used in Whole Genome Sequencing (WGS), which can be crucial for some genomic analyses of highly polymorphic genes such as HLA. Graph-based aligners are usually slow and computationally demanding. Using segments empowers any linear aligner with the efficient graph representation of population variations, while avoiding the expensive computational overhead of aligning over graphs. Lastly, we explore the use of Generative Adversarial Networks (GANs) for imputing the sparse and noisy expression data obtained in single cell RNA sequencing (scRNA-seq) experiments. scRNA-seq provides a rich view into the heterogeneity underlying a cell population, which is usually lost when performing bulk RNA-seq. However, these datasets are usually noisy and very sparse, and a number of methods have been proposed to impute zeros in these datasets with the goal of improving downstream analysis. In this work, we propose an approach, scGAIN, to impute zero counts of dropout genes in single cell data using GANs by learning an approximation of the data distribution. The work presented here discusses an approach to adapt GAIN, a GAN model developed to impute missing values in image data, to the domain of imputing single cell data.
Experiments show that scGAIN gives competitive results compared to the state-of-the-art imputation approaches while showing superiority in various aspects on simulated and real data. Imputation by scGAIN successfully recovers the underlying clustering of cell sub-populations, provides sharp estimates around the true mean expression, thereby reducing variability in the data, and increases the correspondence with matched bulk RNA-seq experiments.
Item Characterization of Arabidopsis thaliana SR protein genes: mutations, alternative splicing, and ESE selection (2007-06-07) Edmonds, Jason Matthew; Mount, Stephen M; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
RNA processing in eukaryotes is a highly complex process requiring numerous steps and factors that can play roles in the regulation of functional protein production. SR proteins are a well-defined family of splicing factors identified by a conserved RNA Recognition Motif (RRM) and carboxyl-terminal arginine/serine (RS) repeats. SR proteins are known to bind to mRNA precursors via Exonic Splicing Enhancers (ESEs), and to recruit U2AF and the U1 snRNP to promote splicing. I have identified mutations in five Arabidopsis thaliana SR protein genes that result in altered phenotypes. Two (scl28-1 and srp31-1) result in embryonic lethal phenotypes, while three others (sc35-1, sr45-1, and srp30-1) result in viable and fertile plants with a range of phenotypes. I have also found that mutations in individual SR protein genes can affect the ability of a specific sequence to act as an ESE and hence affect splicing efficiency. Because 16 of the 20 Arabidopsis thaliana SR protein genes are themselves alternatively spliced, I have looked for cross regulation using RT-PCR analysis of isoform accumulation in alternatively spliced SR protein genes.
I found that SR proteins do, in fact, regulate the alternative splicing of gene targets and do so in both a gene-specific and a tissue-specific manner. To begin to fully understand the relationship between individual SR proteins, it is essential to know when and where they are expressed throughout development. I have studied the expression pattern of 16 of the 20 SR proteins in the roots of wild-type plants as well as sc35-1, srp30-1, and sr45-1 mutants. I have identified both spatial and temporal expression patterns for these 16 proteins relative to specific tissues that compose the root.
Item BIOINFORMATIC ANALYSIS OF THE FUNCTIONAL AND STRUCTURAL IMPLICATIONS OF ALTERNATIVE SPLICING. (2007-01-23) Melamud, Eugene; Moult, John; Molecular and Cell Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In higher eukaryotes, upon transcription of a gene, a complex set of reactions takes place to remove fragments of a sequence (introns) from the transcribed RNA. A large macro-molecular machine (the spliceosome) recognizes the ends of introns, brings the ends into close proximity, and catalyzes the splicing reaction. The selection of the location of the ends of introns (splice sites) determines the final message produced at the end of the process. In some cases, an alternative set of splice sites is chosen, and as a consequence a different message is produced. This phenomenon is known as alternative splicing. It is now realized that nearly every human gene undergoes alternative splicing, producing large variability in the types and number of transcripts produced. In this thesis, we examine the functional and structural consequences of alternative splicing on proteins, look into the mechanism of formation of complex splicing patterns, and examine the role of noise in the process.