Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput mRNA-Seq

Gibbons, Theodore Robert

dc.contributor.advisor	Delwiche, Charles F	en_US
dc.contributor.author	Gibbons, Theodore Robert	en_US
dc.contributor.department	Cell Biology & Molecular Genetics	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2016-02-06T06:44:43Z
dc.date.available	2016-02-06T06:44:43Z
dc.date.issued	2015	en_US
dc.description.abstract	Dinoflagellates are a diverse and ancient lineage of globally abundant algae that have adapted to fill a wide array of important ecological roles. Despite their importance, dinoflagellate genomes remain relatively poorly understood because of their enormous size. It is suspected that dinoflagellate genomes have expanded through rampant gene duplication, possibly using a lineage-specific mechanism that involves reinsertion of mature transcripts back into the genome, and that may rely on spliced leader trans-splicing for reactivation and processing of recycled transcripts. Draft genomes have recently been published for two extremely small endosymbiotic species. These genomes confirm expansion of nearly 10,000 gene families relative to other eukaryotes. In the more complete genome, evidence for transcript recycling based on relict spliced leader sequences was found in over 5,500 genes. Genomic efforts in larger dinoflagellates have focused instead on transcriptome sequencing, but transcriptomes assembled from short-read HTS data contain very little evidence for rampant gene duplication or for trans-splicing. I have shown that the apparent disagreement with hypotheses of ubiquitous trans-splicing and widespread gene duplication is the result of technological limitations. By leveraging the statistical power of high-throughput sequencing, I found that spliced leader suffixes as short as six nucleotides are sufficient for positive identification. I also found that isoform sequences from families of conserved paralogs are systematically collapsed during assembly, but that many of these consensus sequences can be identified using a custom SNP-calling procedure. This procedure can be combined with traditional clustering based on pairwise sequence alignment to obtain a more complete picture of gene duplication in dinoflagellates.
Efficient, automated homology detection based on pairwise sequence alignment is an equally challenging problem for which there is much room for improvement. I explored alternative metrics for scoring alignments between sequences using a popular procedure based on BLAST and Markov clustering, and showed that simplified metrics perform as well as or better than more popular alternatives. I also found that Markov clustering of protein sequences suffers from a serious false positive problem when compared against manual curation, suggesting that it is more appropriate for pre-clustering of very large data sets than as a complete clustering solution.	en_US
dc.identifier	https://doi.org/10.13016/M2D715
dc.identifier.uri	http://hdl.handle.net/1903/17312
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Bioinformatics	en_US
dc.subject.pquncontrolled	clustering	en_US
dc.subject.pquncontrolled	dinoflagellate	en_US
dc.subject.pquncontrolled	illumina	en_US
dc.subject.pquncontrolled	paralogy	en_US
dc.subject.pquncontrolled	spliced leader	en_US
dc.subject.pquncontrolled	transcriptomics	en_US
dc.title	Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput mRNA-Seq	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Name: Gibbons_umd_0117E_16756.pdf
Size: 11.03 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Cell Biology & Molecular Genetics Theses and Dissertations