Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput mRNA-Seq

dc.contributor.advisor: Delwiche, Charles F [en_US]
dc.contributor.author: Gibbons, Theodore Robert [en_US]
dc.contributor.department: Cell Biology & Molecular Genetics [en_US]
dc.contributor.publisher: Digital Repository at the University of Maryland [en_US]
dc.contributor.publisher: University of Maryland (College Park, Md.) [en_US]
dc.date.accessioned: 2016-02-06T06:44:43Z
dc.date.available: 2016-02-06T06:44:43Z
dc.date.issued: 2015 [en_US]
dc.description.abstract: Dinoflagellates are a diverse and ancient lineage of globally abundant algae that have adapted to fill a wide array of important ecological roles. Despite their importance, dinoflagellate genomes remain relatively poorly understood because of their enormous size. It is suspected that dinoflagellate genomes have expanded through rampant gene duplication, possibly using a lineage-specific mechanism that involves reinsertion of mature transcripts back into the genome, and that may rely on spliced leader trans-splicing for reactivation and processing of recycled transcripts. Draft genomes have recently been published for two extremely small endosymbiotic species. These genomes confirm expansion of nearly 10,000 gene families relative to other eukaryotes. In the more complete genome, evidence for transcript recycling based on relict spliced leader sequences was found in over 5,500 genes. Genomic efforts in larger dinoflagellates have focused instead on transcriptome sequencing, but transcriptomes assembled from short-read HTS data contain very little evidence for rampant gene duplication, or for trans-splicing. I have shown that the apparent disagreement with hypotheses related to ubiquitous trans-splicing and widespread gene duplication is the result of technological limitations. By leveraging the statistical power of high-throughput sequencing, I found that spliced leader suffixes as short as six nucleotides are sufficient for positive identification. I also found that isoform sequences from families of conserved paralogs are systematically collapsed during assembly, but that many of these consensus sequences can be identified using a custom SNP-calling procedure. This procedure can be combined with traditional clustering based on pairwise sequence alignment to obtain a more complete picture of gene duplication in dinoflagellates.
Efficient, automated homology detection based on pairwise sequence alignment is an equally challenging problem for which there is much room for improvement. I explored alternative metrics for scoring alignments between sequences using a popular procedure based on BLAST and Markov clustering, and showed that simplified metrics perform as well as or better than more popular alternatives. I also found that Markov clustering of protein sequences suffers from a serious false positive problem when compared against manual curation, suggesting that it is more appropriate for pre-clustering of very large data sets than as a complete clustering solution. [en_US]
dc.identifier: https://doi.org/10.13016/M2D715
dc.identifier.uri: http://hdl.handle.net/1903/17312
dc.language.iso: en [en_US]
dc.subject.pqcontrolled: Bioinformatics [en_US]
dc.subject.pquncontrolled: clustering [en_US]
dc.subject.pquncontrolled: dinoflagellate [en_US]
dc.subject.pquncontrolled: illumina [en_US]
dc.subject.pquncontrolled: paralogy [en_US]
dc.subject.pquncontrolled: spliced leader [en_US]
dc.subject.pquncontrolled: transcriptomics [en_US]
dc.title: Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput mRNA-Seq [en_US]
dc.type: Dissertation [en_US]

Files

Original bundle
Name: Gibbons_umd_0117E_16756.pdf
Size: 11.03 MB
Format: Adobe Portable Document Format