Mixture Models for Nucleic Acid Sequence Feature Analysis

Files

Wang_umd_0117E_24248.pdf (17.16 MB)
(RESTRICTED ACCESS)

Date

2023

Abstract

Signals in nucleotide sequences play a crucial role in interactions among macromolecules and in the regulation of biological processes such as transcription, the splicing of messenger RNA precursors, and translation. Recognizing these signals is the first step in functional annotation, which is critical for identifying deleterious mutations and targets for disease treatment. RNA splicing, one of the essential steps in gene expression, removes introns from newly transcribed RNA and ligates exons to generate mature RNA. Splicing involves the formation and recycling of the spliceosome, a large macromolecular complex whose assembly requires complex coordination by splicing factors through the recognition of RNA-protein binding sites. One potential method to reveal unknown subtypes of samples and identify distinctively distributed features is to apply a mixture model known as the admixture model or Latent Dirichlet Allocation (LDA), which allows each sample to have partial membership in several clusters; these clusters can then be interpreted to identify functional motifs. By applying mixture models to RNA sequences, I found splicing signals such as the polypyrimidine tract and the branch point in intron sequences. Mixture models also revealed motifs associated with reading frames in coding sequences, which in turn revealed potential coding regions in 5' untranslated regions and long non-coding RNAs.
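To make the idea of partial cluster membership concrete, the following is a minimal sketch of LDA applied to sequences, not the implementation used in the dissertation. It assumes each sequence is treated as a "document" whose "words" are overlapping k-mers, and it fits the model with a plain collapsed Gibbs sampler. The function names (`kmers`, `lda_gibbs`), the k-mer length, the hyperparameters, and the toy sequences are all hypothetical choices made for illustration.

```python
import random

def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (the 'words' of a document)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, phi, vocab): theta[d][t] is the share of document d
    assigned to cluster t, and phi[t][w] is the probability of k-mer
    vocab[w] under cluster t.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    docs = [[index[w] for w in d] for d in docs]

    doc_topic = [[0] * n_topics for _ in docs]               # counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]    # counts per cluster
    topic_total = [0] * n_topics
    assign = []

    # Random initial assignment of every k-mer occurrence to a cluster.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(row)

    # Resample each assignment conditioned on all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + n_words * beta)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
             for row in doc_topic]
    phi = [[(c + beta) / (topic_total[t] + n_words * beta)
            for c in topic_word[t]] for t in range(n_topics)]
    return theta, phi, vocab

# Toy, made-up sequences: two pyrimidine-rich, two purine-rich, one mixed.
seqs = ["CUUUCUCUUUCCUUUUCUCC", "UUCCUUUCUCUUCCUUUUCU",
        "GAAGGAGAAGGAAGAGGAAG", "AGGAAGAGGAGAAGGAAGGA",
        "CUUUCUCUUUGAAGGAGAAG"]
docs = [kmers(s) for s in seqs]
theta, phi, vocab = lda_gibbs(docs)
for s, row in zip(seqs, theta):
    print(s, [round(p, 2) for p in row])
```

Each row of `theta` gives one sequence's memberships across clusters, which is the "partial membership" property described above. The highest-probability k-mers in each row of `phi` are the candidates one would inspect as motifs.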

Dynamic single-molecule imaging of nascent RNAs coupled with multiple genome-wide assays reveals that splicing happens far more often than expected, and that partial intron removal can be captured before the entire transcript is complete. I hypothesize that the spliceosome progressively removes large introns in small pieces through 'recursive splicing' instead of removing the whole intron at once. However, the sequence features that distinguish sites of recursive splicing from canonical splice sites remain to be discovered. Here, I applied mixture models to sequences from human introns to identify sequence features associated with recursive splicing. This method helped me recognize and visualize splicing signals in annotated intron sequences and identify potential coding sequences in human 5' untranslated regions and long non-coding RNAs. After applying mixture models to the sequences surrounding recursive and canonical splice sites, I found that transcripts whose large introns can be recursively spliced can be distinguished from those without recursive splicing by the presence of CG-rich motifs flanking 5' splice sites upstream of first introns, and by the absence of DNA methylation at these sites. In addition to applying mixture models, I explored RNA Bind-N-Seq data reflecting the binding activity of the splicing factor U2AF and found that recursive 3' splice sites have higher U2AF binding affinities than the downstream canonical 3' splice sites.

These observations suggest, first, that mixture models have the potential to identify functional motifs, including subtle sequence signals such as branch sites that occur only in a subgroup of introns. Second, the usage of recursive splicing sites is associated with sequence features in the first exons of transcripts, suggesting a testable model for the regulation of recursive splicing in human introns.
