Describing and Modeling Repetitive Sequences in DNA
Abstract
A significant fraction of the genome (the complete DNA sequence) of most organisms consists of sequences that have similar copies elsewhere in the genome. Although most repetitive DNA was originally thought to have no function, a growing body of literature suggests that repetitive sequences are vital to the genome.
The goal of this dissertation is to analyze statistical properties of repetitive sequences in the genomes of several organisms. We find striking features of repetitive sequences in the human genome and in the genomes of C. elegans (worm), A. thaliana (thale cress, a mustard relative) and D. melanogaster (fruit fly), with some comparison to S. cerevisiae (yeast) and E. coli (a bacterium). We find that the distribution of the number of times each 40-mer (sequence of 40 bases) occurs in a genome is approximately a power law. We analyze in detail the separation between copies of 40-mers that occur exactly twice in a chromosome and observe that a significant portion of these pairs, which we call "proximal," have extremely small separations, while the remaining "distant" pairs have separations more consistent with copies placed uniformly at random throughout the chromosome. We introduce a type of exactly repetitive region, which we call a "repeat string," and find that the distribution of lengths of repeat strings is roughly a power law.
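The two 40-mer statistics described above can be illustrated with a minimal Python sketch. It is not the dissertation's own code: the function names are hypothetical, the sequence is treated as a single forward strand (reverse complements are ignored), and everything is held in memory, which would not scale to a full chromosome without a more compact index.

```python
from collections import defaultdict


def kmer_positions(sequence, k=40):
    """Map each k-mer to the sorted list of start positions where it occurs."""
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return positions


def occurrence_histogram(positions):
    """Return {n: number of distinct k-mers that occur exactly n times}."""
    histogram = defaultdict(int)
    for starts in positions.values():
        histogram[len(starts)] += 1
    return dict(histogram)


def pair_separations(positions):
    """Distances between the two copies of every k-mer that occurs exactly twice."""
    return sorted(
        starts[1] - starts[0]
        for starts in positions.values()
        if len(starts) == 2
    )
```

Plotting the histogram on log-log axes is one way to check for an approximate power law, and the list of separations is the raw material for distinguishing very close pairs from widely separated ones. The abstract does not state the cutoff between "proximal" and "distant" pairs, so none is assumed here.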
Because these properties hold for the genomes of a variety of organisms, there may be a common explanation of their origin. Where possible, we suggest evolutionary mechanisms that could give rise to such statistical properties. In particular, we develop a model of the evolution of repeat strings in a genome. We find that, under quite general conditions, the stationary distribution of our evolutionary model is the Pareto distribution, a close relative of the power law distribution.
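For readers unfamiliar with the Pareto distribution, its standard textbook form is shown below. The symbols are not taken from the abstract: $x_m > 0$ (the minimum value) and $\alpha > 0$ (the shape parameter) are the usual generic names, and the abstract does not say how they relate to the parameters of the evolutionary model.

```latex
\[
  P(X > x) = \left(\frac{x_m}{x}\right)^{\alpha},
  \qquad
  f(x) = \frac{\alpha \, x_m^{\alpha}}{x^{\alpha + 1}},
  \qquad x \ge x_m .
\]
```

Both the tail probability and the density decay as a power of $x$, which is the sense in which the Pareto distribution is a close relative of a power law.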