INVESTIGATION OF SOME POSSIBLE ORIGINS OF PROTEIN FAMILIES
ABSTRACT
Title of Document: INVESTIGATION OF SOME POSSIBLE ORIGINS OF PROTEIN FAMILIES
Nuttinee Teerakulkittipong, Ph.D., 2013
Directed By: Professor John Moult, Institute for Bioscience and Biotechnology Research, Department of Cell Biology and Molecular Genetics

The prevailing view of the evolutionary history of proteins has been that all protein domains are descendants of distinct evolutionary lines, and that these lines are all relatively ancient families. The primary basis for that view was that known protein structures could be grouped by similarity of topology into a small number of folds. However, two lines of evidence challenge that view of protein evolution. First, analysis of sequence relationships within and between sets of complete genomes has established that a large proportion of protein sequence families are narrowly distributed in phylogenetic space and so appear to be relatively recent in origin. Second, analysis of the relationships between known protein structures shows that there are many more than 1,000 distinct folds, which appears to imply many more evolutionary lines. There are four hypotheses for the discrepancy between the traditional view and the observed structural and sequence distributions within protein families. Specifically, these are that apparently young protein families may arise from (1) previously non-coding DNA, or frame-shifted existing coding sequence, (2) recombination of structural fragments between proteins or recombination with non-coding DNA, (3) older families where the rapid rate of sequence change makes relatives hard to detect, and (4) lateral gene transfer (LGT) from other organisms. In the investigation of these hypotheses, phylogenetic analysis provides a means of estimating the relative age of protein families and of detecting lateral gene transfer effects.
Phylogeny-based investigation of prokaryotic species divergence has generally been performed using a small number of families, resulting in significant bias that affects age analysis. Therefore, we decided to use information from many protein families for constructing a species tree, utilizing a new procedure for combining these diverse sources. The resulting tree for 66 prokaryotic species incorporates information from 1,379 protein families. The families were selected on the basis of consistent family evolutionary rates obtained using three different methods. Noise-resistant methods were used to combat the effects of lateral gene transfer and some inevitable errors in protein sequence alignment and identification of orthologous families. Most topological features of the tree are robust as assessed by bootstrap testing, and previous distortions of inter-kingdom distances and poor determination of short branch lengths have been corrected. The tree is used to obtain estimates of the age of all protein families, which are key to the investigation of all four hypotheses. Proteins affected by LGT events were detected using a previously developed method and removed before the age calculation. We used the estimated family ages obtained from the phylogenetic analysis to examine five properties of proteins as a function of the age of the corresponding families. The goal here is to ascertain whether the age dependence of these properties supports hypotheses (1) and (2) for the origin of apparently young families - that is, that these are truly new open reading frames. The five properties are the mRNA expression level, relative evolutionary rate, predicted percentage of structural disorder, number of protein interaction partners, and codon composition bias.
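As an illustration of how a family age might be read off a species tree, the following minimal Python sketch assumes one simple convention: the age of a family is the mean branch-length distance from the most recent common ancestor (MRCA) of the species carrying the family down to those species. The abstract does not state the thesis's actual age calculation, so this convention, the toy tree, its branch lengths, and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical toy species tree: node -> (parent, branch length to parent).
# The root has parent None. Names and lengths are illustrative only.
Tree = Dict[str, Tuple[Optional[str], float]]

TREE: Tree = {
    "root": (None, 0.0),
    "bacteria": ("root", 1.0),
    "archaea": ("root", 1.2),
    "proteo": ("bacteria", 0.5),
    "E_coli": ("proteo", 0.2),
    "S_enterica": ("proteo", 0.25),
    "B_subtilis": ("bacteria", 0.8),
    "M_jannaschii": ("archaea", 0.9),
}


def path_to_root(tree: Tree, node: str) -> List[str]:
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while tree[node][0] is not None:
        node = tree[node][0]
        path.append(node)
    return path


def mrca(tree: Tree, species: List[str]) -> str:
    """Most recent common ancestor of the given leaves."""
    paths = [path_to_root(tree, s) for s in species]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking upward from the first leaf, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in common)


def distance_up(tree: Tree, node: str, ancestor: str) -> float:
    """Sum of branch lengths from `node` up to `ancestor`."""
    total = 0.0
    while node != ancestor:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def family_age(tree: Tree, species: Iterable[str]) -> float:
    """Assumed age convention: mean distance from the MRCA to each member species."""
    members = list(species)
    ancestor = mrca(tree, members)
    return sum(distance_up(tree, s, ancestor) for s in members) / len(members)
```

Under this convention, a family found only in the two closely related toy species receives a small age, while a family spanning both toy kingdoms is dated to the root. Lateral gene transfer would inflate such an estimate, which is consistent with the abstract's removal of LGT-affected proteins before the age calculation.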
The results are consistent with the new open reading frame model. Expression is found to increase substantially as a function of family age, suggesting that young proteins are not yet adapted sufficiently to tolerate high concentration conditions. The rate of amino acid change is faster for young proteins, consistent with overall positive selection for improved structural and functional properties. The fraction of predicted disorder is highest in the youngest proteins, consistent with immature structural properties. The number of known protein-protein interactions increases steadily with age, with low levels for young proteins, suggesting an ongoing process of increasing functional complexity. Analysis of these four factors is reported in Chapter 3. Results for the final factor, codon compositional bias, are reported in Chapter 4. Here we found that the codon composition of young proteins is markedly different from that of old proteins and similar to that of proteins constructed with random codon assignment. Thus the results are consistent with a model in which many young proteins have newly formed open reading frames and, during subsequent evolution, the codon composition is gradually optimized to fit the specific genomic conditions of the organism concerned. Overall, results for all five properties lend statistical support to the new open reading frame hypotheses. Further investigation is needed, however. In particular, examination of the structural properties of young proteins, such as super-secondary structure composition and the distribution of use of rare and common structural fragments, should be useful.
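The codon composition comparison can be illustrated with a minimal Python sketch. It computes codon frequencies for a coding sequence and measures how far two codon distributions lie apart using the total variation distance. Both the choice of distance and the uniform distribution over all 64 codons, used here as a stand-in for "random codon assignment", are assumptions made for illustration; the abstract does not specify the thesis's random model or its statistic, and all names below are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict

# All 64 DNA codons in a fixed order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

# Assumed stand-in for a random codon model: every codon equally likely.
UNIFORM: Dict[str, float] = {c: 1.0 / len(CODONS) for c in CODONS}


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not complete a codon, and codons containing
    characters other than A, C, G, T, are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid codons")
    return {c: counts[c] / total for c in CODONS}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two codon distributions (0 to 1)."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in CODONS)
```

In such a comparison, one would pool coding sequences by family age, compute a frequency profile for each pool, and check whether the young pool sits closer to the random reference than the old pool does. A small distance to the reference would be in line with the abstract's observation, although it would not by itself establish a new open reading frame origin.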