Yan, Yongpan
Moult, John
As a result of recent successes in genome-scale studies, especially genome sequencing, large amounts of new biological data are now available. This naturally challenges the computational world to develop more powerful and precise analysis tools. In this work, three computational studies have been conducted, utilizing complete microbial genome sequences: the detection of operons, the composition of protein families, and the detection of lateral gene transfer events.

In the first study, two computational methods, termed the Gene Neighbor Method (GNM) and the Gene Gap Method (GGM), were developed for the detection of operons in microbial genomes. GNM utilizes the relatively high conservation of gene order in operons, compared with genes in general. GGM makes use of the relatively short gap between genes in operons compared with that otherwise found between adjacent genes. The two methods were benchmarked using biological pathway data and documented operon data. Operons were predicted for 42 microbial genomes. The predictions are used to infer possible functions for some hypothetical genes in prokaryotic genomes and have proven a useful adjunct to structure information in deriving protein function in our structural genomics project.

In the second study, we have developed an automated clustering procedure to classify protein sequences in a set of microbial genomes into protein families. Benchmarking shows the clustering method is sensitive at detecting remote family members, and has a low level of false positives. The aim of constructing this comprehensive protein family set is to address several questions key to structural genomics. First, our study indicates that approximately 20% of known families with three or more members currently have a representative structure. Second, the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes are sequenced. However, the vast majority of these families will be small. Third, it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.

The third study is the detection of lateral gene transfer events in microbial genomes. Two new high-throughput methods have been developed and applied to a set of 66 fully sequenced genomes. Both make use of a protein family framework. In the High Apparent Gene Loss (HAGL) method, the number and nature of gene loss events implied by classical evolutionary descent are analyzed. The higher the number of apparent losses, and the smaller the evolutionary distance over which they must have occurred, the more likely it is that one or more genes have been transferred into the family. The Evolutionary Rate Anomaly (ERA) method associates transfer events with proteins that appear to have an anomalously low rate of sequence change compared with the rest of that protein family. The methods are complementary in that the HAGL method works best with small families and the ERA method works best with larger ones. The methods have been parameterized against each other, such that they have high specificity (less than 10% false positives) and can detect about half of the test events. Application to the full set of genomes shows widely varying amounts of lateral gene transfer.
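The idea behind the Gene Gap Method described in the abstract (genes in an operon tend to be separated by shorter gaps than other adjacent genes) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Gene` record, the `predict_operons` function, the same-strand requirement, and the 50 bp `max_gap` threshold are all assumptions chosen for illustration, and the abstract does not state the cutoff that was actually used.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def predict_operons(genes, max_gap=50):
    """Group adjacent genes into candidate operons by intergenic gap.

    Two neighbouring genes are placed in the same group when they lie on
    the same strand and the gap between them is at most max_gap base pairs.
    The default of 50 bp is an illustrative assumption, not a value taken
    from the thesis.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    groups = []
    current = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            groups.append(current)
        current = [gene]
    if current:
        groups.append(current)
    return [[g.name for g in group] for group in groups]


# Toy input: a and b are close together on the same strand, c is far
# from b, and d is close to c but on the opposite strand.
example = [
    Gene("a", 100, 400, "+"),
    Gene("b", 420, 900, "+"),
    Gene("c", 1300, 1800, "+"),
    Gene("d", 1810, 2200, "-"),
]
print(predict_operons(example))
```

In this toy input only `a` and `b` are grouped together; single-gene groups are returned as well, so a caller interested only in multi-gene operon candidates would filter for groups of two or more. A practical method would presumably calibrate the threshold against documented operons, as the benchmarking described in the abstract suggests.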