COMPUTATIONAL MODELING OF THE RELATIONSHIP BETWEEN SNPS AND DISEASE
We have developed two models, the stability model and the profile model, to identify non-synonymous single base changes (the most common cause of monogenic disease) that have deleterious effects on protein function in vivo. The stability model analyzes the effect of the resulting amino acid change on protein stability, using structural information such as reduction in hydrophobic area and loss of electrostatic interactions. The profile model uses the conservation and type of residues observed at the base change position within a protein family. In each model, a support vector machine (SVM), a machine learning technique, was trained on a set of disease-causing mutations and a control set of non-disease-causing mutations. In jack-knifed testing, the stability model identifies 74% of disease mutations with a false positive rate of 15%; the profile model identifies 80% of disease mutations with a false positive rate of 10%. Evaluation of a set of in vitro mutagenesis data with the stability model established that the majority of disease mutations affect protein stability by 1 to 3 kcal/mol. The stability model's effective distinction between disease and non-disease variants strongly supports the hypothesis that loss of protein stability is a major factor contributing to monogenic disease. Both models are used to identify deleterious SNPs in the human population. After carefully controlling for errors, we find that approximately one-fourth of the known non-synonymous SNPs are deleterious, thus providing a set of possible SNPs contributing to human complex disease traits.
A web resource has been developed to provide information on disease/gene relationships at the molecular level. The resource has three primary modules. The first module publishes the deleterious SNPs identified by the two models described above. The second module identifies the candidate genes for a specific disease, and the third module provides information about the relationships between the sets of candidate genes. Disease/candidate gene relationships and gene-gene relationships are derived from the literature using a simple but effective text profiling method.
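The evaluation figures quoted for the two models (fraction of disease mutations identified, and false positive rate on the control set) can be illustrated with a minimal sketch. It assumes that jack-knifed testing means leave-one-out testing, which the abstract does not specify, and it substitutes a nearest-centroid classifier for the SVM to stay dependency-free. The function names and the two-value feature vectors are hypothetical and are not taken from this work.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Classifier = Callable[[Vector], int]  # 1 = disease, 0 = non-disease


def train_centroid(features: List[Vector], labels: List[int]) -> Classifier:
    """Stand-in for SVM training: a nearest-centroid classifier (NOT an SVM).

    Assumes the training set contains at least one example of each class.
    """
    def centroid(cls: int) -> List[float]:
        rows = [f for f, y in zip(features, labels) if y == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return lambda v: 1 if dist2(v, c_pos) < dist2(v, c_neg) else 0


def jackknife_predict(
    features: List[Vector],
    labels: List[int],
    train: Callable[[List[Vector], List[int]], Classifier] = train_centroid,
) -> List[int]:
    """Predict each example with a model trained on all the other examples."""
    preds = []
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        preds.append(train(rest_x, rest_y)(features[i]))
    return preds


def rates(labels: List[int], preds: List[int]) -> Tuple[float, float]:
    """Return (fraction of disease mutations identified, false positive rate)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Hypothetical toy features, e.g. (stability-loss score, conservation score).
toy_x = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
toy_y = [1, 1, 1, 0, 0, 0]
toy_rates = rates(toy_y, jackknife_predict(toy_x, toy_y))
```

Under this reading, a result such as "74% of disease mutations with a false positive rate of 15%" corresponds to the pair returned by `rates` when every mutation is scored by a model that never saw it during training.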