Logic minimization and rule extraction for identification of functional sites in molecular sequences

dc.contributor.author: Cruz-Cano, Raul
dc.contributor.author: Lee, Mei-Ling Ting
dc.contributor.author: Leung, Ming-Ying
dc.date.accessioned: 2013-01-10T21:51:40Z
dc.date.available: 2013-01-10T21:51:40Z
dc.date.issued: 2012-08-16
dc.description.abstract:
Background: Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes and glycosylation is a posttranslational modification critical to protein functions.
Methods: In the present study, we first transformed the original biological dataset into a suitable binary form. Logic minimization was then applied to generate sets of simple rules to describe the transformed dataset. These rules were used to predict TFBS and O-glycosylation sites. The TFBS dataset is obtained from the TRANSFAC database, while the glycosylation dataset was compiled using information from OGLYCBASE and the Swiss-Prot Database. We performed the same predictions using two standard classification techniques, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and used their sensitivities and positive predictive values as benchmarks for the performance of our proposed algorithm. SVM were also used to reduce the number of variables included in the logic minimization approach.
Results: For both TFBS and O-glycosylation sites, the prediction performance of the proposed logic minimization method was generally comparable and, in some cases, superior to the standard ANN and SVM classification methods with the advantage of providing intelligible rules to describe the datasets. In TFBS prediction, logic minimization produced a very small set of simple rules. In glycosylation site prediction, the rules produced were also interpretable and the most popular rules generated appeared to correlate well with recently reported hydrophilic/hydrophobic enhancement values of amino acids around possible O-glycosylation sites. Experiments with Self-Organizing Neural Networks corroborate the practical worth of the logic minimization method for these case studies.
Conclusions: The proposed logic minimization algorithm provides sets of rules that can be used to predict TFBS and O-glycosylation sites with sensitivity and positive predictive value comparable to those from ANN and SVM. Moreover, the logic minimization method has the additional capability of generating interpretable rules that allow biological scientists to correlate the predictions with other experimental results and to form new hypotheses for further investigation. Additional experiments with alternative rule-extraction techniques demonstrate that the logic minimization method is able to produce accurate rules from datasets with large numbers of variables and limited numbers of positive examples. [en_US]
dc.description.uri: https://doi.org/10.1186/1756-0381-5-10
dc.identifier.citation: Cruz-Cano, R., Lee, ML.T. & Leung, MY. Logic minimization and rule extraction for identification of functional sites in molecular sequences. BioData Mining 5, 10 (2012). [en_US]
dc.identifier.uri: http://hdl.handle.net/1903/13388
dc.language.iso: en_US [en_US]
dc.relation.isAvailableAt: Epidemiology & Biostatistics
dc.relation.isAvailableAt: School of Public Health
dc.relation.isAvailableAt: Digital Repository at the University of Maryland (DRUM)
dc.relation.isAvailableAt: University of Maryland (College Park, MD)
dc.subject: molecular sequences [en_US]
dc.subject: bioinformatics [en_US]
dc.title: Logic minimization and rule extraction for identification of functional sites in molecular sequences [en_US]
dc.type: Article [en_US]
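
Illustrative note: the abstract describes two steps, transforming sequences into binary form and then applying logic minimization to obtain simple rules. The sketch below shows that general idea only; it is not the authors' implementation. The 2-bit nucleotide encoding, the function names, and the toy sequences are all assumptions made for illustration, and the minimizer is a textbook Quine-McCluskey merge followed by a greedy cover, which may not match the algorithm used in the paper.

```python
from itertools import combinations

# Hypothetical 2-bit encoding per nucleotide (an assumption for this sketch;
# the paper's actual binary transformation may differ).
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(seq):
    """Turn a DNA string into a bit string using ENCODING."""
    return "".join(ENCODING[base] for base in seq)


def merge(a, b):
    """Merge two implicants that differ in exactly one fixed bit, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None


def prime_implicants(terms):
    """Repeatedly merge implicants; those that never merge are prime."""
    current, primes = set(terms), set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used
        current = merged
    return primes


def covers(implicant, term):
    """True if the implicant matches the term ('-' matches either bit)."""
    return all(i == "-" or i == t for i, t in zip(implicant, term))


def minimize(positives):
    """Greedy cover of the positive examples by prime implicants.

    Greedy selection is a simplification and is not guaranteed to be minimal.
    """
    primes = sorted(prime_implicants(positives))
    uncovered = set(positives)
    rules = []
    while uncovered:
        best = max(primes, key=lambda p: sum(covers(p, t) for t in uncovered))
        rules.append(best)
        uncovered -= {t for t in uncovered if covers(best, t)}
    return rules


# Toy positive examples (invented): every 2-mer that starts with "A".
toy_sites = ["AA", "AC", "AG", "AT"]
rules = minimize([encode(s) for s in toy_sites])
print(rules)
```

In this toy case the four encoded sequences collapse into a single rule that fixes the two bits of the first nucleotide and leaves the second nucleotide free, which reads as "the first base is A". Because only positive examples are merged, the rules here describe exactly the observed positives; generalizing to unseen sequences would require treating unobserved patterns as don't-cares, which this sketch does not do.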

Files

Original bundle
Name: Cruz-Cano, et al.pdf
Size: 345.03 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.57 KB
Format: Item-specific license agreed upon to submission