Application of advanced machine learning strategies for biomedical research

Biomedical research seeks to understand individual health and disease mechanisms. Recent technological advances have transformed the field by producing large-scale data sets, enabling data-driven approaches that identify important patterns and relationships. However, these data sets are often noisy and unstructured, and missing values and high dimensionality further complicate analyses aimed at yielding meaningful results. With examples in ocular diseases and malaria, this dissertation presents novel machine learning strategies to tackle some of these challenges in biomedical research.

In ocular diseases, sustained ocular drug delivery is critical to maintain therapeutic drug levels and improve patient adherence to dosing schedules. To enhance sustained delivery, we engineer peptide sequences as adapters that impart desired properties to ocular drugs. Specifically, we develop separate machine learning models for three properties: melanin binding, cell penetration, and non-toxicity. We employ data reduction techniques to reduce the number of features while maintaining model performance, and we apply interpretable machine learning techniques to explain model predictions for the three properties. Experimental validation in rabbits shows a two-fold increase in drug retention time with the selected peptide candidate. The machine learning framework can be further tailored to engineer other properties in molecular sequences, with potential across a wide variety of biomedical applications.

Malaria is an infectious disease caused by protozoan parasites of the genus Plasmodium and has long been a burden on global health. Developing malaria vaccines is challenging because of the diversity of parasite antigen sequences, which may lead to immune escape. To facilitate vaccine development, we leverage the wealth of systems data collected from various sources. For ease of data management, we construct a database to store the structured data processed from the outputs of bioinformatics tools. Because only a small fraction of Plasmodium proteins are labeled as known antigens, while it is unknown whether the remaining proteins are antigens or non-antigens, we apply a positive-unlabeled machine learning method to identify potential vaccine antigen candidates. Beyond malaria, our approach provides a promising framework for identifying and prioritizing vaccine antigen candidates for a broad range of pathogens.
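
As an illustration only, the positive-unlabeled setting described above can be sketched in a few lines of Python. The abstract does not state which positive-unlabeled method the dissertation uses, so this sketch swaps in the Elkan-Noto calibration approach: a classifier is trained to separate labeled from unlabeled examples, and its output is rescaled by the estimated labeling frequency. The synthetic two-feature data stands in for protein features, and all names (`make_toy_data`, `fit_logistic`, `pu_scores`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_toy_data(n_pos=100, n_neg=100, label_frac=0.3, seed=0):
    """Synthetic stand-in for protein features (assumption, not real data).
    Returns X (features), s (1 = labeled positive, 0 = unlabeled),
    and y (hidden true class, used only for checking the sketch)."""
    rng = random.Random(seed)
    X, s, y = [], [], []
    for i in range(n_pos):
        X.append([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)])
        s.append(1 if i < int(label_frac * n_pos) else 0)
        y.append(1)
    for _ in range(n_neg):
        X.append([rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)])
        s.append(0)
        y.append(0)
    return X, s, y


def fit_logistic(X, s, lr=0.1, epochs=300):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, s):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_labeled_prob(model, x):
    # Estimated P(s=1 | x): probability that an example is labeled.
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def pu_scores(X, s, holdout_frac=0.2, seed=0):
    """Elkan-Noto style scoring: estimated P(y=1 | x), clipped to at most 1."""
    rng = random.Random(seed)
    pos_idx = [i for i, si in enumerate(s) if si == 1]
    rng.shuffle(pos_idx)
    n_hold = max(1, int(holdout_frac * len(pos_idx)))
    holdout = set(pos_idx[:n_hold])
    train = [i for i in range(len(X)) if i not in holdout]
    model = fit_logistic([X[i] for i in train], [s[i] for i in train])
    # c estimates P(s=1 | y=1) from held-out labeled positives.
    c = sum(predict_labeled_prob(model, X[i]) for i in holdout) / len(holdout)
    return [min(1.0, predict_labeled_prob(model, x) / c) for x in X]


X, s, y = make_toy_data()
scores = pu_scores(X, s)
# Rank unlabeled examples by score; top entries are candidate positives.
ranked = sorted((i for i in range(len(X)) if s[i] == 0),
                key=lambda i: scores[i], reverse=True)
print("Top unlabeled candidates (row indices):", ranked[:5])
```

In this toy setting, unlabeled rows that are truly positive should tend to score higher than true negatives. In practice the rescaling step relies on the assumption that labeled positives are a random sample of all positives, which may not hold for experimentally characterized antigens.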