A framework for discovering meaningful associations in the annotated life sciences Web

Lee, Woei-jyh

dc.contributor.advisor	Raschid, Louiqa	en_US
dc.contributor.advisor	Tseng, Chau-Wen	en_US
dc.contributor.author	Lee, Woei-jyh	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2009-07-02T05:58:01Z
dc.date.available	2009-07-02T05:58:01Z
dc.date.issued	2009	en_US
dc.description.abstract	During the last decade, life sciences researchers have gained access to the entire human genome, reliable high-throughput biotechnologies, affordable computational resources, and public network access. These advances have produced vast amounts of data and knowledge captured in the life sciences Web, and have created the need for new tools to analyze this knowledge and make discoveries. Consider a simplified Web of three publicly accessible data resources: Entrez Gene, PubMed, and OMIM. Data records in each resource are annotated with terms from multiple controlled vocabularies (CVs). The links between data records in two resources form a relationship between the two resources. Thus, a record in Entrez Gene, annotated with GO terms, can have links to multiple records in PubMed that are annotated with MeSH terms. Similarly, OMIM records annotated with terms from SNOMED CT may have links to records in Entrez Gene and PubMed. This forms a rich web of annotated data records. The objective of this research is to develop the Life Science Link (LSLink) methodology and tools to discover meaningful patterns across resources and CVs. In a first step, we execute a protocol to follow links, extract annotations, and generate datasets of termlinks, which consist of data records and CV terms. We then mine the termlinks of the datasets to find potentially meaningful associations between pairs of terms from two CVs. Biologically meaningful associations of pairs of CV terms may yield nuggets of previously unknown knowledge. Moreover, the bridge of associations across CV terms will reflect how scientists annotate data across linked data repositories in practice. Contributions include a methodology to create background datasets, metrics for mining patterns, the application of semantic knowledge for generalization, tools for discovery, and validation with biological use cases.
Inspired by research in association rule mining and linkage analysis, we develop two metrics to determine support and confidence scores for the associations of pairs of CV terms. Associations that have statistically significant high scores and are biologically meaningful may lead to new knowledge. To further validate the support and confidence metrics, we develop a secondary test for significance based on the hypergeometric distribution. We also exploit the semantics of the CVs. We aggregate termlinks over siblings of a common parent CV term and use them as additional evidence to boost the support and confidence scores for the associations of the parent CV term. We provide a simple discovery interface where biologists can review associations and their scores. Finally, a cancer informatics use case validates the discovery of associations between human genes and diseases.	en_US
dc.format.extent	8426891 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1903/9223
dc.language.iso	en_US
dc.subject.pqcontrolled	Computer Science	en_US
dc.subject.pqcontrolled	Biology, Bioinformatics	en_US
dc.subject.pquncontrolled	annotations	en_US
dc.subject.pquncontrolled	associations	en_US
dc.subject.pquncontrolled	hypergeometric distribution	en_US
dc.subject.pquncontrolled	Life Science Link (LSLink)	en_US
dc.subject.pquncontrolled	mining in life sciences	en_US
dc.subject.pquncontrolled	support and confidence scores	en_US
dc.title	A framework for discovering meaningful associations in the annotated life sciences Web	en_US
dc.type	Dissertation	en_US
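
The abstract mentions support and confidence scores for pairs of CV terms and a secondary significance test based on the hypergeometric distribution. The sketch below illustrates those ideas using the textbook association-rule definitions; it is not the dissertation's actual metric definitions, which the abstract does not give. It assumes each termlink can be reduced to a pair (term_a, term_b) with one term from each CV. The function names and the term identifiers in the usage note are hypothetical.

```python
from collections import Counter
from math import comb


def hypergeom_tail(k, n, n_a, n_b):
    """P(X >= k) for X ~ Hypergeometric(population n, n_a marked, n_b drawn).

    Here: the chance of seeing at least k co-occurrences of two terms if
    term_a (in n_a termlinks) and term_b (in n_b termlinks) were
    distributed independently over n termlinks.
    """
    upper = min(n_a, n_b)
    total = comb(n, n_b)
    hits = sum(comb(n_a, i) * comb(n - n_a, n_b - i) for i in range(k, upper + 1))
    return hits / total


def score_pairs(termlinks):
    """Return {(term_a, term_b): (support, confidence, p_value)}.

    Assumed (textbook) definitions, not taken from the source:
      support    = count(a, b) / total termlinks
      confidence = count(a, b) / count(a)
    """
    n = len(termlinks)
    pair_counts = Counter(termlinks)
    a_counts = Counter(a for a, _ in termlinks)
    b_counts = Counter(b for _, b in termlinks)
    scores = {}
    for (a, b), k in pair_counts.items():
        support = k / n
        confidence = k / a_counts[a]
        p_value = hypergeom_tail(k, n, a_counts[a], b_counts[b])
        scores[(a, b)] = (support, confidence, p_value)
    return scores
```

As a usage example, calling score_pairs on a list such as [("go:x", "mesh:p"), ("go:x", "mesh:p"), ("go:x", "mesh:q"), ("go:y", "mesh:q")] returns one score triple per distinct pair. A low p-value combined with high support and confidence would flag a pair for review; whether it is biologically meaningful remains a judgment for a domain expert.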

Files

Original bundle

Name: Lee_umd_0117E_10090.pdf
Size: 8.04 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations