FEATURE GENERATION AND ANALYSIS APPLIED TO SEQUENCE CLASSIFICATION FOR SPLICE-SITE PREDICTION

Islamaj, Rezarta

dc.contributor.advisor	Getoor, Lise	en_US
dc.contributor.author	Islamaj, Rezarta	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2008-04-22T16:06:49Z
dc.date.available	2008-04-22T16:06:49Z
dc.date.issued	2007-11-27	en_US
dc.description.abstract	Sequence classification is an important problem in many real-world applications. Sequence data often contain no explicit "signals," or features, to enable the construction of classification algorithms. Extracting and interpreting the most useful features is challenging, and hand construction of good features is the basis of many classification algorithms. In this thesis, I address this problem by developing a feature-generation algorithm (FGA). FGA is a scalable method for automatic feature generation for sequences; it identifies sequence components and uses domain knowledge, systematically constructs features, explores the space of possible features, and identifies the most useful ones. In the domain of biological sequences, splice-sites are locations in DNA sequences that signal the boundaries between genetic information and intervening non-coding regions. Only when splice-sites are identified with nucleotide precision can the genetic information be translated to produce functional proteins. In this thesis, I address this fundamental process by developing a highly accurate splice-site prediction model that employs our sequence feature-generation framework. The FGA model shows statistically significant improvements over state-of-the-art splice-site prediction methods. So that biologists can understand and interpret the features FGA constructs, I developed SplicePort, a web-based tool for splice-site prediction and analysis. With SplicePort the user can explore the relevant features for splicing, and can obtain splice-site predictions for the sequences based on these features. For an experimental biologist trying to identify the critical sequence elements of splicing, SplicePort offers flexibility and a rich motif exploration functionality, which may help to significantly reduce the amount of experimentation needed. 
In this thesis, I present examples of the observed feature groups and describe efforts to detect biological signals that may be important for the splicing process. Naturally, FGA can be generalized to other biologically inspired classification problems, such as identifying tissue-specific regulatory elements, polyadenylation sites, and promoters, as well as to other sequence classification problems, provided we have sufficient knowledge of the new domain.	en_US
dc.format.extent	3406689 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1903/7745
dc.language.iso	en_US
dc.subject.pqcontrolled	Computer Science	en_US
dc.subject.pquncontrolled	machine learning	en_US
dc.subject.pquncontrolled	data mining	en_US
dc.subject.pquncontrolled	bioinformatics	en_US
dc.subject.pquncontrolled	splice-site prediction	en_US
dc.subject.pquncontrolled	sequence-motif analysis	en_US
dc.title	FEATURE GENERATION AND ANALYSIS APPLIED TO SEQUENCE CLASSIFICATION FOR SPLICE-SITE PREDICTION	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Name: umi-umd-5026.pdf
Size: 3.25 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations