Dwarf: A Complete System for Analyzing High-Dimensional Data Sets

Sismanis, John

Dwarf: A Complete System for Analyzing High-Dimensional Data Sets

dc.contributor.advisor	Roussopoulos, Nick	en_US
dc.contributor.author	Sismanis, John	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2004-10-09T05:17:50Z
dc.date.available	2004-10-09T05:17:50Z
dc.date.issued	2004-08-23	en_US
dc.description.abstract	The need for data analysis by different industries, including telecommunications, retail, manufacturing and financial services, has generated a flurry of research, highly sophisticated methods and commercial products. However, all of the current attempts are haunted by the so-called "high-dimensionality curse"; the complexity of space and time increases exponentially with the number of analysis "dimensions". This means that all existing approaches are limited only to coarse levels of analysis and/or to approximate answers with reduced precision. As the need for detailed analysis keeps increasing, along with the volume and the detail of the data that is stored, these approaches are very quickly rendered unusable. I have developed a unique method for efficiently performing analysis that is not affected by the high-dimensionality of data and scales only polynomially -and almost linearly- with the dimensions without sacrificing any accuracy in the returned results. I have implemented a complete system (called "Dwarf") and performed an extensive experimental evaluation that demonstrated tremendous improvements over existing methods for all aspects of performing analysis -initial computation, storing, querying and updating it. I have extended my research to the "data-streaming" model where updates are performed on-line, exacerbating any concurrent analysis but has a very high impact on applications like security, network management/monitoring router traffic control and sensor networks. I have devised streaming algorithms that provide complex statistics within user-specified relative-error bounds over a data stream. I introduced the class of "distinct implicated statistics", which is much more general than the established class of "distinct count" statistics. The latter has been proved invaluable in applications such as analyzing and monitoring the distinct count of species in a population or even in query optimization. The "distinct implicated statistics" class provides invaluable information about the correlations in the stream and is necessary for applications such as security. My algorithms are designed to use bounded amounts of memory and processing -so that they can even be implemented in hardware for resource-limited environments such as network-routers or sensors- and also to work in "noisy" environments, where some data may be flawed either implicitly due to the extraction process or explicitly.	en_US
dc.format.extent	1135359 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1903/1876
dc.language.iso	en_US
dc.subject.pqcontrolled	Computer Science	en_US
dc.subject.pquncontrolled	OLAP	en_US
dc.subject.pquncontrolled	Data Warehouse	en_US
dc.subject.pquncontrolled	Warehousing	en_US
dc.subject.pquncontrolled	Cube	en_US
dc.subject.pquncontrolled	Analysis	en_US
dc.subject.pquncontrolled	High Dimensionality;	en_US
dc.title	Dwarf: A Complete System for Analyzing High-Dimensional Data Sets	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: umi-umd-1762.pdf
Size:: 1.08 MB
Format:: Adobe Portable Document Format

Download

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations