Dwarf: A Complete System for Analyzing High-Dimensional Data Sets

Thumbnail Image
umi-umd-1762.pdf(1.08 MB)
No. of downloads: 2092
Publication or External Link
Sismanis, John
Roussopoulos, Nick
The need for data analysis by different industries, including telecommunications, retail, manufacturing and financial services, has generated a flurry of research, highly sophisticated methods and commercial products. However, all of the current attempts are haunted by the so-called "high-dimensionality curse"; the complexity of space and time increases exponentially with the number of analysis "dimensions". This means that all existing approaches are limited only to coarse levels of analysis and/or to approximate answers with reduced precision. As the need for detailed analysis keeps increasing, along with the volume and the detail of the data that is stored, these approaches are very quickly rendered unusable. I have developed a unique method for efficiently performing analysis that is not affected by the high-dimensionality of data and scales only polynomially -and almost linearly- with the dimensions without sacrificing any accuracy in the returned results. I have implemented a complete system (called "Dwarf") and performed an extensive experimental evaluation that demonstrated tremendous improvements over existing methods for all aspects of performing analysis -initial computation, storing, querying and updating it. I have extended my research to the "data-streaming" model where updates are performed on-line, exacerbating any concurrent analysis but has a very high impact on applications like security, network management/monitoring router traffic control and sensor networks. I have devised streaming algorithms that provide complex statistics within user-specified relative-error bounds over a data stream. I introduced the class of "distinct implicated statistics", which is much more general than the established class of "distinct count" statistics. The latter has been proved invaluable in applications such as analyzing and monitoring the distinct count of species in a population or even in query optimization. The "distinct implicated statistics" class provides invaluable information about the correlations in the stream and is necessary for applications such as security. My algorithms are designed to use bounded amounts of memory and processing -so that they can even be implemented in hardware for resource-limited environments such as network-routers or sensors- and also to work in "noisy" environments, where some data may be flawed either implicitly due to the extraction process or explicitly.