Dwarf: A Complete System for Analyzing High-Dimensional Data Sets
Dwarf: A Complete System for Analyzing High-Dimensional Data Sets
Files
Publication or External Link
Date
2004-08-23
Authors
Sismanis, John
Advisor
Roussopoulos, Nick
Citation
DRUM DOI
Abstract
The need for data analysis by different industries, including
telecommunications, retail, manufacturing and financial services, has
generated a flurry of research, highly sophisticated methods and
commercial products. However, all of the current attempts are haunted
by the so-called "high-dimensionality curse"; the complexity of space
and time increases exponentially with the number of analysis
"dimensions". This means that all existing approaches are limited
only to coarse levels of analysis and/or to approximate answers with
reduced precision. As the need for detailed analysis keeps
increasing, along with the volume and the detail of the data that is
stored, these approaches are very quickly rendered unusable. I have
developed a unique method for efficiently performing analysis that is
not affected by the high-dimensionality of data and scales only
polynomially -and almost linearly- with the dimensions without
sacrificing any accuracy in the returned results. I have implemented a
complete system (called "Dwarf") and performed an extensive
experimental evaluation that demonstrated tremendous improvements over
existing methods for all aspects of performing analysis -initial
computation, storing, querying and updating it.
I have extended my research to the "data-streaming" model where
updates are performed on-line, exacerbating any concurrent analysis
but has a very high impact on applications like security, network
management/monitoring router traffic control and sensor networks. I
have devised streaming algorithms that provide complex statistics
within user-specified relative-error bounds over a data stream. I
introduced the class of "distinct implicated statistics", which is
much more general than the established class of "distinct count"
statistics. The latter has been proved invaluable in applications such
as analyzing and monitoring the distinct count of species in a
population or even in query optimization. The "distinct implicated
statistics" class provides invaluable information about the
correlations in the stream and is necessary for applications such as
security. My algorithms are designed to use bounded amounts of memory
and processing -so that they can even be implemented in hardware for
resource-limited environments such as network-routers or sensors- and
also to work in "noisy" environments, where some data may be flawed
either implicitly due to the extraction process or explicitly.