Representing and Querying Correlated Tuples in Probabilistic Databases
Publication or External Link
Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples) that make it difficult to use them in applications that naturally produce correlated data, and (2) most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languages. We address both these limitations by proposing a framework that can represent not only probabilistic tuples, but also correlations that may be present among them. Our proposed framework naturally lends itself to the possible world semantics thus preserving the precise query semantics extant in current probabilistic databases. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model. We present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach by presenting an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries using real and synthetic datasets.