Quality-Aware Data Source Management

Thumbnail Image

Publication or External Link





Data is becoming a commodity of tremendous value in many domains. The ease of collecting and publishing data has led to an upsurge in the number of available data sources --- sources that are highly heterogeneous in the domains they cover, the quality of data they provide, and the fees they charge for accessing their data. However, most existing data integration approaches, for combining information from a collection of sources, focus on facilitating integration itself but are agnostic to the actual utility or the quality of the integration result. These approaches do not optimize for the trade-off between the utility and the cost of integration to determine which sources are worth integrating.

In this dissertation, I introduce a framework for quality-aware data source management. I define a collection of formal quality metrics for different types of data sources, including sources that provide both structured and unstructured data. I develop techniques to efficiently detect the content focus of a large number of diverse sources, to reason about their content changes over time and to formally compute the utility obtained when integrating subsets of them. I also design efficient algorithms with constant factor approximation guarantees for finding a set of sources that maximizes the utility of the integration result given a cost budget. Finally, I develop a prototype quality-aware data source management system and demonstrate the effectiveness of the developed techniques on real-world applications.