Enabling Graph Analysis Over Relational Databases

Thumbnail Image

Publication or External Link





Complex interactions and systems can be modeled by analyzing the connections between underlying entities or objects described by a dataset. These relationships form networks (graphs), the analysis of which has been shown to provide tremendous value in areas ranging from retail to many scientific domains. This value is obtained by using various methodologies from network science-- a field which focuses on studying network representations in the real world. In particular "graph algorithms", which iteratively traverse a graph's connections, are often leveraged to gain insights. To take advantage of the opportunity presented by graph algorithms, there have been a variety of specialized graph data management systems, and analysis frameworks, proposed in recent years, which have made significant advances in efficiently storing and analyzing graph-structured data.

Most datasets however currently do not reside in these specialized systems but rather in general-purpose relational database management systems (RDBMS). A relational or similarly structured system is typically governed by a schema of varying strictness that implements constraints and is meticulously designed for the specific enterprise. Such structured datasets contain many relationships between the entities therein, that can be seen as latent or "hidden" graphs that exist inherently inside the datasets. However, these relationships can only typically be traversed via conducting expensive JOINs using SQL or similar languages.

Thus, in order for users to efficiently traverse these latent graphs to conduct analysis, data needs to be transformed and migrated to specialized systems. This creates barriers that hinder and discourage graph analysis; our vision is to break these barriers.

In this dissertation we investigate the opportunities and challenges involved in efficiently leveraging relationships within data stored in structured databases.

First, we present GraphGen, a lightweight software layer that is independent from the underlying database, and provides interfaces for graph analysis of data in RDBMSs. GraphGen is the first such system that introduces an intuitive high-level language for specifying graphs of interest, and utilizes in-memory graph representations to tackle the problems associated with analyzing graphs that are hidden inside structured datasets. We show GraphGen can analyze such graphs in orders of magnitude less memory, and often computation time, while eliminating manual Extract-Transform-Load (ETL) effort.

Second, we examine how in-memory graph representations of RDBMS data can be used to enhance relational query processing. We present a novel, general framework for executing GROUP BY aggregation over conjunctive queries which avoids materialization of intermediate JOIN results, and wrap this framework inside a multi-way relational operator called Join-Agg. We show that Join-Agg can compute aggregates over a class of relational and graph queries using orders of magnitude less memory and computation time.