Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item: Enhancing Modern Query Federation Systems: Execution Optimization, Performance Prediction, and Systems Assessment (2024)
Song, Chujun; Abadi, Daniel; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Modern applications distribute operational data across various storage systems in different locations, which simplifies application development but complicates data analytics. The prevalent solution has been to use an ETL (Extract, Transform, Load) process to consolidate data from different locations into a centralized data warehouse for further analytical processing. However, this method is computationally intensive and error-prone, compromises data freshness, and makes the centralized data warehouse a scalability bottleneck for analytical tasks. Query federation offers a promising alternative: it allows direct analysis of data in its original location, bypassing the ETL process and avoiding its drawbacks. Current query federation systems, however, are still far from optimal. We describe our work on enhancing modern query federation systems in three aspects: systems assessment, execution optimization, and performance prediction.

Systems Assessment: The concept of query federation is not new; early implementations such as Mariposa paved the way, and modern systems, including Presto, Trino, and Spark, have further developed and refined the feature, significantly enhancing its functionality and efficiency. Despite these advancements, best practices for designing and implementing such systems remain largely unexplored. To address this gap, we introduce a benchmark specifically designed to evaluate the effects of various design strategies on the desirable features of query federation systems, and we assess five representative systems, each employing different design strategies, against it. This part of the work identifies key bottlenecks in the different designs, examines the impact of various query optimization and execution strategies, explores optimal practices for designing the interface between the execution engine and the data sources, and proposes mitigation strategies for the identified bottlenecks.

Execution Optimization: Among the many design choices in query federation systems, we delve deeper into the workload assignment strategy between the query federation system and the data sources, since the data sources may themselves possess query processing capabilities. Current approaches typically follow one of two paradigms: either push as much computation as possible down to the underlying system, or treat the underlying system solely as a storage engine and execute the majority of the query processing within the query federation system itself. We observe that both approaches tend to leave CPUs underutilized on one side, either within the query federation engine or at the data sources. To tackle this inefficiency, we develop algorithms that adjust the workload distribution between the two sides, either statically or dynamically, to optimize CPU usage and reduce end-to-end query execution latency.
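To make the balancing idea concrete, here is a minimal sketch of a static, greedy assignment under an assumed per-operator cost model; the function and cost estimates are hypothetical illustrations, not the dissertation's actual algorithms:

```python
# Hypothetical sketch: statically splitting pushable operators between a
# federation engine and a data source so that neither side's CPU sits idle.
# cost_source / cost_engine are assumed per-operator CPU cost estimates.

def assign_operators(ops, cost_source, cost_engine):
    """Greedily place each operator on whichever side currently has
    less estimated CPU work queued."""
    load_source, load_engine = 0.0, 0.0
    plan = {}
    for op in ops:
        # Compare projected total load if the operator ran on each side.
        if load_source + cost_source[op] <= load_engine + cost_engine[op]:
            plan[op] = "source"
            load_source += cost_source[op]
        else:
            plan[op] = "engine"
            load_engine += cost_engine[op]
    return plan

# Example: the scan lands at the source, the filter and join in the engine.
print(assign_operators(
    ["scan", "filter", "join"],
    {"scan": 4.0, "filter": 1.0, "join": 6.0},   # cost at the data source
    {"scan": 5.0, "filter": 0.5, "join": 3.0}))  # cost in the engine
```

A dynamic variant would re-run the same comparison during execution, using observed rather than estimated costs.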
Performance Prediction: Accurate and timely performance estimation before query execution is vital for tasks such as index recommendation, query optimization, query tuning, and cluster management. This task, however, continues to pose significant challenges for query federation systems, which typically integrate computation engines while delegating storage management to the underlying systems. This architecture often leaves statistics unavailable for some tables, rendering many traditional cost estimation methods, which rely heavily on detailed statistics, impractical. Moreover, traditional cost estimation frequently incurs substantial errors, particularly on complex queries involving multiple joins. Machine learning-based approaches offer an alternative: they use learned models for cost prediction and have demonstrated superior performance over traditional methods. However, such models are generally evaluated on synthetic workloads, such as TPC-H, in experimental clusters, and real industrial workloads introduce numerous additional challenges. In this part, we evaluate these models on actual industrial workloads from the query federation system deployed at LinkedIn. We also introduce a new multi-task learning approach that better utilizes operator-level statistics to improve prediction accuracy, and we empirically investigate the upper bounds of accuracy achievable by these models.
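For a flavor of what a multi-task cost model can look like, here is a minimal sketch in PyTorch; the architecture, feature encoding, and auxiliary target are illustrative assumptions, not the model proposed in the dissertation:

```python
# Hypothetical sketch: a shared encoder over plan/operator features with
# one head per task, so an operator-level target (here, output row counts)
# shares gradients with, and regularizes, the query-latency estimate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskCostModel(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.latency_head = nn.Linear(hidden, 1)  # predicts query latency
        self.rows_head = nn.Linear(hidden, 1)     # predicts operator output rows

    def forward(self, x):
        h = self.encoder(x)
        return self.latency_head(h), self.rows_head(h)

def multitask_loss(pred_latency, pred_rows, true_latency, true_rows, alpha=0.5):
    # The auxiliary operator-level loss is weighted by alpha.
    return F.mse_loss(pred_latency, true_latency) + \
           alpha * F.mse_loss(pred_rows, true_rows)
```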
Item: Top-K Query Processing in Edge-Labeled Graph Data (2016)
Park, Noseong; Subrahmanian, VS; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Edge-labeled graphs have proliferated rapidly over the last decade due to the increased popularity of social networks and the Semantic Web. In social networks, relationships between people are represented by edges, and each edge is labeled with a semantic annotation; a single huge graph can thus express many different relationships between entities. The Semantic Web represents each fragment of knowledge as a triple (subject, predicate, object), which is conceptually identical to an edge from subject to object labeled with the predicate. A set of triples constitutes an edge-labeled graph on which knowledge inference is performed.

Subgraph matching has been extensively used as a query language for patterns in edge-labeled graphs. For example, in social networks, users can specify a subgraph matching query to find all people that have certain neighborhood relationships. Heavily used fragments of the SPARQL query language for the Semantic Web, as well as the graph queries of other graph DBMSs, can also be viewed as subgraph matching over large graphs. Although subgraph matching has been extensively studied as a query paradigm in the Semantic Web and in social networks, a user can receive a large number of answers in response to a query, and these answers are best shown to the user according to an importance ranking. In this thesis, we present four scoring models, along with scalable algorithms that find the top-k answers via a suite of intelligent pruning techniques. The models cover a practically important subset of the SPARQL query language, augmented with additional useful features.

The first model, called Substitution Importance Query (SIQ), identifies the top-k answers whose scores are calculated from the matched vertices' properties in each answer, in accordance with a user-specified notion of importance. The second model, called Vertex Importance Query (VIQ), identifies important vertices in accordance with a user-defined scoring method that builds on various subgraphs articulated by the user. The third model, Approximate Importance Query (AIQ), allows partial and inexact matchings and returns the top-k of them under user-specified approximation terms and scoring functions. In the fourth model, called Probabilistic Importance Query (PIQ), a query consists of several sub-blocks: one mandatory block that must be mapped and other blocks that can be opportunistically mapped. A probability is calculated from various aspects of the answers, such as the number of mapped blocks and the vertices' properties in each block, and the k most probable answers are returned.
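The pruning common to all four models can be illustrated with a minimal, generic sketch (hypothetical interfaces; the dissertation's algorithms and indexes are far more sophisticated): a min-heap holds the k best answers seen so far, and any candidate whose optimistic score bound cannot beat the current k-th best score is discarded without full evaluation.

```python
import heapq

def topk_answers(candidates, k, score, upper_bound):
    """Generic top-k with pruning. `score(c)` is the exact score of a
    fully evaluated answer; `upper_bound(c)` is a cheap optimistic bound
    computable before full evaluation."""
    heap = []  # min-heap of (score, tie_breaker, answer), size <= k
    for i, cand in enumerate(candidates):
        kth_best = heap[0][0] if len(heap) == k else float("-inf")
        if upper_bound(cand) <= kth_best:
            continue  # pruned: even its best completion cannot enter the top-k
        s = score(cand)  # full (expensive) evaluation
        if len(heap) < k:
            heapq.heappush(heap, (s, i, cand))
        elif s > kth_best:
            heapq.heapreplace(heap, (s, i, cand))
    return sorted(heap, reverse=True)
```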
An important distinguishing feature of our work is that we allow the user a great deal of freedom in specifying: (i) which patterns and approximations the user considers important; (ii) how to score answers, irrespective of whether they are vertices or substitutions; and (iii) how to combine and aggregate the scores generated by multiple patterns and/or multiple substitutions. Because so much power is given to the user, indexing is more challenging than in settings where additional restrictions are imposed on the queries the user can ask. The proposed algorithms for the first model can also be used to answer SPARQL queries with ORDER BY and LIMIT, and the method for the second model also works for SPARQL queries with GROUP BY, ORDER BY, and LIMIT. We test our algorithms on multiple real-world graph databases, showing that they are far more efficient than popular triple stores.

Item: Real-time Analytics on Large Dynamic Graphs (2015)
Mondal, Jayanta; Deshpande, Amol; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In today's fast-paced and interconnected digital world, the data generated by an increasing number of applications is being modeled as dynamic graphs. The graph structure encodes relationships among data items, while structural changes to the graphs, together with the continuous stream of information produced by the entities in them, make the graphs dynamic in nature. Examples include social networks, where users post status updates, images, and videos; phone call networks, where nodes send text messages or place phone calls; and road traffic networks, where the traffic behavior of road segments changes constantly. There is tremendous value in storing, managing, and analyzing such dynamic graphs and deriving meaningful insights in real time. However, the majority of work in graph analytics assumes a static setting, and there has been no systematic study of the various dynamic scenarios, the complexity they impose on analysis tasks, or the challenges in building efficient systems that support such tasks at large scale.

In this dissertation, I design a unified streaming graph data management framework and develop prototype systems to support increasingly complex tasks on dynamic graphs. In the first part, I focus on the management and querying of distributed graph data. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to decide dynamically what data to replicate, and whether to replicate it eagerly or lazily, in order to minimize network communication and support low-latency querying.
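A minimal sketch of such a frequency-driven decision rule follows; the thresholds, counters, and names are illustrative assumptions, not the policy's actual parameters:

```python
# Hypothetical sketch: replicate a node's data when remote reads outweigh
# the cost of keeping replicas up to date, and choose eager (push) vs.
# lazy (pull) update propagation based on how read-heavy the node is.

def replication_decision(reads_per_sec, writes_per_sec,
                         replicate_ratio=2.0, eager_ratio=10.0):
    if writes_per_sec == 0:
        return "replicate-eager" if reads_per_sec > 0 else "no-replica"
    ratio = reads_per_sec / writes_per_sec
    if ratio < replicate_ratio:
        return "no-replica"       # write-heavy: replicas cost more than they save
    if ratio >= eager_ratio:
        return "replicate-eager"  # very read-heavy: push every update immediately
    return "replicate-lazy"       # moderately read-heavy: fetch updates on demand
```

A hybrid policy of this shape re-evaluates the decision as the monitored frequencies drift, so nodes migrate between the three regimes at run time.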
In the second part, I study the parallel execution of continuous neighborhood-driven aggregates, where each node aggregates the information generated in its neighborhood. I build my system around the notion of an aggregation overlay graph, a pre-compiled data structure that enables sharing of partial aggregates across different queries and allows partial pre-computation of the aggregates, minimizing query latencies and increasing throughput. Finally, I extend the framework to support continuous detection and analysis of activity-based subgraphs, where subgraphs can be specified using both graph structure and activity conditions on the nodes. Queries in my system are expressed using a set of active structural primitives, which allows the query evaluator to apply a set of novel optimization techniques and thereby achieve high throughput. Overall, in this dissertation, I define and investigate a set of novel tasks on dynamic graphs, design scalable optimization techniques, build prototype systems, and show the effectiveness of the proposed techniques through extensive evaluation on large-scale real and synthetic datasets.

Item: On the Foundations of Data Interoperability and Semantic Search on the Web (2011)
Haidarian Shahri, Hamid; Perlis, Donald; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This dissertation studies the problem of facilitating semantic search across disparate ontologies developed by different organizations. There is tremendous potential in enabling users to search independent ontologies and discover knowledge in a serendipitous fashion, i.e., in ways often completely unintended by the developers of those ontologies. The main difficulty with such search is that users generally have no control over the naming conventions and content of the ontologies, so terms must be mapped across ontologies based on their meaning. This meaning-based search of data is referred to as semantic search, and facilitating it (also known as semantic interoperability) requires mapping between ontologies. In relational databases, searching across organizational boundaries currently involves the difficult task of setting up a rigid information integration system. Linked Data representations tackle the problem of searching across organizational boundaries on the Web more flexibly; however, there is no consensus on how ontology mapping should be performed in this scenario, and the problem remains open.

We lay out the foundations of semantic search on the Web of Data by comparing it to keyword search in the relational model and by providing effective mechanisms to facilitate data interoperability across organizational boundaries. We identify two sharply distinct goals for ontology mapping, based on real-world use cases: (i) ontology development, and (ii) facilitating interoperability. We systematically analyze these goals side by side and contrast them; the analysis demonstrates the implications of each goal for how ontology mapping should be performed and how the mappings should be represented. We rigorously compare facilitating interoperability between ontologies to information integration in databases. Based on this comparison, class matching emerges as a critical part of facilitating interoperability. For class matching, we formalize various class similarity metrics and design an algorithm that utilizes them. We also experimentally evaluate the effectiveness of the class similarity metrics on real-world ontologies. Finally, to encode the correspondences between ontologies for interoperability, we develop a novel W3C-compliant representation, named the skeleton.
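For a flavor of what a class similarity metric can look like, here is a minimal sketch; plain Jaccard overlap of instance sets is an illustrative stand-in, not one of the metrics formalized in the dissertation:

```python
def jaccard_class_similarity(instances_a, instances_b):
    """Similarity of two ontology classes as the Jaccard overlap of their
    (canonicalized) instance sets; defined as 0.0 when both are empty."""
    a, b = set(instances_a), set(instances_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Example: two classes sharing 2 of 4 distinct instances score 0.5.
print(jaccard_class_similarity({"x", "y", "z"}, {"y", "z", "w"}))
```

In practice a class matcher would combine several such signals (instance overlap, property overlap, label similarity) before handing the scores to the matching algorithm.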