High-Throughput Network Distance Computations for Spatial Analytics Inside Any Store
MetadataShow full item record
In the past decades, shortest distance computation methods for road networks have been developed that focus on how to speed up the latency of a single source-target pair distance query. Large analytic applications on road networks including simulations (e.g., evacuation planning), logistics, location-based advertisement, and transportation planning require methods that provide high throughput (i.e., distance computations per second) and the ability to "scale out" by using large distributed computing clusters. Although decreasing the latency time for one source-target query results in reducing the total response time for a spatial analytic query, it is far from enough since these methods don’t take into account considerations such as cache results, query optimization, multi-threads, distributed systems, etc. This thesis broadly expands on the use of the distance oracle on road networks to achieve above goals. In the first part, we present a new framework termed the All-Store Distance Oracle (ASDO) for large road networks and shows how to efficiently compute it for any large road network in a distributed cluster. The ASDO representation is a well-separated pair decomposition (WSPD) of a road network using network distance instead of Euclidean distance. The ASDO representation benefits from the small size of the WSPD which enables the ASDO representation to answer ε-approximate network distance queries in a high-throughput rate and can be easily embedded within any database system including RDBMS, Column-oriented DBMS, and key-value stores. Experimental results show that the ASDO representation of the USA road network can be computed in a few hours using a modest size cluster. In comparison, previous database-centric approaches either do not scale to large road networks or are several orders of magnitude slower than the proposed ASDO for spatial queries. In the second part , we show how useful the ASDO representation is in real applications evaluating two proposed architectures on a variety of spatial analytic queries in common use such as KNN, distance matrix, and trajectory queries. One architecture is our ASDO representation embedded in PostgreSQL, and the other one is a widely used hybrid architecture in industry. Embedding the ASDO representation inside PostgreSQL supports the performance of complex analytic queries on road networks using standard SQL. This makes the results of ASDO simple to use, yet considerably expressive, compared to traditional methods that require extensive development effort. Experimental results indicate that our ASDO architecture within PostgreSQL can compute more than 60K road distance operations per second on a large road network (e.g., USA), which achieves 20× more throughput compared to the state-of-the-art shortest distance computation methods. In the third part, as some applications require the ability to scale out on large distributed computing clusters, a framework called SPDO is presented which implements an extremely fast distributed algorithm for computing spatial analytic queries on Apache Spark. The approach extends the ASDO representation which has now been adapted to use Spark’s resilient distributed dataset (RDD). SPDO improves the throughput by at least two orders of magnitude, which makes the approach suitable for applications that need to compute millions of network distances per second. Interviews with tens of related companies whom we deemed to be needy of performing some analytic queries on road networks led us to observe that they are usually concentrated in a local area spanning several cities, and need a high-throughput solution such as performing millions of shortest distance computations per second. In the forth part, we first demonstrate a solution, termed City Distance Oracles (CDO) to achieve as many as 7 million shortest distance computations per second per commodity machine on a city road network. Next, we extend CDO to yield a new distance oracle system (DOS) for general road networks. It can solve most spatial analytic queries, and its throughput achieves 5M distance computations per second even on the whole USA road network. In addition, a 10K × 10K origin-distance (OD) matrix can be computed in 20 seconds.