In today’s big data world, data is being produced in massive volumes, at great velocity, and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due to the several advantages provided by the Cloud such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments.

In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that uses workload-aware data placement and replication to minimize the number of distributed transactions. SWORD incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime.
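
As a rough illustration of why data placement matters here, the sketch below counts how many transactions in a workload touch more than one partition under two placements. It is a toy model only, not SWORD's algorithm: the function name `count_distributed`, the item names, and both placements are hypothetical.

```python
def count_distributed(transactions, placement):
    """Count transactions whose items live on more than one partition.

    transactions: iterable of sets of item keys (hypothetical workload trace)
    placement:    dict mapping each item key to a partition id
    """
    return sum(
        1
        for txn in transactions
        if len({placement[item] for item in txn}) > 1
    )


# Hypothetical workload: items a and b are usually accessed together, as are c and d.
workload = [{"a", "b"}, {"a", "b"}, {"c", "d"}, {"b", "c"}]

# A placement that ignores the workload (for example, round-robin by key).
oblivious = {"a": 0, "b": 1, "c": 0, "d": 1}

# A placement that co-locates items that are frequently accessed together.
workload_aware = {"a": 0, "b": 0, "c": 1, "d": 1}

oblivious_count = count_distributed(workload, oblivious)
aware_count = count_distributed(workload, workload_aware)
```

In this toy workload the oblivious placement makes every transaction distributed, while co-locating frequently co-accessed items leaves only the one transaction that genuinely spans both groups.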

In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics.
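
The idea of progressive samples with repeatable results and reuse of work can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the NOW! framework or its progress semantics: the function `progressive_means`, the seeded shuffle, and the choice of a running mean as the query are all hypothetical.

```python
import random


def progressive_means(values, fractions, seed=0):
    """Yield (sample_size, mean) over nested samples of increasing size.

    The samples are prefixes of one fixed, seeded shuffle, so each sample
    contains the previous one and a rerun with the same seed gives the
    same early results.
    """
    order = list(values)
    if not order:
        return
    random.Random(seed).shuffle(order)  # fixed seed, so samples are repeatable
    total, seen = 0.0, 0
    for frac in sorted(fractions):
        target = max(1, min(len(order), int(len(order) * frac)))
        # Only the rows added since the previous sample are scanned;
        # the running total carries earlier work forward.
        for v in order[seen:target]:
            total += v
        seen = max(seen, target)
        yield seen, total / seen


data = list(range(1, 101))
early_results = list(progressive_means(data, [0.1, 0.5, 1.0], seed=42))
```

Each tuple in `early_results` pairs a sample size with the estimate computed from that sample, which is a simple stand-in for the provenance of an early result; the last entry is the exact answer over the full data.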

Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud. The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models restrict the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them into distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud.
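
To make the neighborhood-centric model concrete, the single-machine sketch below selects the vertices of interest with a predicate, extracts each one's k-hop subgraph, and passes the whole subgraph to a user function. It is an assumption-laden illustration, not NSCALE's API: `k_hop_subgraph`, `run_on_neighborhoods`, and `edge_count` are hypothetical names, and the real system distributes this work across machines.

```python
from collections import deque


def k_hop_subgraph(adj, center, k):
    """Return the subgraph induced by all vertices within k hops of center.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    """
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand past the k-hop boundary
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Keep only the edges whose endpoints are both inside the neighborhood.
    return {u: {v for v in adj[u] if v in dist} for u in dist}


def run_on_neighborhoods(adj, select, k, user_fn):
    """Apply user_fn to the k-hop subgraph of every vertex chosen by select."""
    return {v: user_fn(v, k_hop_subgraph(adj, v, k)) for v in adj if select(v)}


def edge_count(center, subgraph):
    """Example user program: number of undirected edges in the subgraph."""
    return sum(len(nbrs) for nbrs in subgraph.values()) // 2


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
ego_sizes = run_on_neighborhoods(graph, lambda v: v in {"a", "e"}, 1, edge_count)
```

The user function sees an entire ego network at once, whereas a vertex-centric program would have to assemble the same information through rounds of message passing between individual vertices.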

The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.