Thumbnail Image


Publication or External Link






The "data explosion'' since the era of the Internet has increased data size tremendously, from several hundred Megabytes to millions of Terabytes. Large amounts of data may not fit into memory, and a proper way of handling and processing the data is necessary. Besides, analyses of such large scale data requires complex and time consuming algorithms. On the other hand, humans play an important role in steering and driving the data analysis, while there are often times when people have a hard time getting an overview of the data or knowing which analysis to run. Sometimes they may not even know where to start. There is a huge gap between the data and understanding.

An intuitive way to facilitate data analysis is to visualize it. Visualization is understandable and illustrative, while using it to support fast and rapid data exploration of large scale datasets has been a challenge for a long time. In this dissertation, we aim to facilitate efficient

visual data exploration of large scale datasets from two perspectives: efficiency and interaction. The former indicates how users could understand the data efficiently, this depends on various factors, such as how fast data is processed and how data is presented, while the latter focuses more on the users: how they deal with the data and why they interact with the system in a particular way.

In order to improve the efficiency of data exploration, we have looked into two steps in the visualization pipeline: rendering and processing (computations). We first address visualization rendering of large dataset through a thorough evaluation of web-based visualization performance. We evaluate and understand the page loading effects of Scalable Vector Graphics (SVG), a popular image format for interactive visualization on the web browsers. To understand the scalability of individual elements in SVG based visualization, we conduct performance tests on different types of charts, in different phases of rendering process. From the results, we have figured out optimization techniques and guidelines to achieve better performance when rendering SVG visualization.

Secondly, we present a pure browser based distributed computing framework (VisHive) that exploits computational power from co-located idle devices for visualization. The VisHive framework speeds up web-based visualization, which is originally designed for single computer and cannot make use of additional computational resources on the client side. It takes advantage of multiple devices that today's users often have access to. VisHive constructs visualization applications that can transparently connect multiple devices into an ad-hoc cluster for local computation. It requires no specific software to be downloaded for setup.

To achieve a more interactive data analysis process, we first propose a proactive visual analytics system (DataSite) that enable users to analyze the data smoothly with a list of pre-defined algorithms. DataSite provides results through selecting and executing computations using automatic server-side computation. It utilizes computational resources exhaustively during data analysis to reduce the burden of human thinking. Analyzing results identified by these background processes are surfaced as status updates in a feed on the front-end, akin to posts in a social media feed. DataSite effectively turns data analysis into a conversation between the user and the computer, thereby reducing the cognitive load and domain knowledge requirements on users.

Next we apply the concept of proactive data analysis to genomic data, and explore how to improve data analysis through adaptive computations in bioinformatics domain. We build Epiviz Feed, a web application that supports proactive visual and statistical analysis of genomic data. It addresses common and popular biological questions that may be asked by the analyst, and shortens the time of processing and analyzing the data with automatic computations.

We further present a computational steering mechanism for visual analytics that prioritizes computations performed on the dataset leveraging the analyst's navigational behavior in the data. The web-based system, called Sherpa, provides computational modules for genomic data analysis, where independent algorithms calculate test statistics relevant to biological inferences about gene regulation in various tumor types and their corresponding normal tissues.