Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item VISUAL ANALYTICS FOR OPEN-ENDED TASKS IN TEXT MINING (2018) Park, Deokgun; Elmqvist, Niklas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
An overview of documents built with topic modeling and multidimensional scaling is helpful in understanding topic distribution. While we can spot clusters visually, it is challenging to characterize them. My research investigates an interactive method to identify clusters by assigning attributes and examining the resulting distributions. ParallelSpaces examines the understanding of topic modeling applied to Yelp business reviews, where businesses and their reviews each constitute a separate visual space. Exploring these spaces enables the characterization of each space using the other. However, the scatterplot-based approach in ParallelSpaces does not generalize to categorical variables due to overplotting. My research proposes an improved layout algorithm for those cases in our follow-up work, Gatherplots, which eliminates overplotting in scatterplots while maintaining individual objects. Another limitation in clustering methods is the fixed number of clusters as a hyperparameter. TopicLens is a Magic Lens-type interaction technique, where the documents under the lens are clustered according to topics in real time. While ParallelSpaces helps characterize the clusters, the attributes are sometimes limited. To extend the analysis with a custom mixture of attributes, we present CommentIQ, a comment moderation tool in which moderators can adjust model parameters according to the context or goals. To help users analyze documents semantically, we develop a technique for user-driven text mining by building a dictionary for topics or concepts in a follow-up study, ConceptVector, which uses word embedding to generate dictionaries interactively and uses those dictionaries to analyze the documents.
My dissertation contributes interactive methods for overviewing documents that integrate the user into text mining loops that are currently non-interactive. The case studies we present in this dissertation provide concrete and operational techniques for directly improving several state-of-the-art text mining algorithms. We summarize those generalizable lessons and discuss the limitations of the visual analytics approach.
Item A Visual Analytics Approach to Comparing Cohorts of Event Sequences (2016) Malik, Sana; Shneiderman, Ben; Plaisant, Catherine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Sequences of timestamped events are currently being generated across nearly every domain of data analytics, from e-commerce web logging to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon. To further uncover how these processes work the way they do, researchers often compare two groups, or cohorts, of event sequences to find the differences and similarities between outcomes and processes. With temporal event sequence data, this task is complex because of the variety of ways single events and sequences of events can differ between the two cohorts of records: the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), or metrics about the timestamps themselves (e.g., duration of an event). Running statistical tests to cover all these cases and determining which results are significant becomes cumbersome. Current visual analytics tools for comparing groups of event sequences emphasize a purely statistical or purely visual approach for comparison.
Visual analytics tools leverage humans' ability to easily see patterns and anomalies that they were not expecting, but are limited by uncertainty in findings. Statistical tools emphasize finding significant differences in the data, but often require researchers to have a concrete question and do not facilitate more general exploration of the data. Combining visual analytics tools with statistical methods leverages the benefits of both approaches for quicker and easier insight discovery. Integrating statistics into a visualization tool presents many challenges on the frontend (e.g., displaying the results of many different metrics concisely) and on the backend (e.g., scalability challenges with running various metrics on multi-dimensional data at once). I begin by exploring the problem of comparing cohorts of event sequences and understanding the questions that analysts commonly ask in this task. From there, I demonstrate that combining automated statistics with an interactive user interface amplifies the benefits of both types of tools, thereby enabling analysts to conduct quicker and easier data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are: (1) a taxonomy of metrics for comparing cohorts of temporal event sequences, (2) a statistical framework for exploratory data analysis with a method I refer to as high-volume hypothesis testing (HVHT), (3) a family of visualizations and guidelines for interaction techniques that are useful for understanding and parsing the results, and (4) a user study, five long-term case studies, and five short-term case studies which demonstrate the utility and impact of these methods in various domains: four in the medical domain, one in web log analysis, two in education, and one each in social networks, sports analytics, and security.
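As a rough illustration of what running many tests across two cohorts can involve, the following Python sketch tests, for every event type, whether its prevalence differs between two cohorts. It is not the dissertation's HVHT method: the choice of a two-proportion z-test, the Bonferroni correction, the function names, and the toy records are all assumptions made for this example, and timestamps are ignored.

```python
import math


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # the event occurs in every record or in none: no difference
    z = (hits_a / n_a - hits_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def prevalence_tests(cohort_a, cohort_b, alpha=0.05):
    """Test each event type's prevalence difference between two non-empty cohorts.

    Returns (event, p_value, significant) tuples sorted by p-value. The
    significance threshold is Bonferroni-corrected (an assumed choice).
    """
    event_types = sorted({e for record in cohort_a + cohort_b for e in record})
    threshold = alpha / len(event_types)
    results = []
    for event in event_types:
        hits_a = sum(event in record for record in cohort_a)
        hits_b = sum(event in record for record in cohort_b)
        p = two_proportion_p_value(hits_a, len(cohort_a), hits_b, len(cohort_b))
        results.append((event, p, p < threshold))
    return sorted(results, key=lambda r: r[1])


# Hypothetical records: each record is the list of event names it contains.
cohort_a = [["admit", "icu", "discharge"]] * 40 + [["admit", "discharge"]] * 10
cohort_b = [["admit", "icu", "discharge"]] * 10 + [["admit", "discharge"]] * 40
results = prevalence_tests(cohort_a, cohort_b)
for event, p, significant in results:
    print(f"{event:10s} p={p:.3g} significant={significant}")
```

Sorting by p-value puts the most distinguishing event types first, which is the kind of ranking an interactive interface could then let an analyst browse and filter.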
My dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.
Item Measuring and improving the readability of network visualizations (2013) Dunne, Cody; Shneiderman, Ben A; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Network data structures have been used extensively for modeling entities and their ties across such diverse disciplines as Computer Science, Sociology, Bioinformatics, Urban Planning, and Archeology. Analyzing networks involves understanding the complex relationships between entities as well as any attributes, statistics, or groupings associated with them. The widely used node-link visualization excels at showing the topology, attributes, and groupings simultaneously. However, many existing node-link visualizations are difficult to extract meaning from because of (1) the inherent complexity of the relationships, (2) the number of items designers try to render in limited screen space, and (3) the many potentially unintelligible or even misleading visualizations that exist for every network. Automated layout algorithms have helped, but frequently generate ineffective visualizations even when used by expert analysts.
Past work, including my own described herein, has shown there can be vast improvements in network visualizations, but no one can yet produce readable and meaningful visualizations for all networks. Since there is no single way to visualize all networks effectively, in this dissertation I investigate three complementary strategies. First, I introduce a technique called motif simplification that leverages the repeating patterns or motifs in a network to reduce visual complexity. I replace common, high-payoff motifs with easily understandable glyphs that require less screen space, can reveal otherwise hidden relationships, and improve user performance on many network analysis tasks. Next, I present new Group-in-a-Box layouts that subdivide large, dense networks using attribute- or topology-based groupings. These layouts take group membership into account to more clearly show the ties within groups as well as the aggregate relationships between groups. Finally, I develop a set of readability metrics to measure visualization effectiveness and localize areas needing improvement. I detail optimization recommendations for specific user tasks, in addition to leveraging the readability metrics in a user-assisted layout optimization technique. This dissertation contributes an understanding of why some node-link visualizations are difficult to read, what measures of readability could help guide designers and users, and several promising strategies for improving readability which demonstrate that progress is possible. This work also opens several avenues of research, both technical and in user education.
Item Interactive Exploration of Temporal Event Sequences (2012) Wongsuphasawat, Krist; Shneiderman, Ben A; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Life can often be described as a series of events.
These events contain rich information that, when put together, can reveal history, expose facts, or lead to discoveries. Therefore, many leading organizations are increasingly collecting databases of event sequences: Electronic Medical Records (EMRs), transportation incident logs, student progress reports, web logs, sports logs, etc. Heavy investments were made in data collection and storage, but difficulties still arise when it comes to making use of the collected data. Analyzing millions of event sequences is a non-trivial task that is gaining more attention and requires better support due to its complex nature. Therefore, I aimed to use information visualization techniques to support exploratory data analysis (an approach to analyzing data to formulate hypotheses worth testing) for event sequences. By working with the domain experts who were analyzing event sequences, I identified two important scenarios that guided my dissertation. First, I explored how to provide an overview of multiple event sequences. Lengthy reports often have an executive summary to provide an overview of the report. Unfortunately, there was no executive summary to provide an overview for event sequences. Therefore, I designed LifeFlow, a compact overview visualization that summarizes multiple event sequences, and interaction techniques that support users' exploration. Second, I examined how to support users in querying for event sequences when they are uncertain about what they are looking for. To support this task, I developed similarity measures (the M&M measure 1-2) and user interfaces (Similan 1-2) for querying event sequences based on similarity, allowing users to search for event sequences that are similar to the query. After that, I ran a controlled experiment comparing exact match and similarity search interfaces, and learned the advantages and disadvantages of both interfaces.
These lessons learned inspired me to develop Flexible Temporal Search (FTS), which combines the benefits of both interfaces. FTS gives confident and countable results, and also ranks results by similarity. I continued to work with domain experts as partners, getting them involved in the iterative design, and constantly using their feedback to guide my research directions. As the research progressed, several short-term user studies were conducted to evaluate particular features of the user interfaces. Both quantitative and qualitative results were reported. To address the limitations of short-term evaluations, I included several multi-dimensional in-depth long-term case studies with domain experts in various fields to evaluate deeper benefits, validate generalizability of the ideas, and demonstrate practicability of this research in non-laboratory environments. The experience from these long-term studies was combined into a set of design guidelines for temporal event sequence exploration. My contributions from this research are LifeFlow, a visualization that compactly displays summaries of multiple event sequences, along with interaction techniques for users' explorations; similarity measures (the M&M measure 1-2) and similarity search interfaces (Similan 1-2) for querying event sequences; Flexible Temporal Search (FTS), a hybrid query approach that combines the benefits of exact match and similarity search; and case study evaluations that result in a process model and a set of design guidelines for temporal event sequence exploration.
Finally, this research has revealed new directions for exploring event sequences.
Item Understanding Scientific Literature Networks: Case Study Evaluations of Integrating Visualizations and Statistics (2011) Gove, Robert Paul; Shneiderman, Ben; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Investigators frequently need to quickly learn new research domains in order to advance their research. This thesis presents five contributions to understanding how software helps researchers explore scientific literature networks. (1) A taxonomy which summarizes capabilities in existing bibliography tools, revealing patterns of capabilities by system type. (2) Six participants in two user studies evaluate Action Science Explorer (ASE), which is designed to create surveys of scientific literature and integrates visualizations and statistics. Users found document-level statistics and attribute rankings to be convenient when beginning literature exploration. (3) User studies also identify users' questions when exploring academic literature, which include examining the evolution of a field, identifying author relationships, and searching for review papers. (4) The evaluations suggest shortcomings of ASE, and this thesis outlines improvements to ASE and lists user requirements for bibliographic exploration. (5) I recommend strategies for evaluating bibliographic exploration tools based on experiences evaluating ASE.
Item Visualizing & Exploring Networks Using Semantic Substrates (2008-08-18) Aris, Aleks; Shneiderman, Ben; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Visualizing and exploring network data has been a challenging problem for HCI (Human-Computer Interaction) Information Visualization researchers due to the complexity of representing networks (graphs).
Research in this area has concentrated on improving the visual organization of nodes and links according to graph drawing aesthetics criteria, such as minimizing link crossings and the longest link length. Semantic substrates offer a different approach by which node locations represent node attributes. Users define semantic substrates for a given dataset according to the dataset characteristics and the questions, needs, and tasks of users. The substrates are typically 2-5 non-overlapping rectangular regions that meaningfully lay out the nodes of the network, based on the node attributes. Link visibility filters are provided to enable users to limit link visibility to those within or across regions. The reduced clutter and visibility of only selected links are designed to help users find meaningful relationships. This dissertation presents 5 detailed case studies (3 long-term and 2 short-term) that report on sessions with professional users working on their own datasets using successive versions of the NVSS (Network Visualization by Semantic Substrates, http://www.cs.umd.edu/hcil/nvss) software tool. Applications include legal precedent (with court cases citing one another), food-web (predator-prey relationships) data, scholarly paper citations, and U.S. Senate voting patterns. These case studies, which had networks of up to 4,296 nodes and 16,385 links, helped refine NVSS and the semantic substrate approach, as well as understand its limitations. The case study approach enabled users to gain insights and form hypotheses about their data, while providing guidance for NVSS revisions. The proposed guidelines for semantic substrate definitions are potentially applicable to other datasets such as social networks, business networks, and email communication. NVSS appears to be an effective tool because it offers a user-controlled and understandable method of exploring networks.
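The core idea described above (node attributes determine node locations inside non-overlapping rectangular regions, and links are shown or hidden by region pair) can be sketched in a few lines of Python. This is only an illustrative sketch, not the NVSS implementation: the `Region` class, the function names, the placement rule (x scaled from a numeric attribute, y spread evenly), and the toy court-case data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular region; regions are assumed not to overlap."""
    name: str
    x: float
    y: float
    width: float
    height: float


def layout_nodes(nodes, regions, region_attr, x_attr):
    """Place each node in the region named by its `region_attr` value.

    Inside a region, x is scaled from the numeric `x_attr` and y spreads
    the nodes evenly (an assumed placement rule).
    """
    positions = {}
    for region in regions:
        members = [n for n in nodes if n[region_attr] == region.name]
        if not members:
            continue
        lo = min(n[x_attr] for n in members)
        hi = max(n[x_attr] for n in members)
        span = (hi - lo) or 1  # avoid dividing by zero when all values match
        for i, node in enumerate(sorted(members, key=lambda n: n["id"])):
            x = region.x + region.width * (node[x_attr] - lo) / span
            y = region.y + region.height * (i + 1) / (len(members) + 1)
            positions[node["id"]] = (x, y)
    return positions


def visible_links(links, nodes, region_attr, allowed_pairs):
    """Keep links whose (source region, target region) pair is enabled."""
    region_of = {n["id"]: n[region_attr] for n in nodes}
    return [(s, t) for s, t in links
            if (region_of[s], region_of[t]) in allowed_pairs]


# Hypothetical data: court cases citing one another.
nodes = [
    {"id": "A", "court": "supreme", "year": 1978},
    {"id": "B", "court": "supreme", "year": 2002},
    {"id": "C", "court": "circuit", "year": 1990},
]
regions = [Region("supreme", 0, 0, 400, 100), Region("circuit", 0, 120, 400, 100)]
links = [("B", "A"), ("C", "A"), ("C", "B")]

positions = layout_nodes(nodes, regions, "court", "year")
within_supreme = visible_links(links, nodes, "court", {("supreme", "supreme")})
circuit_to_supreme = visible_links(links, nodes, "court", {("circuit", "supreme")})
```

Enabling one region pair at a time, as in the last two calls, mirrors the link visibility filters: only citations within the supreme-court region, or only citations from circuit cases to supreme-court cases, are drawn.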
The main contributions of this dissertation include the extensive exploration of semantic substrates, implementation of software to define substrates, guidelines to design good substrates, and case studies to illustrate the applicability of the approach to various domains and its benefits.
Item Integrating Statistics and Visualization to Improve Exploratory Social Network Analysis (2008-08-21) Perer, Adam; Shneiderman, Ben; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Social network analysis is emerging as a key technique for understanding social, cultural and economic phenomena. However, social network analysis is inherently complex since analysts must understand every individual's attributes as well as relationships between individuals. There are many statistical algorithms which reveal nodes that occupy key social positions and form cohesive social groups. However, it is difficult to find outliers and patterns in strictly quantitative output. In these situations, information visualizations can enable users to make sense of their data, but typical network visualizations are often hard to interpret because of overlapping nodes and tangled edges. My first contribution improves the process of exploratory social network analysis. I have designed and implemented a novel social network analysis tool, SocialAction (http://www.cs.umd.edu/hcil/socialaction), that integrates both statistics and visualizations to enable users to quickly derive the benefits of both. Statistics are used to detect important individuals, relationships, and clusters. Instead of a tabular display of numbers, the results are integrated with a network visualization in which users can easily and dynamically filter nodes and edges. The visualizations simplify the statistical results, facilitating sensemaking and discovery of features such as distributions, patterns, trends, gaps and outliers.
The statistics simplify the comprehension of a sometimes chaotic visualization, allowing users to focus on statistically significant nodes and edges. SocialAction was also designed to help analysts explore non-social networks, such as citation, communication, financial and biological networks. My second contribution extends lessons learned from SocialAction and provides design guidelines for interactive techniques to improve exploratory data analysis. A taxonomy of seven interactive techniques is augmented with computed attributes from statistics and data mining to improve information visualization exploration. Furthermore, systematic yet flexible design goals are provided to help guide domain experts through complex analysis over days, weeks and months. My third contribution demonstrates the effectiveness of long-term case studies with domain experts to measure creative activities of information visualization users. Evaluating information visualization tools is problematic because controlled studies may not effectively represent the workflow of analysts. Discoveries occur over weeks and months, and exploratory tasks may be poorly defined. To capture authentic insights, I designed an evaluation methodology that used structured and replicated long-term case studies. The methodology was applied with unique domain experts, whose case studies demonstrated the effectiveness of integrating statistics and visualization.