Browsing by Author "Seo, Jinwook"
Now showing 1 - 13 of 13
Item Building a Coherent Data Pipeline in Microarray Data Analyses: Optimization of Signal/Noise Ratios Using an Interactive Visualization Tool and a Novel Noise Filtering Method (2003)(2005) Seo, Jinwook; Bakay, Marina; Chen, Yi-Wen; Hilmer, Sara; Shneiderman, Ben; Hoffman, Eric P.; ISR
Motivation: Sources of uncontrolled noise strongly influence data analysis in microarray studies, yet signal/noise ratios are rarely considered in microarray data analyses. We hypothesized that different research projects would have different sources and levels of confounding noise, and built an interactive visual analysis tool to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) versus noise (confounding uncontrolled variables).
Results: Five probe set algorithms were studied with and without statistical weighting of probe sets using Microarray Suite (MAS) 5.0 probe set detection p-values. The signal/noise optimization method was tested in two large novel microarray datasets with different levels of confounding noise: a 105-sample U133A human muscle biopsy data set (11 groups) (extensive noise), and a 40-sample U74A inbred mouse lung data set (8 groups) (little noise). Success was measured by the F-measure of unsupervised clustering into the appropriate biological groups (signal). We show that both probe set signal algorithm and probe set detection p-value weighting have a strong effect on signal/noise ratios, and that the different methods performed quite differently in the two data sets. Among the signal algorithms tested, the dChip difference model with p-value weighting was the most consistent at maximizing the effect of the target biological variables on data interpretation of the two data sets.
Availability: The Hierarchical Clustering Explorer 2.0 is available online (http://www.cs.umd.edu/hcil/hce/), and the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure is available upon request from the first author. Murine arrays (40 samples) are publicly available at the PEPR resource (http://microarray.cnmcresearch.org/pgadatatable.asp) (Chen et al., 2004).
Item In vivo filtering of in vitro MyoD target data: An approach for identification of biologically relevant novel downstream targets of transcription factors (2003)(2005) Zhao, Po; Seo, Jinwook; Wang, Zuyi; Wang, Yue; Shneiderman, Ben; Hoffman, Eric P.; ISR
We report a novel approach to identification of downstream targets of MyoD, where a published set of candidate targets from a well-controlled in vitro experiment [1] is filtered for relevance to muscle regeneration using a 27 time point in vivo murine regeneration series. Using both interactive hierarchical clustering (HCE) [2] and Bayes soft clustering (VISDA) [3,4], we show that only a minority of in vitro-defined candidates can be confirmed in vivo (~50% of induced transcripts, and none of repressed transcripts). 
The concordance of the in vitro and in vivo datasets, and of both the HCE and VISDA analytical techniques, showed strong support for 18 targets (13 novel) of MyoD that are biologically relevant during myoblast differentiation, including Cdh15, L-myc, Hes6, Stam, Tnnt2, Fyn, Rapsn, Nestin, Osp94, Pep4, Mef2a, Sh3glb1 and Rb1.
Item INFORMATION VISUALIZATION DESIGN FOR MULTIDIMENSIONAL DATA: INTEGRATING THE RANK-BY-FEATURE FRAMEWORK WITH HIERARCHICAL CLUSTERING (2005-04-20) Seo, Jinwook; Shneiderman, Ben; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Interactive exploration of multidimensional data sets is challenging because: (1) it is difficult to comprehend patterns in more than three dimensions, and (2) current systems are often a patchwork of graphical and statistical methods leaving many researchers uncertain about how to explore their data in an orderly manner. This dissertation offers a set of principles and a novel rank-by-feature framework that could enable users to better understand multidimensional and multivariate data by systematically studying distributions in one (1D) or two dimensions (2D), and then discovering relationships, clusters, gaps, outliers, and other features. Users of this rank-by-feature framework can view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections. This research provides a number of valuable contributions:
Graphics, Ranking, and Interaction for Discovery (GRID) principles - a set of principles for exploratory analysis of multidimensional data, which are summarized as: (1) study 1D, study 2D, then find features; (2) ranking guides insight, statistics confirm. GRID principles help users organize their discovery process in an orderly manner so as to produce more thorough analyses and extract deeper insights in any multidimensional data application.
Rank-by-feature framework - a user interface framework based on the GRID principles. Interactive information visualization techniques are combined with statistical methods and data mining algorithms to enable users to examine multidimensional data sets in an orderly manner using 1D and 2D projections.
The design and implementation of the Hierarchical Clustering Explorer (HCE), an information visualization tool available at www.cs.umd.edu/hcil/hce. HCE implements the rank-by-feature framework and supports interactive exploration of hierarchical clustering results to reveal one of the important features - clusters.
Validation through case studies and user surveys: Case studies with motivated experts in three research fields and a user survey via emails to a wide range of HCE users demonstrated the efficacy of HCE and the rank-by-feature framework. These studies also revealed potential improvement opportunities in terms of design and implementation.
Item Interactive Color Mosaic and Dendrogram Displays for Signal/Noise Optimization in Microarray Data Analysis (2003)(2005) Seo, Jinwook; Bakay, Marina; Zhao, Po; Chen, Yi-Wen; Clarkson, Priscilla; Shneiderman, Ben; Hoffman, Eric P.; ISR
Data analysis and visualization are strongly influenced by noise and noise filters. There are multiple sources of noise in microarray data analysis, but signal/noise ratios are rarely optimized, or even considered. Here, we report a noise analysis of a novel 13 million oligonucleotide dataset: 25 human U133A (~500,000 features) profiles of patient muscle biopsies. 
We use our recently described interactive visualization tool, the Hierarchical Clustering Explorer (HCE), to systematically address the effect of different noise filters on resolution of arrays into correct biological groups (unsupervised clustering into three patient groups of known diagnosis). We varied probe set interpretation methods (MAS 5.0, RMA), present call filters, and clustering linkage methods, and investigated the results in HCE. HCE interactive features enabled us to quickly see the impact of these three variables. Dendrogram displays showed the clustering results systematically, and color mosaic displays provided visual support for the results. We show that each of these three variables has a strong effect on unsupervised clustering. For this dataset, the strength of the biological variable was maximized, and noise minimized, using MAS 5.0, 10% present call filter, and Average Group Linkage. We propose a general method of using interactive tools to identify the optimal signal/noise balance or the optimal combination of these three variables to maximize the effect of the desired biological variable on data interpretation.
Item Interactive Exploration of Multidimensional Microarray Data: Scatterplot Ordering, Gene Ontology Browser, and Profile Search (2003)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Multidimensional data sets are common in many research areas, including microarray experiment data sets. Genome researchers are using cluster analysis to find meaningful groups in microarray data. However, the high dimensionality of the data sets hinders users from finding interesting patterns, clusters, and outliers. Determining the biological significance of such features remains problematic due to the difficulties of integrating biological knowledge. 
In addition, it is not efficient to perform a cluster analysis over the whole data set in cases where researchers know the approximate temporal pattern of the gene expression that they are seeking. To address these problems, we add three new features to the Hierarchical Clustering Explorer (HCE): (1) scatterplot ordering methods so that all 2D projections of a high dimensional data set can be ordered according to relevant criteria, (2) a gene ontology browser, coupled with clustering results so that known gene functions within a cluster can be easily studied, (3) a profile search so that genes with a certain temporal pattern can be easily identified.
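The scatterplot ordering idea in the abstract above can be pictured with a small sketch. It is not HCE's implementation: the ranking criterion (absolute Pearson correlation), the function names `pearson` and `rank_projections`, and the sample data are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has no defined correlation; score it as 0.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def rank_projections(columns):
    """Order every 2D projection (pair of columns) by |correlation|, highest first.

    The criterion is an assumed example; any other score could be swapped in.
    """
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical data: three dimensions measured over four items.
columns = {
    "t0": [1.0, 2.0, 3.0, 4.0],
    "t1": [2.0, 4.0, 6.0, 8.0],
    "t2": [4.0, 1.0, 3.0, 2.0],
}
ranking = rank_projections(columns)
for score, a, b in ranking:
    print(f"{a} vs {b}: {score:.2f}")
```

With n dimensions there are n(n-1)/2 such projections, which is why an ordering by some criterion is useful before inspecting the scatterplots one by one.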
Item Knowledge Discovery in High Dimensional Data: Case Studies and a User Survey for an Information Visualization Tool (2005) Seo, Jinwook; Shneiderman, Ben; ISR
Knowledge discovery in high dimensional data is a challenging enterprise, but new visual analytic tools appear to offer users remarkable powers if they are ready to learn new concepts and interfaces. Our 3-year effort to develop versions of the Hierarchical Clustering Explorer (HCE) began with building an interactive tool for exploring clustering results. It expanded, based on user needs, to include other potent analytic and visualization tools for multivariate data, especially the rank-by-feature framework. Our own successes using HCE provided some testimonial evidence of its utility, but we felt it necessary to get beyond our subjective impressions. This paper presents an evaluation of the Hierarchical Clustering Explorer (HCE) using three case studies and an email user survey (n=57) to focus on skill acquisition with the novel concepts and interface for the rank-by-feature framework. Knowledgeable and motivated users in diverse fields provided multiple perspectives that refined our understanding of strengths and weaknesses. A user survey confirmed the benefits of HCE, but gave less guidance about improvements. Both evaluations suggested improved training methods.
Item A Knowledge Integration Framework for Information Visualization (2004)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Users can better understand complex data sets by combining insights from multiple coordinated visual displays that include relevant domain knowledge. When dealing with multidimensional data and clustering results, the most familiar and comprehensible displays are 1- and 2-dimensional projections (histograms and scatterplots). Other easily understood displays of domain knowledge are tabular and hierarchical information for the same or related data sets. 
The novel parallel coordinates view [6], powered by a direct-manipulation search, offers strong advantages, but requires some training for most users. We provide a review of related work in the area of information visualization, and introduce new tools and interaction examples on how to incorporate users' domain knowledge for understanding clustering results. Our examples present hierarchical clustering of gene expression data, coordinated with a parallel coordinates view and with the gene annotation and gene ontology.
Item A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data (2004)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Interactive exploration of multidimensional data sets is challenging because: (1) it is difficult to comprehend patterns in more than three dimensions, and (2) current systems often are a patchwork of graphical and statistical methods leaving many researchers uncertain about how to explore their data in an orderly manner. We offer a set of principles and a novel rank-by-feature framework that could enable users to better understand distributions in one (1D) or two dimensions (2D), and then discover relationships, clusters, gaps, outliers, and other features. Users of our framework can view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections. We summarize our Graphics, Ranking, and Interaction for Discovery (GRID) principles as: (1) 1D, 2D, then features; (2) graphics, ranking, summaries, then statistics. 
We implemented the rank-by-feature framework in the Hierarchical Clustering Explorer, but the same data exploration principles could enable users to organize their discovery process so as to produce more thorough analyses and extract deeper insights in any multidimensional data application, such as spreadsheets, statistical packages, or information visualization tools.
Item A Rank-by-Feature Framework for Unsupervised Multidimensional Data Exploration Using Low Dimensional Projections (2004)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Exploratory analysis of multidimensional data sets is challenging because of the difficulty in comprehending more than three dimensions. Two fundamental statistical principles for the exploratory analysis are (1) to examine each dimension first and then find relationships among dimensions, and (2) to try graphical displays first and then find numerical summaries [1]. We implement these principles in a novel conceptual framework called the rank-by-feature framework. In the framework, users can choose a ranking criterion interesting to them and sort 1D or 2D axis-parallel projections according to the criterion. We introduce the rank-by-feature prism, a color-coded lower-triangular matrix that guides users to desired features. Statistical graphs (histogram, boxplot, and scatterplot) and information visualization techniques (overview, coordination, and dynamic query) are combined to help users effectively traverse 1D and 2D axis-parallel projections, and finally to help them interactively find interesting features.
Item Understanding Clusters in Multidimensional Spaces: Making Meaning by Combining Insights from Coordinated Views of Domain Knowledge (2004)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Cluster analysis of multidimensional data is widely used in many research areas including financial, economic, sociological, and biological analyses. 
Finding natural subclasses in a data set not only reveals interesting patterns but also serves as a basis for further analyses. One of the troubles with cluster analysis is that evaluating how interesting a clustering result is to researchers is subjective, application-dependent, and even difficult to measure. This problem generally gets worse as dimensionality and the number of items grow. The remedy is to enable researchers to apply domain knowledge to facilitate insight about the significance of the clustering result. This article presents a way to better understand a clustering result by combining insights from two interactively coordinated visual displays of domain knowledge. The first is a parallel coordinates view powered by a direct-manipulation search. The second is a domain knowledge view containing well-understood and meaningful tabular or hierarchical information for the same data set. Our examples depend on hierarchical clustering of gene expression data, coordinated with a parallel coordinates view and with the gene annotation and gene ontology.
Item Understanding Hierarchical Clustering Results by Interactive Exploration of Dendrograms: A Case Study with Genomic Microarray Data (2003-01-21) Seo, Jinwook; Shneiderman, Ben
Abstract: Hierarchical clustering is widely used to find patterns in multi-dimensional datasets, especially for genomic microarray data. Finding groups of genes with similar expression patterns can lead to better understanding of the functions of genes. Early software tools produced only printed results, while newer ones enabled some online exploration. 
We describe four general techniques that could be used in interactive explorations of clustering algorithms: (1) overview of the entire dataset, coupled with a detail view so that high-level patterns and hot spots can be easily found and examined, (2) dynamic query controls so that users can restrict the number of clusters they view at a time and show those clusters more clearly, (3) coordinated displays: the overview mosaic has a bi-directional link to 2-dimensional scattergrams, (4) cluster comparisons to allow researchers to see how different clustering algorithms group the genes. (UMIACS-TR-2002-50) (HCIL-TR-2002-10)
Item Understanding Hierarchical Clustering Results by Interactive Exploration of Dendrograms: A Case Study with Genomic Microarray Data (2002)(2005) Seo, Jinwook; Shneiderman, Ben; ISR
Hierarchical clustering is widely used to find patterns in multi-dimensional datasets, especially for genomic microarray data. Finding groups of genes with similar expression patterns can lead to better understanding of the functions of genes. Early software tools produced only printed results, while newer ones enabled some online exploration. We describe four general techniques that could be used in interactive explorations of clustering algorithms: (1) overview of the entire dataset, coupled with a detail view so that high-level patterns and hot spots can be easily found and examined, (2) dynamic query controls so that users can restrict the number of clusters they view at a time and show those clusters more clearly, (3) coordinated displays: the overview mosaic has a bi-directional link to 2-dimensional scattergrams, (4) cluster comparisons to allow researchers to see how different clustering algorithms group the genes.
Item Using Categorical Information in Multidimensional Data Sets: Interactive Partition and Cluster Comparison (2005) Seo, Jinwook; Shneiderman, Ben; ISR
Multidimensional data sets often include categorical information. 
When most columns have categorical information, clustering the data set by similarity of categorical values can reveal interesting patterns in the data set. However, when the data set includes only a small number (one or two) of categorical columns, the categorical information is probably more useful as a way to partition the data set. For example, researchers might be interested in gene expression data for healthy vs. diseased patients or stock performance for common, preferred, or convertible shares. For these cases, we present a novel way to utilize the categorical information together with clustering algorithms. Instead of incorporating categorical information into the clustering process, we can partition the data set according to categorical information. Clustering is then performed with each subset to generate two or more clustering results, each of which is homogeneous (i.e. only includes the same categorical value for the categorical column). By comparing the partitioned clustering results, users can get meaningful insights into the data set: users can identify an interesting group of items that are differentially/similarly expressed in two different homogeneous partitions. The partition can be done in two different directions: (1) by rows if categorical information is available for each column (e.g. some columns are from disease samples and other columns are from healthy samples) or (2) by a column if a column contains categorical information (e.g. a column represents a categorical attribute such as colors or sex). We designed and implemented an interface to facilitate this interactive partition-based clustering results comparison. Coordination between clustering results displays and comparison results overview enables users to identify interesting clusters, and a simple grid display clearly reveals correspondence between two clusters.
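The partition-then-compare workflow described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' interface or algorithm: the helper names are hypothetical, the data are made up, and `toy_cluster` is a deliberately trivial stand-in for a real clustering algorithm such as hierarchical clustering.

```python
from collections import Counter


def partition_columns(matrix, column_labels):
    """Split each row's values into one sub-matrix per column category."""
    parts = {}
    for label in dict.fromkeys(column_labels):  # unique labels, first-seen order
        idx = [i for i, l in enumerate(column_labels) if l == label]
        parts[label] = {row: [vals[i] for i in idx] for row, vals in matrix.items()}
    return parts


def toy_cluster(sub_matrix):
    """Placeholder clustering (an assumption): group rows by the sign of
    their mean value. A real analysis would use a proper algorithm."""
    return {
        row: ("high" if sum(vals) / len(vals) > 0 else "low")
        for row, vals in sub_matrix.items()
    }


def overlap_grid(clusters_a, clusters_b):
    """Count items shared by each (cluster in A, cluster in B) pair."""
    return Counter((clusters_a[row], clusters_b[row]) for row in clusters_a)


# Hypothetical expression values: rows are genes, columns are samples.
matrix = {
    "g1": [1.0, 2.0, -1.0, -2.0],
    "g2": [1.5, 0.5, 1.0, 2.0],
    "g3": [-1.0, -0.5, -2.0, -1.0],
    "g4": [-2.0, -1.0, 1.0, 3.0],
}
sample_labels = ["healthy", "healthy", "disease", "disease"]

parts = partition_columns(matrix, sample_labels)
grid = overlap_grid(toy_cluster(parts["healthy"]), toy_cluster(parts["disease"]))
for (healthy_cluster, disease_cluster), count in sorted(grid.items()):
    print(f"healthy={healthy_cluster:4} disease={disease_cluster:4} items={count}")
```

Cells off the diagonal of such a grid (here, genes that are "high" in one partition and "low" in the other) correspond to the differentially expressed groups the abstract mentions; cells on the diagonal correspond to similarly expressed groups.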