Computer Science Research Works
Browsing Computer Science Research Works by Title
Now showing 1 - 20 of 153
Item: A Grammar-Based Approach for Applying Visualization Taxonomies to Interaction Logs (Wiley, 2022-07-29)
Gathani, Sneha; Monadjemi, Shayan; Ottley, Alvitta; Battle, Leilani

Researchers collect large amounts of user interaction data with the goal of mapping users' workflows and behaviors to their high-level motivations, intuitions, and goals. Although the visual analytics community has proposed numerous taxonomies to facilitate this mapping process, no formal methods exist for systematically applying these existing theories to user interaction logs. This paper seeks to bridge the gap between visualization task taxonomies and interaction log data by making the taxonomies more actionable for interaction log analysis. To achieve this, we leverage structural parallels between how people express themselves through interactions and through language by reformulating existing theories as regular grammars. We represent interactions as terminals within a regular grammar, similar to the role of individual words in a language, and patterns of interactions (non-terminals) as regular expressions over these terminals, capturing common patterns of use. To demonstrate our approach, we generate regular grammars for seven existing visualization taxonomies and develop code to apply them to three public interaction log datasets. In analyzing these regular grammars, we find that the low-level taxonomies (i.e., terminals) show mixed results in expressing multiple interaction log datasets, and the high-level taxonomies (i.e., regular expressions) have limited expressiveness, primarily due to two challenges: inconsistencies in interaction log dataset granularity and structure, and under-expressiveness of certain terminals.
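The terminals-and-regular-expressions formulation can be sketched as follows (hypothetical interaction names and pattern, not the paper's actual grammars):

```python
import re

# Hypothetical interaction vocabulary: each logged event type becomes a
# single-character terminal in the grammar.
TERMINALS = {"hover": "h", "click": "c", "brush": "b", "zoom": "z"}

# A higher-level pattern (non-terminal) as a regular expression over
# terminals: e.g. an "inspect" behavior = one or more hovers, then a click.
INSPECT = re.compile(r"h+c")

def encode(log):
    """Translate a sequence of interaction events into a terminal string."""
    return "".join(TERMINALS[event] for event in log)

log = ["hover", "hover", "click", "zoom"]
encoded = encode(log)                          # "hhcz"
print(encoded, bool(INSPECT.search(encoded)))  # hhcz True
```

Matching such expressions against an encoded log is what lets a taxonomy's high-level behaviors be detected mechanically in raw interaction data.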
Based on our findings, we suggest new research directions for the visualization community to augment existing taxonomies, develop new ones, and build better interaction log recording processes to facilitate the data-driven development of user behavior taxonomies.

Item: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (Association for Computing Machinery (ACM), 2023-06-21)
Singh, Siddarth; Ruwase, Olatunji; Awan, Ammar Ahmad; Rajbhandari, Samyam; He, Yuxiong; Bhatele, Abhinav

Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4–8× larger base models than the current state of the art. We also describe memory optimizations in the optimizer step, as well as communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs.

Item: A Review and Collation of Graphical Perception Knowledge for Visualization Recommendation (Association for Computing Machinery (ACM), 2023-04-23)
Zeng, Zehua; Battle, Leilani

Selecting appropriate visual encodings is critical to designing effective visualization recommendation systems, yet few findings from graphical perception are typically applied within these systems.
We observe two significant limitations in translating graphical perception knowledge into actionable visualization recommendation rules/constraints: inconsistent reporting of findings and a lack of shared data across studies. How can we translate the graphical perception literature into a knowledge base for visualization recommendation? We present a review of 59 papers that study user perception and performance across ten visual analysis tasks. Through this study, we contribute a JSON dataset that collates existing theoretical and experimental knowledge and summarizes key study outcomes in graphical perception. We illustrate how this dataset can inform automated encoding decisions with three representative visualization recommendation systems. Based on our findings, we highlight open challenges and opportunities for the community in collating graphical perception knowledge for a range of visualization recommendation scenarios.

Item: Absynthe: Abstract Interpretation-Guided Synthesis (Association for Computing Machinery (ACM), 2023-06)
Guria, Sankha Narayan; Foster, Jeffrey S.; Van Horn, David

Synthesis tools have seen significant success in recent times. However, past approaches often require a complete and accurate embedding of the source language in the logic of the underlying solver, an approach that is difficult for industrial-grade languages. Other approaches couple the semantics of the source language with purpose-built synthesizers, necessarily tying the synthesis engine to a particular language model. In this paper, we propose Absynthe, an alternative approach based on user-defined abstract semantics that aims to be both lightweight and language agnostic, yet effective in guiding the search for programs. A synthesis goal in Absynthe is specified as an abstract specification in a lightweight, user-defined abstract domain together with concrete test cases. The synthesis engine is parameterized by the abstract semantics and independent of the source language.
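The abstract-then-concrete filtering idea behind this style of synthesis can be sketched with a toy enumerative synthesizer over a type domain (an illustration of the general technique only; Absynthe's user-defined domains and search are considerably richer):

```python
from itertools import product

# Toy enumerative synthesis guided by an abstract domain of types.

VARS = {"x": "int", "s": "str"}        # leaf expressions and their types
OPS = {"+": ("int", "int", "int")}     # op name: (arg1 type, arg2 type, result type)

def abstract_eval(expr):
    """Abstract semantics: an expression's type, or None if ill-typed."""
    if isinstance(expr, str):
        return VARS.get(expr)
    op, a, b = expr
    ta, tb, tr = OPS[op]
    return tr if abstract_eval(a) == ta and abstract_eval(b) == tb else None

def concrete_eval(expr, env):
    """Concrete semantics, used only to validate surviving candidates."""
    if isinstance(expr, str):
        return env[expr]
    _, a, b = expr
    return concrete_eval(a, env) + concrete_eval(b, env)

def synthesize(goal_type, tests):
    """Depth-1 enumeration: abstract pruning first, then concrete testing."""
    candidates = list(VARS) + [("+", a, b) for a, b in product(VARS, VARS)]
    for c in candidates:
        if abstract_eval(c) != goal_type:   # pruned without ever being run
            continue
        if all(concrete_eval(c, env) == out for env, out in tests):
            return c

print(synthesize("int", [({"x": 3, "s": "a"}, 6)]))  # ('+', 'x', 'x')
```

The abstract pass discards ill-typed candidates cheaply; only the survivors are executed against the tests, which is what makes the search tractable.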
Absynthe validates candidate programs against test cases using the actual concrete language implementation to ensure correctness. We formalize the synthesis rules for Absynthe and describe how the key ideas are scaled up in our implementation in Ruby. We evaluated Absynthe on the SyGuS strings benchmark and found it competitive with other enumerative search solvers. Moreover, Absynthe's ability to combine abstract domains allows the user to move along a cost spectrum, i.e., expressive domains prune more programs but require more time. Finally, to verify that Absynthe can act as a general-purpose synthesis tool, we use Absynthe to synthesize Pandas data frame manipulation programs in Python using simple abstractions such as the types and column labels of a data frame. Absynthe reaches parity with AutoPandas, a deep-learning-based tool, on the same benchmark suite. In summary, our results demonstrate that Absynthe is a promising step toward a general-purpose approach to synthesis that may broaden the applicability of synthesis to more full-featured languages.

Item: Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences (Springer Nature, 2011-07-27)
Liu, Bo; Gibbons, Theodore; Ghodsi, Mohammad; Treangen, Todd; Pop, Mihai

A major goal of metagenomics is to characterize the microbial composition of an environment. The most popular approach relies on 16S rRNA sequencing; however, this approach can generate biased estimates due to differences in the copy number of the gene even between closely related organisms, and due to PCR artifacts. The taxonomic composition can also be determined from metagenomic shotgun sequencing data by matching individual reads against a database of reference sequences. One major limitation of prior computational methods used for this purpose is the use of a universal classification threshold for all genes at all taxonomic levels.
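The contrast between one universal cutoff and cutoffs tuned per gene, taxonomic level, and match length can be illustrated with a small sketch (hypothetical threshold values, not MetaPhyler's trained cutoffs):

```python
# A single universal similarity cutoff versus per-(gene, level, length) cutoffs.
# All numbers below are hypothetical, for illustration only.

UNIVERSAL = 0.80

TUNED = {
    ("rpoB", "genus", 100): 0.92,
    ("rpoB", "phylum", 100): 0.75,
    ("gyrB", "genus", 100): 0.88,
}

def classify(score, gene, level, length, tuned=True):
    """Accept a read's match at `level` if its similarity score clears the cutoff."""
    cutoff = TUNED.get((gene, level, length), UNIVERSAL) if tuned else UNIVERSAL
    return score >= cutoff

# A 0.85 rpoB hit passes the universal cutoff at every level, but a tuned
# classifier rejects the genus-level call while keeping the phylum-level one.
print(classify(0.85, "rpoB", "genus", 100, tuned=False))  # True
print(classify(0.85, "rpoB", "genus", 100))               # False
print(classify(0.85, "rpoB", "phylum", 100))              # True
```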
We propose that better classification results can be obtained by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. We present a novel taxonomic classifier, MetaPhyler (http://metaphyler.cbcb.umd.edu), which uses phylogenetic marker genes as a taxonomic reference. Results on simulated datasets demonstrate that MetaPhyler outperforms other tools commonly used in this context (CARMA, Megan and PhymmBL). We also present interesting results from analyzing a real metagenomic dataset. We have introduced a novel taxonomic classification method for analyzing microbial diversity from whole-metagenome shotgun sequences. Compared with previous approaches, MetaPhyler is much more accurate in estimating the phylogenetic composition. In addition, we have shown that MetaPhyler can be used to guide the discovery of novel organisms from metagenomic samples.

Item: Ada Reusability and Measurement (1990-05)
Basili, Victor R.; Rombach, H. D.; Bailey, J.; Delis, A.

Item: Adaptive Pull-Based Policies for Wide Area Data Delivery (2005-10-19T16:15:35Z)
Bright, Laura; Gal, Avigdor; Raschid, Louiqa

Wide area data delivery requires timely propagation of up-to-date information to thousands of clients over a wide area network. Applications include web caching, RSS source monitoring, and email access via a mobile network. Data sources vary widely in their update patterns and may experience different update rates at different times or unexpected changes to update patterns. Traditional data delivery solutions are either push-based, which requires servers to push updates to clients, or pull-based, which requires clients to check for updates at servers. While push-based solutions ensure timely data delivery, they are not always feasible to implement and may not scale to a large number of clients. In this paper we present adaptive pull-based policies that can reduce the overhead of contacting remote servers compared to existing pull-based policies.
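One simple way a pull policy can adapt is to estimate a source's update interval from its observed history; a minimal sketch (hypothetical policy and numbers, not the paper's algorithms):

```python
from statistics import mean

def next_poll_time(update_history, now, safety=0.5):
    """History-based pull scheduling (illustrative only): estimate the mean
    interval between past updates and schedule the next poll a fraction
    `safety` of that interval after the last known update."""
    intervals = [b - a for a, b in zip(update_history, update_history[1:])]
    candidate = update_history[-1] + safety * mean(intervals)
    return max(candidate, now)  # never schedule a poll in the past

history = [0, 10, 20, 31]               # observed update timestamps
print(next_poll_time(history, now=32))  # ≈ 36.17
```

Polling near the expected update time, rather than on a fixed schedule, is what lets history-based policies cut server contacts while keeping data comparably fresh.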
We model updates to data sources using update histories, and present novel history-based policies to estimate when updates occur. We present a set of architectures to enable rapid deployment of the proposed policies. We develop adaptive policies to handle changes in update patterns, and present two examples of such policies. Extensive experimental evaluation using three data traces from diverse applications shows that history-based policies can reduce contact between clients and servers by up to 60% compared to existing pull-based policies while providing a comparable level of data freshness. Our adaptive policies are further shown to dominate individual history-based policies.

Item: AGORA: Assembly Guided by Optical Restriction Alignment (2012-08-02)
Lin, Henry; Goldstein, Steve; Mendelowitz, Lee; Zhou, Shiguo; Wetzel, Joshua; Schwartz, David C.; Pop, Mihai

Background: Genome assembly is difficult due to repeated sequences within the genome, which create ambiguities and cause the final assembly to be broken up into many separate sequences (contigs). Long-range linking information, such as mate-pairs or mapping data, is necessary to help assembly software resolve repeats, thereby leading to a more complete reconstruction of genomes. Prior work has used optical maps for validating assemblies and scaffolding contigs after an initial assembly has been produced. However, optical maps have not previously been used within the genome assembly process. Here, we use optical map information within the popular de Bruijn graph assembly paradigm to eliminate paths in the de Bruijn graph that are not consistent with the optical map and to help determine the correct reconstruction of the genome. Results: We developed a new algorithm called AGORA: Assembly Guided by Optical Restriction Alignment.
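The consistency idea described here (comparing a candidate reconstruction's in-silico restriction fragments against the optical map) can be sketched as follows; this uses a simplified digest model and a hypothetical tolerance, not AGORA's actual alignment procedure:

```python
def restriction_fragments(seq, site):
    """In-silico digest: fragment lengths from cutting `seq` at each occurrence
    of the recognition `site` (cutting at the site start; a simplification of
    real enzyme cut positions)."""
    cuts, start = [], 0
    while (i := seq.find(site, start)) != -1:
        cuts.append(i)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def consistent(candidate_frags, optical_frags, tol=0.1):
    """A candidate path agrees with the optical map if fragment lengths match
    pairwise within a relative tolerance (optical map sizing is noisy)."""
    return len(candidate_frags) == len(optical_frags) and all(
        abs(c - o) <= tol * o for c, o in zip(candidate_frags, optical_frags))

frags = restriction_fragments("AAGAATTCTTTTGAATTCAA", "GAATTC")
print(frags)                          # [2, 10, 8]
print(consistent(frags, [2, 10, 8]))  # True
print(consistent(frags, [2, 9, 8]))   # False
```

Candidate paths through the graph whose simulated fragment pattern fails this kind of check can be discarded, which is how map information prunes ambiguous reconstructions.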
AGORA is the first algorithm to use optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided. Our simulations on bacterial genomes show that AGORA is effective at producing assemblies closely matching the reference sequences. Additionally, we show that noise in the optical map can have a strong impact on the final assembly quality for some complex genomes, and we also measure how various characteristics of the starting de Bruijn graph may impact the quality of the final assembly. Lastly, we show that a proper choice of restriction enzyme for the optical map may substantially improve the quality of the final assembly. Conclusions: Our work shows that optical maps can be used effectively to assemble genomes within the de Bruijn graph assembly framework. Our experiments also provide insights into the characteristics of the mapping data that most affect the performance of our algorithm, indicating the potential benefit of more accurate optical mapping technologies, such as nano-coding.

Item: Alignment and clustering of phylogenetic markers - implications for microbial diversity studies (2010-03-24)
White, James R.; Navlakha, Saket; Nagarajan, Niranjan; Ghodsi, Mohammad-Reza; Kingsford, Carl; Pop, Mihai

Background: Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, there exists significant variability in the analysis methods employed in these studies.
Results: Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters. Conclusions: Our results highlight the need for systematic and open evaluation of data analysis methodologies, especially as targeted 16S rRNA diversity studies increasingly rely on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website http://fames.jgi-psf.org/.

Item: Analyzing A Syntactic Family of Complexity Metrics (1983-03)
Hutchens, David H.; Basili, Victor R.

Item: Analyzing Error-Prone System Coupling and Cohesion (1988-06)
Selby, Richard W.; Basili, Victor R.

Item: Assembly complexity of prokaryotic genomes using short reads (2010-01-12)
Kingsford, Carl; Schatz, Michael C.; Pop, Mihai

Background: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot.
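The de Bruijn graph framework underlying this analysis can be illustrated with a minimal sketch (toy sequences; real assemblers must also handle reverse complements, sequencing errors, and coverage):

```python
from collections import defaultdict

def de_bruijn(seq, k):
    """Nodes are (k-1)-mers; each k-mer contributes an edge prefix -> suffix."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def is_uniquely_assemblable(graph):
    """Reconstruction is unambiguous when no node has two distinct successors
    (a simplification of the full Eulerian-path condition)."""
    return all(len(set(succs)) <= 1 for succs in graph.values())

print(is_uniquely_assemblable(de_bruijn("AACGT", 4)))      # True
print(is_uniquely_assemblable(de_bruijn("ACGTACGTT", 4)))  # False: repeat "ACGT" branches
```

A repeat longer than k-1 creates a branching node, which is exactly what limits assemblers at a given read length.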
The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies of the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.

Item: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments (Genome Biology, 2008-01)
Haas, Brian U.; Salzberg, Steven L.; Zhu, Wei; Pertea, Mihaela; Allen, Jonathan E.; Orvis, Joshua; White, Owen; Buell, C. Robin; Wortman, Jennifer R.

EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

Item: Automating NISQ Application Design with Meta Quantum Circuits with Constraints (MQCC) (Association for Computing Machinery (ACM), 2023-04)
Deng, Haowei; Peng, Yuxiang; Hicks, Michael; Wu, Xiaodi

Near-term intermediate-scale quantum (NISQ) computers are likely to have very restricted hardware resources, where precisely controllable qubits are expensive, error-prone, and scarce. Programmers of such computers must therefore balance trade-offs among a large number of (potentially heterogeneous) factors specific to the targeted application and quantum hardware. To assist them, we propose Meta Quantum Circuits with Constraints (MQCC), a meta-programming framework for quantum programs.
Programmers express their application as a succinct collection of normal quantum circuits stitched together by a set of (manually or automatically) added meta-level choice variables, whose values are constrained according to a programmable set of quantitative optimization criteria. MQCC's compiler generates the appropriate constraints and solves them via an SMT solver, producing an optimized, runnable program. We showcase several applications of MQCC to demonstrate its generality, including the automatic generation of efficient error syndrome extraction schemes for fault-tolerant quantum error correction with heterogeneous qubits, and an approach to writing approximate quantum Fourier transformation and quantum phase estimation that smoothly trades off accuracy and resource use. We also illustrate that MQCC can easily encode prior one-off NISQ application designs, such as multi-programming (MP) and crosstalk mitigation (CM), as well as a combination of their optimization goals (i.e., a combined MP-CM).

Item: Can RNA-Seq Resolve the Rapid Radiation of Advanced Moths and Butterflies (Hexapoda: Lepidoptera: Apoditrysia)? An Exploratory Study (PLoS One, 2013-12-04)
Bazinet, Adam L.; Cummings, Michael P.; Mitter, Kim T.; Mitter, Charles W.

Recent molecular phylogenetic studies of the insect order Lepidoptera have robustly resolved family-level divergences within most superfamilies, and most divergences among the relatively species-poor early-arising superfamilies. In sharp contrast, relationships among the superfamilies of more advanced moths and butterflies that comprise the mega-diverse clade Apoditrysia (ca. 145,000 spp.) remain mostly poorly supported. This uncertainty, in turn, limits our ability to discern the origins, ages, and evolutionary consequences of traits hypothesized to promote the spectacular diversification of Apoditrysia. Low support along the apoditrysian "backbone" probably reflects rapid diversification.
If so, it may be feasible to strengthen resolution by radically increasing the gene sample, but case studies have been few. We explored the potential of next-generation sequencing to conclusively resolve apoditrysian relationships. We used transcriptome RNA-Seq to generate 1579 putatively orthologous gene sequences across a broad sample of 40 apoditrysians plus four outgroups, to which we added two taxa from previously published data. Phylogenetic analysis of a 46-taxon, 741-gene matrix, resulting from a strict filter that eliminated ortholog groups containing any apparent paralogs, yielded a dramatic overall increase in bootstrap support for deeper nodes within Apoditrysia as compared to results from previous and concurrent 19-gene analyses. High support was restricted mainly to the huge subclade Obtectomera broadly defined, in which 11 of 12 nodes subtending multiple superfamilies had bootstrap support of 100%. The strongly supported nodes showed little conflict with groupings from previous studies, and were little affected by changes in taxon sampling, suggesting that they reflect true signal rather than artifacts of massive gene sampling. In contrast, strong support was seen at only 2 of 11 deeper nodes among the "lower", non-obtectomeran apoditrysians. These represent a much harder phylogenetic problem, for which one path to resolution might include a further increase in gene sampling, together with improved orthology assignments.

Item: CANARD: A dataset for Question-in-Context Rewriting (2019-11-03)
Ghoneim, Ahmed Elgohary; Peskov, Denis; Boyd-Graber, Jordan

In conversational question answering, multiple questions appear in an information-seeking dialog, which requires models to link questions together to resolve the conversational dependencies between them: each question needs to be understood in the conversation context.
For example, the question "What was he like in that episode?" cannot be understood without knowing what "he" and "that episode" refer to, which can be resolved using the conversation context. CANARD is a dataset of 40,000 questions asked in conversational contexts, paired with their gold context-independent (stand-alone) rewrites.

Item: Challenges of Navigational Queries: Finding Best Paths in Graphs (2005-10-06T16:30:25Z)
Raschid, Louiqa; Vidal, Maria-Esther; Wu, Yao; Cardenas, Marelis; Marquez, Natalia

Life science sources are characterized by a complex graph of overlapping sources and multiple alternate links between sources. A (navigational) query may be answered by traversing multiple alternate paths between an origin and a target source. Paths may be characterized by several metrics, including the cardinality of objects of the target source (TOC), the cost of query evaluation of a plan for the path, and the user's preference for specific paths. Our challenge is finding the best paths among the set of all solutions, AllPaths, that meet some user-specified ranking criteria. If the user ranking criteria are strict, then the problem is to find the Top K paths. If the user wants a trade-off of several metrics, then the problem is to find the Skyline paths that are not dominated by other paths. NSearch is a naive solution. BFSrchOpt is a heuristic best-first search strategy. It uses a metric to rank partial solutions (subpaths) and (local) metrics to guide graph traversal, and produces BFPaths. We compare the precision and recall of BFPaths against the Top K% or Skyline of AllPaths. We study the impact of graph properties on the behavior of BFSrchOpt. BFSrchOpt can be orders of magnitude faster than NSearch.

Item: Characterizing Resource Data: A Model for Logical Association of Software Data (1987-05)
Jeffrey, D. Ross; Basili, Victor R.

Item: CLEANROOM Software Development: An Empirical Evaluation (1985-02)
Selby, Richard W., Jr.; Basili, Victor R.; Baker, F. Terry

Item: CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing (Springer Nature, 2011-08-30)
Angiuoli, Samuel V.; Matalka, Malcolm; Gussman, Aaron; Galens, Kevin; Vangala, Mahesh; Riley, David R.; Arze, Cesar; White, James R.; White, Owen; Fricke, W. Florian

Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software. We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome, and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources, and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. The CloVR VM and associated architecture lower the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high-throughput data processing.