Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
7 results
Search Results
Item SIMULATION, REPRESENTATION, AND AUTOMATION: HUMAN-CENTERED ARTIFICIAL INTELLIGENCE FOR AUGMENTING VISUALIZATION DESIGN (2024) Shin, Sungbok; Elmqvist, Niklas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Data visualization is a powerful strategy for using graphics to represent data for effective communication and analysis. Unfortunately, creating effective data visualizations is a challenge for both novice and expert designers. The task often involves an iterative process of trial and error, which, by its nature, is time-consuming. Designers frequently seek feedback to ensure their visualizations convey the intended message clearly to their target audience. However, obtaining feedback from peers can be challenging, and alternatives like user studies or crowdsourcing are costly and time-consuming. This suggests the potential for a tool that can provide design feedback for visualizations. To that end, I create a virtual, human vision-inspired system that looks at a visualization design and provides feedback on it using various AI techniques. The goal is not to replicate an exact version of a human eye. Instead, my work aims to develop a practical and effective system that delivers design feedback to visualization designers, utilizing advanced AI techniques such as deep neural networks (DNNs) and large language models (LLMs). My thesis includes three distinct works, each aimed at developing a virtual system inspired by human vision using AI techniques. Specifically, these works focus on simulation, representation, and automation, collectively progressing toward that aim. First, I develop a methodology to simulate human perception in machines through a virtual eye tracker named A SCANNER DEEPLY. This involves gathering eye gaze data from chart images and training a DNN on them.
Second, I focus on effectively and pragmatically representing a virtual human vision-inspired system by creating PERCEPTUAL PAT, which includes a suite of perceptually based filters. Third, I automate the feedback generation process with VISUALIZATIONARY, leveraging large language models to enhance the automation. I report on challenges and lessons learned about the key components and design considerations that help visualization designers. Finally, I end the dissertation by discussing future research directions for using AI to augment the visualization design process.
Item ECOLOGICAL APPLICATIONS OF MACHINE LEARNING TO DIGITIZED NATURAL HISTORY DATA (2022) Robillard, Alexander John; Rowe, Christopher; Bailey, Helen; Marine-Estuarine-Environmental Sciences; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Natural history collections are a valuable resource for assessing biodiversity and species decline. Over the past few decades, digitization of specimens has increased the accessibility and value of these collections. As such, the number and size of these digitized data sets have outpaced the tools needed to evaluate them. To address this, researchers have turned to machine learning to automate data-driven decisions. Specifically, applications of deep learning to complex ecological problems are becoming more common. This dissertation aims to contribute to this trend by addressing, in three distinct chapters, conservation, evolutionary, and ecological questions using deep learning models. For example, in the first chapter we focus on the sale and distribution of hawksbill sea turtle-derived products, which continues internationally in physical and online marketplaces despite current regulations prohibiting it. To curb the sale of illegal tortoiseshell, application of new technologies such as convolutional neural networks (CNNs) is needed.
Therein we describe a curated data set (n = 4,428) that was used to develop a CNN application we call “SEE Shell,” which can identify real and faux hawksbill-derived products from image data. Developed on a MobileNetV2 using TensorFlow, SEE Shell was tested against a validation (n = 665) and a test (n = 649) set, where it achieved between 82.6% and 92.2% accuracy depending on the certainty threshold used. We expect SEE Shell will give potential buyers more agency in their purchasing decisions, in addition to enabling retailers to rapidly filter their online marketplaces. In the second chapter we focus on recent research that utilized geometric morphometrics, associated genetic data, and Principal Component Analysis to successfully delineate Chelonia mydas (green sea turtle) morphotypes from carapace measurements. Therein we demonstrate a similar, yet more rapid, approach to this analysis using computer vision models. We applied a U-Net to isolate carapace pixels in images (n = 204) of juvenile C. mydas from multiple foraging grounds across the Eastern Pacific, Western Pacific, and Western Atlantic. These images were then sorted based on general alignment (shape) and coloration of the pixels within the image using a pre-trained computer vision model (MobileNetV2). The dimensions of these data were then reduced and projected using Uniform Manifold Approximation and Projection (UMAP). Associated vectors were then compared to simple genetic distance using a Mantel test. Data points were then labeled post hoc for exploratory analysis. We found clear congruence between carapace morphology and genetic distance between haplotypes, suggesting that our image data have biological relevance. Our findings also suggest that carapace morphotype is associated with specific haplotypes within C. mydas. Our cluster analysis (k = 3) corroborates past research suggesting there are at least three morphotypes across the Eastern Pacific, Western Pacific, and Western Atlantic.
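The morphology-genetics comparison above hinges on the Mantel test, which measures correlation between two distance matrices and assesses significance by permuting the row/column labels of one matrix. A minimal pure-Python sketch on small invented matrices (the numbers are illustrative, not the dissertation's data):

```python
import random

def _upper(mat, order):
    """Flatten the upper triangle of a distance matrix under a row order."""
    n = len(order)
    return [mat[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Mantel test: correlation between two distance matrices plus a
    permutation p-value (shuffle the labels of one matrix)."""
    rng = random.Random(seed)
    idx = list(range(len(dist_a)))
    observed = _pearson(_upper(dist_a, idx), _upper(dist_b, idx))
    hits = 0
    for _ in range(permutations):
        perm = idx[:]
        rng.shuffle(perm)
        if _pearson(_upper(dist_a, perm), _upper(dist_b, idx)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Toy example: two 4x4 distance matrices that agree strongly.
morph = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 1], [5, 4, 1, 0]]
genet = [[0, 2, 8, 9], [2, 0, 7, 8], [8, 7, 0, 2], [9, 8, 2, 0]]
r, p = mantel(morph, genet)
```

A high `r` with a low `p` indicates that samples close in one matrix tend to be close in the other, which is the sense of "congruence" reported above.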
Finally, in the third chapter we discuss the sharp increase in agricultural and infrastructure development around the Amazon and the paucity of widespread data available to support conservation management decisions there. To address these issues, we outline a more rapid and accurate tool for identifying fish fauna in the world's largest freshwater ecosystem. Current strategies for identification of freshwater fishes require high levels of training and taxonomic expertise for morphological identification, or genetic testing for species recognition at a molecular level. To overcome these challenges, we built an image masking model (U-Net) and a CNN to mask and classify Amazonian fish in photographs. Fish used to generate training data were collected and photographed in tributaries in seasonally flooded forests of the upper Morona River valley in Loreto, Peru in 2018 and 2019. Species identifications in the training images (n = 3,068) were verified by expert ichthyologists. These images were supplemented with photographs of additional Amazonian fish specimens housed in the ichthyological collection of the Smithsonian’s National Museum of Natural History. We generated a CNN model that identified 33 genera of fishes with a mean accuracy of 97.9%.
Wider availability of accurate freshwater fish image recognition tools, such as the one described here, will enable fishermen, local communities, and citizen scientists to more effectively participate in collecting and sharing data from their territories to inform policy and management decisions that impact them directly.
Item National-Level Origin-Destination Estimation Based on Passively Collected Location Data and Machine Learning Methods (2021) Pan, Yixuan; Zhang, Lei; Civil Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Along with the development of information and positioning technologies, there has emerged passively collected location data containing location observations with time information from various types of mobile devices. Passive location data are known for their large sample size and continuous behavior observations. However, they also require careful and comprehensive data processing and modeling algorithms for privacy protection and practical applications. In the meantime, the estimation of origin-destination travel demand tables is fundamental in transportation planning, yet no existing national origin-destination estimate provides time-dependent travel behaviors for all travel modes. Passively collected location data appeal to researchers for their potential to serve as the data source for large-scale multimodal travel demand estimation and monitoring. This research proposes a comprehensive set of methods for passive location data processing, including data cleaning, activity location and purpose identification, trip-level information identification, sociodemographic imputation, sample weighting and expansion, and demand validation. For each task, the thesis evaluates state-of-the-practice and state-of-the-art algorithms and develops an applicable method that jointly considers the different features of various passive location data sources and the imputation accuracy.
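Activity location identification from passive pings is commonly approached with stay-point detection: a run of consecutive observations that remains within a distance threshold for at least a minimum duration is collapsed into one candidate activity location. A simplified sketch, assuming planar coordinates and illustrative thresholds (not the thesis's actual algorithm or parameters):

```python
import math

def detect_stays(pings, dist_m=200.0, min_dur_s=300.0):
    """Collapse runs of nearby pings into candidate activity locations.

    pings: list of (t_seconds, x_m, y_m) in a local planar frame.
    Returns (t_start, t_end, cx, cy) tuples for each detected stay.
    """
    stays = []
    i, n = 0, len(pings)
    while i < n:
        j = i + 1
        # Grow the window while every ping stays near the anchor ping i.
        while j < n and math.dist(pings[i][1:], pings[j][1:]) <= dist_m:
            j += 1
        t0, t1 = pings[i][0], pings[j - 1][0]
        if t1 - t0 >= min_dur_s:
            cx = sum(p[1] for p in pings[i:j]) / (j - i)
            cy = sum(p[2] for p in pings[i:j]) / (j - i)
            stays.append((t0, t1, cx, cy))
            i = j
        else:
            i += 1
    return stays

# Toy trace: ~9 min near the origin, a move, then ~11 min near (1000, 0).
trace = [(t, 0.0, 0.0) for t in range(0, 600, 60)]
trace += [(700, 500.0, 0.0)]
trace += [(t, 1000.0, 0.0) for t in range(800, 1500, 60)]
stays = detect_stays(trace)
print(len(stays))  # → 2
```

Trips then fall out naturally as the movement between consecutive stays, which is one common basis for the trip-level identification step listed above.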
The thesis further examines the viability of the method kit in a national-level case study and successfully derives national-level origin-destination estimates with additional data products, such as trip rate and vehicle miles traveled, at different geographic levels and temporal resolutions.
Item Three Variations of Precision Medicine: Gene-Aware Genome Editing, Ancestry-Aware Molecular Diagnosis, and Clone-Aware Treatment Planning (2021) Sinha, Sanju; Ruppin, Eytan; Mount, Steve; Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
During my Ph.D., I developed several computational approaches to advance precision medicine for cancer prevention and treatment. My thesis presents three such approaches, addressing these emerging challenges by analyzing large-scale cancer omics data from both pre-clinical models and patient datasets. In the first project, we studied the cancer risk associated with CRISPR-based therapies. Therapeutics based on CRISPR technologies (for which the 2020 Nobel Prize in Chemistry was awarded) are poised to become widely applicable for treating a variety of human genetic diseases. However, preceding our work, two experimental studies reported that genome editing by CRISPR-Cas9 can induce a DNA damage response mediated by p53 in primary cells, hampering their growth. This could lead to an undesired selection of cells with pre-existing p53 mutations. Motivated by these findings, we conducted the first comprehensive computational and experimental investigation of the risk of CRISPR-induced selection of cancer gene mutants across many different cell types and lineages. I further studied whether this selection depends on the Cas9/sgRNA-delivery method and/or the gene being targeted.
Importantly, we asked whether other cancer driver mutations may also be selected during CRISPR-Cas9 gene editing and identified that pre-existing KRAS mutants may also be selected for during CRISPR-Cas9 editing. In summary, we established that the risk of selecting for pre-existing p53 or KRAS mutations is non-negligible, calling for careful monitoring of patients undergoing CRISPR-Cas9-based clinical therapeutics for pre-existing p53 and KRAS mutations. In the second project, we aimed to delineate some of the molecular mechanisms that may underlie the observed differences in cancer incidence across patients of different ancestries, focusing mainly on lung cancer. We found that lung tumors from African American (AA) patients exhibit higher genomic instability, homologous recombination deficiency, and aggressive molecular features such as chromothripsis. We next demonstrated that these molecular differences extend to many other cancer types. The prevalence of germline homologous recombination deficiency (HRD) is also higher in tumors from AAs, suggesting that at least some of the somatic differences observed may have genetic origins. Importantly, our findings provide a therapeutic strategy for treating tumors from AAs with high HRD using agents such as PARP and checkpoint inhibitors, which is now being further explored by our experimental collaborators. In the third project, we developed a new computational framework that leverages single-cell RNA-seq from patients' tumors to guide optimal combination treatments that can target multiple clones in the tumor. We first showed that our predicted viability profiles for multiple cancer drugs significantly correlate with their targeted pathway activity at single-cell resolution, as one would expect. We applied this framework to predict the response to monotherapy and combination treatment in cell lines, patient-derived cell lines, and, most importantly, a clinical trial of multiple myeloma patients.
Following these validations, we next charted the landscape of optimal combination treatments of existing FDA-approved drugs in multiple myeloma, providing a resource that could be used to guide combination trials. Taken together, these results demonstrate the power of multi-omics analysis of cancer data: to identify a potential cancer risk and a strategy to mitigate it, to shed light on the molecular mechanisms underlying cancer disparities in AA patients and point to possible ways to improve their treatment, and, finally, to develop a new approach to treating cancer patients based on single-cell transcriptomics of their tumors.
Item A COMPREHENSIVE EVALUATION OF FEATURE-BASED MALICIOUS WEBSITE DETECTION (2020) McGahagan, John Francis; Cukier, Michel; Reliability Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Although the internet enables many important functions of modern life, it is also a ground for nefarious activity by malicious actors and cybercriminals. For example, malicious websites facilitate phishing attacks, malware infections, data theft, and disruption. A major component of cybersecurity is to detect and mitigate attacks enabled by malicious websites. Although prior researchers have presented promising results – specifically in the use of website features to detect malicious websites – malicious website detection continues to pose major challenges. This dissertation presents an investigation into feature-based malicious website detection. We conducted six studies on malicious website detection, focusing on discovering new features for detection, challenging assumptions about features from prior research, comparing the importance of the features, building and evaluating detection models over various scenarios, and evaluating detection models across different datasets and over time.
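Lexical URL features are one common example of the kind of website features such detectors consume. A small sketch of feature extraction; these particular features (lengths, digit and hyphen counts, IP-style hosts) are illustrative of the genre, not the dissertation's feature set:

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract simple lexical features from a URL for a detection model.

    The feature choices here are hypothetical examples, not the
    features studied in the dissertation.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),
        "num_digits": sum(c.isdigit() for c in url),
        "num_hyphens": host.count("-"),
        "host_is_ip": host.replace(".", "").isdigit() and host.count(".") == 3,
        "uses_https": parsed.scheme == "https",
    }

# A suspicious-looking example URL (hypothetical).
feats = url_features("http://login-secure.example.com.evil.test/verify?id=123")
```

Feature vectors like this, one per website, are what feeds the model building, feature-selection, and cross-dataset comparisons described above.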
We evaluated this approach on various datasets, including a dataset composed of several threats from industry; a dataset derived from the Alexa top one million domains, supplemented with open-source threat intelligence information; and a dataset consisting of websites gathered repeatedly over time. Results led us to postulate that new, unstudied features could be incorporated to improve malicious website detection models, since, in many cases, models built with new features outperformed models built from features used in prior research, and did so with fewer features. We also found that features discovered using feature selection could be applied to other datasets with minor adjustments. In addition, we demonstrated that the performance of detection models decreased over time, measured the change of websites in relation to our detection model, and demonstrated the benefit of re-training in various scenarios.
Item Enabling Collaborative Visual Analysis across Heterogeneous Devices (2019) Badam, Sriram Karthik; Elmqvist, Niklas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
We are surrounded by novel device technologies emerging at an unprecedented pace. These devices are heterogeneous in nature, coming in large and small sizes with many input and sensing mechanisms. When many such devices are used by multiple users with a shared goal, they form a heterogeneous device ecosystem. A device ecosystem has great potential in data science as a natural medium for multiple analysts to make sense of data using visualization. This is essential, as today's big data problems require more than a single mind or a single machine to solve. Toward this vision, I introduce the concept of collaborative, cross-device visual analytics (C2-VA) and outline a reference model for developing user interfaces for C2-VA.
This dissertation covers interaction models, coordination techniques, and software platforms to enable full-stack support for C2-VA. First, we connected devices to form an ecosystem using software primitives introduced in the early frameworks from this dissertation. To work in a device ecosystem, we designed multi-user interaction for visual analysis in front of large displays by finding a balance between proxemics and mid-air gestures. Extending these techniques, we considered the roles of different devices, large and small, to present a conceptual framework for utilizing multiple devices for visual analytics. When applying this framework, findings from a user study showcase flexibility in the analytic workflow and potential for the generation of complex insights in device ecosystems. Beyond this, we supported coordination between multiple users in a device ecosystem by depicting the presence, attention, and data coverage of each analyst within a group. Building on these parts of the C2-VA stack, the culmination of this dissertation is a platform called Vistrates. This platform introduces a component model for the modular creation of user interfaces that work across multiple devices and users. A component is an analytical primitive (a data processing method, a visualization, or an interaction technique) that is reusable, composable, and extensible. Together, components can support a complex analytical activity. On top of the component model, support for collaboration and device ecosystems comes for free in Vistrates.
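The component idea can be illustrated with a small analogy: analytical primitives sharing a uniform interface so that processing, visualization, and interaction pieces compose into a pipeline. This is a hypothetical Python sketch of the concept only; Vistrates itself is a web-based platform, not this API:

```python
class Component:
    """An analytical primitive with a uniform run() interface, so that
    data processing, visualization, and interaction pieces compose."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, data):
        return self.fn(data)

    def then(self, other):
        # Composition: the output of this component feeds the next one.
        return Component(f"{self.name}->{other.name}",
                         lambda data: other.run(self.run(data)))

# Hypothetical primitives: a filter, an aggregator, and a text "view".
select_positive = Component("filter", lambda xs: [x for x in xs if x > 0])
mean = Component("aggregate", lambda xs: sum(xs) / len(xs))
bar = Component("view", lambda v: "#" * round(v))

pipeline = select_positive.then(mean).then(bar)
print(pipeline.run([-2, 1, 3, 8]))  # → ####
```

The point of the pattern is that each piece is reusable on its own and extensible by wrapping, mirroring the reusable/composable/extensible properties described above.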
Overall, this enables the exploration of new research ideas within C2-VA.
Item AIRSPACE PLANNING FOR OPTIMAL CAPACITY, EFFICIENCY, AND SAFETY USING ANALYTICS (2019) Ayhan, Samet; Samet, Hanan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Air Navigation Service Providers (ANSPs) worldwide have been making a considerable effort to develop better methods for planning optimal airspace capacity, efficiency, and safety. These goals require separation and sequencing of aircraft before they depart. Prior approaches have tactically achieved these goals to some extent. However, dealing with increasingly congested airspace and new environmental factors with high levels of uncertainty remains a challenge when a deterministic approach is used. Hence, given the nature of these uncertainties, we take a stochastic approach and propose a suite of analytics models for (1) Flight Time Prediction, (2) Aircraft Trajectory Clustering, (3) Aircraft Trajectory Prediction, and (4) Aircraft Conflict Detection and Resolution, all long before aircraft depart. The suite of data-driven models runs on a scalable data management system that continuously processes streaming massive flight data to achieve strategic airspace planning for optimal capacity, efficiency, and safety. (1) Flight Time Prediction. Unlike other systems that collect and use features only for the arrival airport to build a data-driven model for predicting flight times, we use a richer set of features along the potential route, such as weather parameters and air traffic data, in addition to those particular to the arrival airport. Our feature engineering process generates an extensive set of multidimensional time series data, which goes through time series clustering with Dynamic Time Warping (DTW) to generate a single set of representative features at each time instance.
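DTW is used here because time series of the same phenomenon can be out of phase; it finds the lowest-cost alignment between two series via dynamic programming. A minimal sketch of the DTW distance, using absolute difference as the local cost and no warping-window constraint (simplifications relative to any production setup):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series, with
    absolute difference as the local cost and no warping window."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = cost of the best alignment of a[:i] and b[:j].
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # match step
    return dp[n][m]

# A time-shifted copy aligns perfectly under warping.
print(dtw_distance([0, 1, 2, 3, 3], [0, 0, 1, 2, 3]))  # → 0.0
```

Pairwise DTW distances like this are what a time-series clustering step can consume in place of plain Euclidean distance, which would penalize the phase shift.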
The features are fed into various regression and deep learning models, and the best-performing models with the most accurate ETA predictions are selected. Evaluations on an extensive set of real trajectory, weather, and airport data in Europe verify that our prediction system generates more accurate ETAs with far less variance than those of the European ANSP, EUROCONTROL. This translates to more accurately predicted flight arrival times, enabling airlines to make more cost-effective ground resource allocations and ANSPs to make more efficient flight schedules. (2) Aircraft Trajectory Clustering. The novel divide-cluster-merge (DICLERGE) system clusters aircraft trajectories by dividing them into the three standard major flight phases: climb, en-route, and descent. Trajectory segments in each phase are clustered in isolation, then merged together. Our unique approach also discovers a representative trajectory, the model for the entire trajectory set. (3) Aircraft Trajectory Prediction. Our approach considers airspace as a 3D grid network, where each grid point is the location of a weather observation. We hypothetically build cubes around these grid points, so the entire airspace can be considered a set of cubes. Each cube is defined by its centroid, the original grid point, and associated weather parameters that remain homogeneous within the cube during a period of time. Then, we align raw trajectories to a set of cube centroids, which are fixed 3D positions independent of the trajectory data. This creates a new form of trajectory composed of 4D joint cubes, where each cube is a segment associated not only with spatio-temporal attributes but also with weather parameters. Next, we exploit machine learning techniques to train inference models from historical data and apply a stochastic model, a Hidden Markov Model (HMM), to predict trajectories while taking environmental uncertainties into account.
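The HMM decoding at the heart of this kind of prediction is the standard Viterbi recursion: track the most probable state path given a sequence of observations. A self-contained sketch with toy states and invented probabilities (purely illustrative, not the dissertation's model):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence
    under a discrete HMM (standard Viterbi dynamic program)."""
    # best[t][s]: probability of the best path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            best[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy model: a hidden weather regime inferred from observed delays.
states = ("calm", "storm")
start = {"calm": 0.7, "storm": 0.3}
trans = {"calm": {"calm": 0.8, "storm": 0.2},
         "storm": {"calm": 0.3, "storm": 0.7}}
emit = {"calm": {"on_time": 0.9, "delayed": 0.1},
        "storm": {"on_time": 0.2, "delayed": 0.8}}
print(viterbi(["on_time", "delayed", "delayed"], states, start, trans, emit))
```

In a trajectory setting, the hidden states would be the spatio-temporal cubes described above and the observations the clustered weather features, but the recursion is the same.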
During the process, we apply time series clustering to generate input observations from an excessive set of weather parameters to feed into the Viterbi algorithm. The experiments use a real trajectory dataset with pertinent weather observations and demonstrate the effectiveness of our approach to the trajectory prediction process for Air Traffic Management. (4) Aircraft Conflict Detection. We propose a novel data-driven system to address the long-range aircraft conflict detection and resolution (CDR) problem. Given a set of predicted trajectories, the system declares a conflict when the protected zone of an aircraft on its trajectory is infringed upon by another aircraft. The system resolves the conflict by prescribing an alternative solution that is optimized by perturbing at least one of the trajectories involved in the conflict. To achieve this, the system learns from descriptive patterns of historical trajectories and pertinent weather observations and builds a Hidden Markov Model (HMM). Using a variant of the Viterbi algorithm, the system avoids the airspace volume in which the conflict is detected and generates a new, conflict-free optimal trajectory. The key concept upon which the system is built is the assumption that the airspace is nothing more than a horizontally and vertically concatenated set of spatio-temporal data cubes, where each cube is considered an atomic unit. We evaluate the system using real trajectory datasets with pertinent weather observations from two continents and demonstrate its effectiveness for strategic CDR. Overall, in this thesis, we develop a suite of analytics models and algorithms to accurately identify current patterns in massive flight data and use these patterns to predict future behaviors in the airspace. Upon prediction of a non-ideal outcome, we prescribe a solution to plan airspace for optimal capacity, efficiency, and safety.