Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
13 results
Search Results
Item Improving and validating computational algorithms for the assembly, clustering, and taxonomic classification of microbial communities (2024) Luan, Tu; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Recent high-throughput sequencing technologies have advanced the study of microbial communities; nonetheless, analyzing the resulting large datasets still poses challenges. This dissertation focuses on developing and validating computational algorithms to address these challenges in the assembly, clustering, and taxonomic classification of microbial communities. We first introduce a novel reference-guided metagenomic assembly approach that delivers assemblies that generally outperform de novo assembly in quality without a significant increase in runtime. Next, we propose SCRAPT, an iterative sampling-based algorithm designed to cluster 16S rRNA gene sequences from large datasets efficiently. In addition, we validate a comprehensive set of genome assembly pipelines using Oxford Nanopore sequencing, achieving near-perfect accuracy through the combination of long- and short-read polishing tools. Our research improves the accuracy and efficiency of analyzing complex microbial communities. This dissertation offers insights into the composition and structure of these communities, with potential implications for human, animal, and plant health.

Item A Comparative Study Of Outlier Detection Methods And Their Downstream Effects (2024) Adipudi, Vikram; Herrmann, Jeffrey W.; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
When fitting machine learning models to datasets, outliers can cause mistakes such as overfitting. These mistakes can lead to incorrect predictions and diminish the usefulness of the model. Outlier detection is conducted as a precursor step to avoid such errors and to improve the performance of the model. This study compares how different outlier detection methods impact regression, classification, and clustering methods, in order to identify which outlier detection method performs best in conjunction with each task. To conduct this study, multiple outlier detection algorithms were used to clean datasets, and the cleaned data was fed into the models. The performance of each model with and without cleaning was compared to identify trends. This study found that outlier detection of any kind has little impact on supervised tasks such as regression and classification. For the unsupervised task, the outlier detection and removal algorithms that most improved clustering varied by clustering model; most commonly, IForest and PCA had the greatest impact on clustering methods.
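As a rough, self-contained illustration of this kind of comparison (not the study's actual models or datasets), the sketch below removes outliers with IsolationForest and with a PCA reconstruction-error rule, then compares k-means silhouette scores on synthetic data:

```python
# Illustrative sketch: compare how two outlier detectors -- IsolationForest
# ("IForest") and a PCA reconstruction-error rule -- affect k-means clustering
# quality on synthetic data with injected outliers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)
X = np.vstack([X, rng.uniform(-25, 25, size=(25, 2))])  # inject outliers

def score(data):
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

# IForest: keep points the forest labels as inliers (+1).
keep_if = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1

# PCA: keep points whose reconstruction error is below the 95th percentile.
pca = PCA(n_components=1).fit(X)
err = np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)
keep_pca = err < np.percentile(err, 95)

print(f"no cleaning: {score(X):.3f}")
print(f"IForest:     {score(X[keep_if]):.3f}")
print(f"PCA:         {score(X[keep_pca]):.3f}")
```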
Item On Algorithmic Fairness and Stochastic Models for Combinatorial Optimization and Unsupervised Machine Learning (2022) Tsepenekas, Leonidas; Srinivasan, Aravind; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Combinatorial optimization and unsupervised machine learning problems have been extensively studied and are relatively well understood. Examples of such problems that play a central role in this work are clustering problems and problems of finding cuts in graphs. The goal of the research presented in this dissertation is to introduce novel variants of the aforementioned problems, by generalizing their classic variants in two, not necessarily disjoint, directions. The first direction involves incorporating fairness aspects into a problem's specifications, and the second involves the introduction of some form of randomness in the problem definition, e.g., stochastic uncertainty about the problem's parameters. Fairness in the design of algorithms and in machine learning has received a significant amount of attention during the last few years, mainly due to the realization that standard optimization approaches can frequently lead to severely unfair outcomes that can potentially hurt the individuals or the groups involved in the corresponding application. As far as considerations of fairness are concerned, in this work we begin by presenting two novel individually-fair clustering models, together with algorithms with provable guarantees for them. The first such model exploits randomness in order to provide fair solutions, while the second is purely deterministic. The high-level motivation behind both of them is to treat similar individuals similarly. Moving forward, we focus on a graph cut problem that captures situations of disaster containment in a network. For this problem we introduce two novel fair variants: the first focuses on demographic fairness, while the second considers a probabilistic notion of individual fairness. Again, we give algorithms with provable guarantees for the newly introduced variants. In the next part of this thesis we turn our attention to generalizing problems through the introduction of stochasticity. First, we present algorithmic results for a computational epidemiology problem whose goal is to control the stochastic diffusion of a disease in a contact network. This problem can be interpreted as a stochastic generalization of a static graph cut problem. Finally, this dissertation also includes work on a well-known paradigm in stochastic optimization, namely the two-stage stochastic setting with recourse. Two-stage problems capture a wide variety of applications revolving around the trade-off between provisioning and rapid response. In this setting, we present a family of clustering problems that had not yet been studied in the literature, and for this family we show novel algorithmic techniques that provide constant-factor approximation algorithms. We conclude the dissertation with a discussion of open problems and future research directions in the general area of algorithmic fairness.
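One common way the "treat similar individuals similarly" requirement is formalized for randomized clustering is to ask that two individuals' assignment distributions differ by no more than (a scaled multiple of) their distance. The checker below is a hedged toy illustration of that condition, not an algorithm from the dissertation:

```python
# Hedged sketch of an individual-fairness condition for randomized clustering:
# each point i gets a distribution P[i] over the k centers, and for every pair
# (i, j) the total-variation distance between P[i] and P[j] should not exceed
# scale * d(i, j). Illustrative only; not the dissertation's model.
import numpy as np

def is_individually_fair(P, D, scale=1.0):
    """P: (n, k) row-stochastic assignment probabilities.
       D: (n, n) pairwise distances between individuals."""
    n = P.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            tv = 0.5 * np.abs(P[i] - P[j]).sum()   # total-variation distance
            if tv > scale * D[i, j] + 1e-9:
                return False, (i, j, tv, D[i, j])
    return True, None

# Toy example: two nearby points with very different assignment distributions
# violate the condition.
P = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
D = np.array([[0.0, 0.1, 2.0], [0.1, 0.0, 2.0], [2.0, 2.0, 0.0]])
print(is_individually_fair(P, D))   # (False, ...) -- points 0 and 1 are too close
```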
Item Computational approaches for improving the accuracy and efficiency of RNA-seq analysis (2020) Sarkar, Hirak; Patro, Robert; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The past decade has seen tremendous growth in the area of high-throughput sequencing technology, which has simultaneously improved the biological resolution and subsequent processing of publicly available sequencing datasets. This enormous amount of data also calls for better algorithms to process, extract, and filter useful knowledge from the data. In this thesis, I concentrate on the challenges and solutions related to the processing of bulk RNA-seq data. An RNA-seq dataset consists of raw nucleotide sequences, drawn from the expressed mixture of transcripts in one or more samples. One of the most common uses of RNA-seq is obtaining transcript- or gene-level abundance information from the raw nucleotide read sequences and then using these abundances for downstream analyses such as differential expression. A typical computational pipeline for such processing broadly involves two steps: assigning reads to the reference sequence through alignment or mapping algorithms, and subsequently quantifying such assignments to obtain the expression of the reference transcripts or genes. In practice, this two-step process poses a multitude of challenges, from the presence of noise and experimental artifacts in the raw sequences to the disambiguation of multi-mapped read sequences. In this thesis, I describe these problems and demonstrate efficient state-of-the-art solutions to a number of them. The thesis also explores multiple uses for an alternate representation of an RNA-seq experiment, encoded in equivalence classes and their associated counts. In this representation, instead of treating each read fragment individually, multiple fragments are simultaneously assigned to a set of transcripts depending on the underlying characteristics of the read-to-transcript mapping. I used equivalence classes for a number of applications in both single-cell and bulk RNA-seq technologies. By employing equivalence classes at cellular resolution, I developed a droplet-based single-cell RNA-seq sequence simulator capable of generating tagged-end short-read sequences resembling the properties of real datasets. In bulk RNA-seq, I applied equivalence classes to applications ranging from data-driven compression methodologies to clustering de novo transcriptome assemblies. Specifically, I introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. This approach, implemented in the tool terminus, groups together transcripts in a data-driven manner. It leverages the equivalence class factorization to quickly identify transcripts that share reads, and posterior samples to measure the confidence of the point estimates. As a result, terminus allows transcript-level analysis where it can be confidently supported, and derives transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.
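A minimal sketch of this equivalence-class bookkeeping (with hypothetical reads and transcripts) might look as follows: every fragment mapping to the same set of transcripts is folded into one class with a count, rather than tracked individually.

```python
# Minimal sketch of the equivalence-class representation. The reads and
# transcript names below are toy placeholders.
from collections import Counter

# read id -> set of transcripts the read is compatible with
read_to_txps = {
    "r1": {"txpA"},
    "r2": {"txpA", "txpB"},
    "r3": {"txpA", "txpB"},
    "r4": {"txpB", "txpC"},
    "r5": {"txpA", "txpB"},
}

# Fold reads into (transcript-set, count) equivalence classes.
eq_classes = Counter(frozenset(t) for t in read_to_txps.values())
for label, count in eq_classes.items():
    print(sorted(label), count)
# {txpA}: 1, {txpA, txpB}: 3, {txpB, txpC}: 1 -- three classes summarize five reads
```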
Item Deep Learning with Constraints and Priors for Improved Subject Clustering, Medical Imaging, and Robust Inference (2020) Lin, Wei-An; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep neural networks (DNNs) have achieved significant success in several fields, including computer vision, natural language processing, and robot control. The common philosophy behind these successes is the use of large amounts of annotated data and end-to-end networks, with task-specific constraints and priors implicitly incorporated into the trained model without the need for careful feature engineering. However, DNNs have been shown to be vulnerable to distribution shifts and adversarial perturbations, which indicates that such implicit priors and constraints are not sufficient for real-world applications. In this dissertation, we target three applications and design task-specific constraints and priors for improved performance of deep neural networks. We first study the problem of subject clustering, the task of grouping face images of the same person together. We propose to utilize the prior structure in the feature space of DNNs trained for face identification to design a novel clustering algorithm. Specifically, the clustering algorithm exploits the local neighborhood structure of deep representations through exemplar-based learning based on k-nearest neighbors (k-NN). Extensive experiments show promising results for grouping face images according to subject identity. As an example, we apply the proposed clustering algorithm to automatically curate a large-scale face dataset with noisy labels and show that the performance of face recognition DNNs can be significantly improved by training on the curated dataset. Furthermore, we empirically find that the k-NN rule does not capture proper local structures for deep representations when each subject has very few face images. We then propose to improve upon the exemplar-based approach with a density-aware similarity measure and theoretically show its asymptotic convergence to a density estimator. We conduct experiments on challenging face datasets that show promising results. Second, we study the problem of metal artifact reduction in computed tomography (CT). Unlike typical image restoration tasks such as super-resolution and denoising, metal artifacts in CT images are structured and non-local. Conventional DNNs do not generalize well when metal implants with unseen shapes are presented. We find that the imaging process of CT induces a data-consistency prior that can be exploited for image enhancement. Based on this observation, we propose a dual-domain learning approach to CT metal artifact reduction. We design and implement a novel Radon inversion layer that allows gradients in the image domain to be backpropagated to the projection domain. Experiments conducted on both simulated and clinical datasets show promising results. Compared to conventional DNN-based models, the proposed dual-domain approach leads to impressive metal artifact reduction and has improved generalization capability. Finally, we study the problem of robust classification. In the past few years, the vulnerability of DNNs to small imperceptible perturbations has been widely studied, which raises concerns about the security and robustness of DNNs against possible threat models. To defend against such threats, Samangouei et al. proposed DefenseGAN, a preprocessing approach that removes adversarial perturbations by projecting the input images onto a learned data prior. However, the projection operation in DefenseGAN is time-consuming and may not yield a proper reconstruction when images have complicated textures. We propose an inversion network to constrain the initial estimates of the latent code for input images. With the proposed constraint, the number of optimization steps in DefenseGAN can be reduced while achieving improved accuracy and robustness. Furthermore, we conduct empirical studies on attack methods that have claimed to break DefenseGAN, which show that on-manifold robustness might be the key factor for ensuring adversarial robustness.
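A conceptual sketch of this projection step with an inversion network is shown below. The networks G and E here are untrained stand-ins (in practice G would be a pretrained generator and E an encoder trained to invert it), so this illustrates the control flow rather than the dissertation's architectures:

```python
# Conceptual PyTorch sketch: instead of random restarts, an inversion network
# E proposes the initial latent code, and a few gradient steps refine it.
import torch
import torch.nn as nn

latent_dim, image_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, image_dim))
E = nn.Sequential(nn.Linear(image_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

def project_to_manifold(x, steps=20, lr=0.1):
    z = E(x).detach().requires_grad_(True)   # inversion network supplies the init
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                   # short refinement; fewer steps than
        opt.zero_grad()                      # random-restart projection needs
        loss = ((G(z) - x) ** 2).mean()      # reconstruction error on the manifold
        loss.backward()
        opt.step()
    return G(z).detach()                     # purified input for the classifier

x = torch.randn(1, image_dim)                # stand-in for a (possibly attacked) image
print(project_to_manifold(x).shape)          # torch.Size([1, 64])
```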
Item OPTIMIZATION ALGORITHMS USING PRIORS IN COMPUTER VISION (2018) Shah, Sohil Atul; Goldstein, Tom; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Over the years, many computer vision models, some inspired by human behavior, have been developed for various applications. However, only a handful of them are popular and widely used. Why? There are two major factors: 1) most of these models do not have an efficient numerical algorithm and hence are computationally very expensive; 2) many models, being too generic, cannot capitalize on problem-specific prior information and thus demand rigorous hyper-parameter tuning. In this dissertation, we design fast and efficient algorithms that leverage application-specific priors to solve unsupervised and weakly-supervised problems. Specifically, we focus on developing algorithms that impose structured priors, model priors, and label priors during the inference and/or learning of vision models. In many applications, it is known a priori that a signal is smooth and continuous in space. The first part of this work focuses on improving unsupervised learning mechanisms by explicitly imposing these structured priors in an optimization framework using different regularization schemes. This led to the development of fast algorithms for robust recovery of signals from compressed measurements, image denoising, and data clustering. Moreover, by employing a re-descending robust penalty on the structured regularization terms and applying duality, we reduce our clustering formulation to the optimization of a single continuous objective. This enabled the integration of clustering processes in an end-to-end feature learning pipeline. In the second part of our work, we exploit inherent properties of established models to develop efficient solvers for SDPs, GANs, and semantic segmentation. We consider models for several different problem classes. a) Certain non-convex models in computer vision (e.g., BQP) are popularly solved using convex SDPs after lifting to a high-dimensional space. However, this computationally expensive approach limits these methods to small matrices. We develop a fast and approximate algorithm that directly solves the original non-convex formulation using biconvex relaxations and known rank information. b) Widely popular adversarial networks are difficult to train because they suffer from instability issues: optimizing adversarial networks corresponds to finding a saddle point of a loss function. We propose a simple prediction method that enables faster training of various adversarial networks using larger learning rates without instability problems. c) Semantic segmentation models must learn long-distance contextual information while retaining high spatial resolution at the output. Existing models achieve this at the cost of computationally expensive and memory-exhaustive training and inference. We designed a stacked u-nets model that can repeatedly process top-down and bottom-up features. Our smallest model exceeds ResNet-101 performance on PASCAL VOC 2012 by 4.5% IoU with ∼7× fewer parameters.
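To make the prediction idea in (b) concrete, here is a hedged toy sketch on the bilinear saddle problem f(x, y) = xy, not the GAN training code itself: when updating the adversary, we evaluate its gradient at a predicted opponent iterate x̄ = x_t + (x_t − x_{t−1}), which damps the oscillation that plain alternating gradient steps exhibit.

```python
# Toy saddle-point demo of the prediction step on f(x, y) = x * y,
# whose saddle point is (0, 0). Illustrative only.
lr = 0.1

def run(predict, steps=500):
    x, y = 1.0, 1.0                       # start away from the saddle
    for _ in range(steps):
        x_prev = x
        x = x - lr * y                    # min player descends on f
        x_bar = x + (x - x_prev) if predict else x   # predicted iterate
        y = y + lr * x_bar                # max player ascends against x_bar
    return (x * x + y * y) ** 0.5         # distance from the saddle

print(f"plain alternating steps: {run(predict=False):.3f}")  # stays ~1.4: orbits
print(f"with prediction:         {run(predict=True):.3f}")   # decays toward 0
```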
Next, we address the problem of learning heterogeneous concepts from internet videos using mined label tags. Given a large number of videos, each with multiple concepts and labels, the idea is to teach machines to automatically learn these concepts by leveraging weak labels. We formulate this as a co-clustering problem and develop a novel Bayesian non-parametric weakly-supervised Indian buffet process model which additionally enforces the paired-label prior between concepts. In the final part of this work we consider an inverse approach: learning data priors from a given model. Specifically, we develop a numerically efficient algorithm for estimating the log-likelihood of data samples from GANs. The approximate log-likelihood function is used for outlier detection and data augmentation for training classifiers.

Item Identifying and Comparing Subproblems in Factory Design Processes (2018) Kanagat, Pranay; Herrmann, Jeffrey W.; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
When a design team faces the problem of designing a complex system, they are required to make several decisions. Because such design problems are difficult to solve all at once, teams often decompose the design problem into several smaller subproblems. This thesis discusses the results of a study designed to understand how design teams decompose a factory redesign problem into sets of related subproblems, and it compares the subproblems obtained for each design team. This exploratory study analyzed the design activities of six teams of professionals and used clustering to group the variables that the design teams considered. It was found that the design teams used different decomposition strategies and different subproblems, but they more often considered subproblems with design variables of the same type, and some teams followed a top-down design process.

Item Expressive Knowledge Resources in Probabilistic Models (2014) Hu, Yuening; Boyd-Graber, Jordan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Understanding large collections of unstructured documents remains a persistent problem. Users need to understand the themes of a corpus and to explore documents of interest. Topic models are a useful and ubiquitous tool for discovering the main themes (namely topics) of a corpus. Topic models have been successfully applied in natural language processing, computer vision, information retrieval, cognitive science, and other fields. However, the discovered topics are not always meaningful: some topics confuse two or more themes into one topic; two different topics can be near duplicates; and some topics make no sense at all. Adding knowledge resources to topic models can improve the topics. However, how to encode knowledge into topic models and where to find these knowledge resources remain two scientific challenges. To address these problems, this thesis presents tree-based topic models to encode prior knowledge, a mechanism for incorporating knowledge from untrained users, a polylingual tree-based topic model based on existing dictionaries as knowledge resources, an exploration of regularizing spectral methods to encode prior knowledge into topic models, and a model for automatically building hierarchies of prior knowledge for topic models. To encode knowledge resources into topic models, we first present tree-based topic models, where correlations between word types are modeled as a prior tree and applied to topic models. We also develop more efficient inference algorithms for tree-based topic models.
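The prior-tree idea just described can be sketched in a few lines; the tree, words, and hyperparameters below are toy choices, not the model's. Correlated words sit under a shared internal node with a large Dirichlet concentration, so any topic that gives weight to one of them tends to give weight to the others:

```python
# Minimal numpy sketch of a tree-structured prior over topic-word distributions.
import numpy as np

rng = np.random.default_rng(1)
corr = ["movie", "film"]            # a positive correlation from a knowledge source
other = ["bank", "river", "money"]

def draw_topic():
    # The root chooses between the correlation node and each uncorrelated word.
    root = rng.dirichlet([1.0] * (1 + len(other)))
    # Inside the correlation node, a high concentration keeps the split stable,
    # while the node as a whole can be hot or cold per topic.
    inner = rng.dirichlet([50.0] * len(corr))
    probs = {w: root[0] * inner[i] for i, w in enumerate(corr)}
    probs.update({w: root[1 + i] for i, w in enumerate(other)})
    return probs

for _ in range(3):
    print({w: round(p, 3) for w, p in draw_topic().items()})
# "movie" and "film" rise and fall together across topics, as the prior intends.
```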
Experiments on multiple corpora show that efficiency is greatly improved across different numbers of topics, numbers of correlations, and vocabulary sizes. Because users decide whether the topics are useful or not, users' feedback is necessary for effective topic modeling. We thus propose a mechanism for giving ordinary users a voice in topic models by encoding users' feedback as correlations between word types in tree-based topic models. This framework, interactive topic modeling (ITM), allows untrained users to encode their feedback easily and iteratively into the topic models. We validate the framework with both simulated and real users and discuss strategies for improving the user experience to adapt models to what users need. Existing knowledge resources such as dictionaries can also improve the model. We propose polylingual tree-based topic models based on bilingual dictionaries and apply this model to domain adaptation for statistical machine translation. We derive three different inference schemes and evaluate the efficacy of our model on a Chinese-to-English translation system, obtaining up to a 1.2 BLEU improvement over the machine translation baseline. This thesis further explores an alternative way to encode prior knowledge into topic models: regularizing spectral methods. Spectral methods offer scalable alternatives to Markov chain Monte Carlo and expectation maximization. However, these new methods lack the priors that are associated with probabilistic models. We examine Arora et al.'s anchor algorithm for topic models and encode prior knowledge by regularizing the anchor algorithm to improve the interpretability and generalizability of topic models. Because existing knowledge resources are limited, and because obtaining knowledge from users is expensive and time-consuming, automatic techniques should also be considered for extracting knowledge from the corpus. This thesis therefore presents a Bayesian hierarchical clustering technique with the Beta coalescent, which provides a possible way to build the prior tree automatically. Because of its computational complexity, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render the inference practical and efficient. In summary, this thesis explores sources of prior knowledge, presents different ways to encode these expressive knowledge resources into probabilistic topic models, and applies these models to translation domain adaptation. We also discuss further extensions in the bigger picture of interactive machine learning techniques and domain adaptation for downstream tasks.
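As a hedged, simplified sketch of the anchor-regularization idea mentioned above (toy inputs throughout; the real algorithm also selects the anchors and supports richer, informative priors): each word's co-occurrence row is expressed as a nonnegative combination of the anchor rows, with an L2 penalty folded in as augmented least-squares rows.

```python
# Simplified sketch of anchor-based topic recovery with an L2 regularizer.
import numpy as np
from scipy.optimize import nnls

def recover_loadings(Q_bar, anchors, lam=0.05):
    A = Q_bar[anchors]                       # (K, V): anchor rows span the topics
    K = len(anchors)
    # min_c ||A.T c - q||^2 + lam ||c||^2 with c >= 0, via an augmented system.
    A_aug = np.vstack([A.T, np.sqrt(lam) * np.eye(K)])
    C = np.zeros((Q_bar.shape[0], K))
    for w in range(Q_bar.shape[0]):
        q_aug = np.concatenate([Q_bar[w], np.zeros(K)])
        c, _ = nnls(A_aug, q_aug)
        C[w] = c / max(c.sum(), 1e-12)       # normalize to per-word topic loadings
    return C                                 # Bayes' rule then yields the topics

V = 6
rng = np.random.default_rng(2)
Q = rng.random((V, V))
Q_bar = Q / Q.sum(axis=1, keepdims=True)     # row-normalized co-occurrence (toy)
print(recover_loadings(Q_bar, anchors=[0, 3]).round(2))
```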
Item Semi-supervised and Active Image Clustering with Pairwise Constraints from Humans (2014) Biswas, Arijit; Jacobs, David W.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Clustering images has been an interesting problem for computer vision and machine learning researchers for many years. However, as the number of categories increases, image clustering becomes extremely hard and impractical for many applications. Researchers have proposed several methods that use semi-supervision from humans to improve clustering. Constrained clustering, where users indicate whether an image pair belongs to the same category or not, is a well-known paradigm for semi-supervision. Past research has shown that pairwise constraints have the potential to significantly improve clustering performance. There are two major components to constrained clustering research: how pairwise constraints can be used to improve clustering (e.g., constrained clustering algorithms, distance or metric learning methods) and determining which constraints are most useful for improving clustering (e.g., active or interactive clustering methods). In this thesis we propose three different approaches to improve pairwise constrained clustering, spanning both of these components. First, we propose a distance learning method for non-vector spaces, where the triangle inequality is used to propagate the pairwise constraints to the unsupervised image pairs. This approach can work with any pairwise distance and does not require any vector representation of images. Second, we propose an algorithm for active image-pair selection: a novel method for choosing the most useful pairs to show a person, obtaining constraints that improve clustering. Third, we study how pairwise constraints can effectively be used to cluster large image datasets. Complete clustering of large datasets requires an extremely large number of pairwise constraints and may not be feasible in practice. We propose a new algorithm that clusters only a subset of the images (we call this subclustering), producing a few examples from each class. Subclustering produces smaller but purer clusters and can be used for summarization, category discovery, browsing, image search, etc. Finally, we make use of human input in an active subclustering algorithm to further improve results. We perform experiments on several real-world datasets such as faces, leaves, videos, and scenes, and empirically show that our approaches can advance the state of the art in clustering.
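The triangle-inequality propagation in the first approach can be illustrated with a small hedged sketch (toy distances and constraints; not the thesis's full method): must-link pairs collapse to distance 0, cannot-link pairs are pushed far apart, and an all-pairs shortest-path pass repairs the metric so the supervision also moves unconstrained pairs.

```python
# Sketch of propagating pairwise constraints through the triangle inequality.
import numpy as np

D = np.array([[0.0, 0.8, 0.9],
              [0.8, 0.0, 0.7],
              [0.9, 0.7, 0.0]])

must_link, cannot_link, BIG = [(0, 1)], [], 10.0
for i, j in must_link:
    D[i, j] = D[j, i] = 0.0                  # same category: distance collapses
for i, j in cannot_link:
    D[i, j] = D[j, i] = BIG                  # different category: pushed apart

for k in range(len(D)):                      # Floyd-Warshall metric repair
    D = np.minimum(D, D[:, [k]] + D[[k], :])

print(D)   # d(0, 2) drops to 0.7: the constraint on (0, 1) reached pair (0, 2)
```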
Item Efficient Machine Learning Methods for Document Image Analysis (2013) Kumar, Jayant; Davis, Larry; Doermann, David; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
With the exponential growth in the volume of multimedia content on the internet, there has been increasing interest in developing more efficient and scalable algorithms that learn directly from data without excessive restrictions on the nature of the content. In the context of document images, many large-scale digitization projects have called for reliable and scalable triage methods for the enhancement, segmentation, grouping, and categorization of captured images. Current approaches, however, are typically limited to a specific class of documents, such as scanned books, newspapers, journal articles, or forms, and the analysis and processing of more unconstrained and noisy heterogeneous document collections has not been as widely addressed. Additionally, existing machine-learning based approaches for document processing need to be carefully applied to handle the challenges associated with large and imbalanced training data. In this thesis, we address these challenges in three primary applications of document image analysis: low-level document enhancement, mid-level handwritten line segmentation, and high-level classification and retrieval. We first present a data selection method for training Support Vector Machines (SVMs) on large-scale data sets. We apply the proposed approach to pixel-level document image enhancement and show promising results with a relatively small number of training samples. Second, we present a graph-based method for segmenting handwritten document images into text-lines that is more efficient and adaptive than previous approaches. Our approach demonstrates that combining results from local and global methods enhances the final performance of text-line segmentation. Third, we present an approach to compute structural similarities between images for classification and retrieval. Results on real-world data sets show that the approach is more effective than earlier approaches when the labeled data is limited. We extend our classification approach to a completely unsupervised setting, where both the number of classes and representative samples from each class are assumed to be unknown. We present a method for computing similarities based on learned structural patterns and correlations from the given data. Experiments with four different data sets show that our approach can estimate the number of classes in large document collections and group structurally similar images with high accuracy.
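For flavor, here is a generic illustration of margin-based data selection for SVM training, a common heuristic that may differ from the thesis's actual selection criterion: train on a small random subsample, then keep only the points near the decision boundary for the full training run.

```python
# Generic margin-based data-selection sketch for SVMs (illustrative heuristic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
pilot_idx = rng.choice(len(X), size=500, replace=False)
pilot = LinearSVC(dual=False).fit(X[pilot_idx], y[pilot_idx])  # cheap pilot model

margin = np.abs(pilot.decision_function(X))
keep = margin < 1.0                          # points near the boundary drive the SVM
final = LinearSVC(dual=False).fit(X[keep], y[keep])
print(f"kept {keep.sum()} of {len(X)} samples; train acc {final.score(X, y):.3f}")
```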