Theses and Dissertations from UMD
Permanent URI for this communityhttp://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a give thesis/dissertation in DRUM
More information is available at Theses and Dissertations at University of Maryland Libraries.
Browse
9 results
Search Results
Item Algorithms and Data Structures for Faster Nearest-Neighbor Classification(2022) Flores Velazco, Alejandro; Mount, David; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Given a set P of n labeled points in a metric space (X,d), the nearest-neighbor rule classifies an unlabeled query point q ∈ X with the class of q's closest point in P. Despite the advent of more sophisticated techniques, nearest-neighbor classification is still fundamental for many machine-learning applications. Over the years, this~has motivated numerous research aiming to reduce its high dependency on the size and dimensionality of the data. This dissertation presents various approaches to reduce the dependency of the nearest-neighbor rule from n to some smaller parameter k, that describes the intrinsic complexity of the class boundaries of P. This is of particular significance as it is usually assumed that k ≪ n on real-world training sets. One natural way to achieve this dependency reduction is to reduce the training set itself, selecting a subset R ⊆ P to be used by the nearest-neighbor rule~to~answer incoming queries, instead of using P. Evidently, this approach would reduce the dependencies of the nearest-neighbor rule from n, the size of P, to the size of R. This dissertation explores different techniques to select subsets whose sizes are proportional to k, and that provide varying degrees of correct classification guarantees. Another alternative involves bypassing training set reduction, and instead building data structures designed to answer classification queries directly. To this end, this dissertation proposes the Chromatic AVD; a Quadtree-based data structure designed to answer ε-approximate nearest-neighbor classification queries. The query time and space complexities of this data structure depend on k_ε; a generalization of k that describes the intrinsic complexity of the ε-approximate class boundaries of P.Item DEVELOPMENT OF A BIOINFORMATICS PLASMID-SEARCH ENGINE FOR CRONOBACTER SPECIES.(2021) Negrete, Flavia; El-Sayeed, Najib NE; Biology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Cronobacter species. are foodborne pathogens that cause serious disease in neonates, infants, and adults. Plasmid classification lays the groundwork for understanding the stable coexistence of various extrachromosomal replicons in a single bacterium, and thus the organization of its genome. This study developed a bioinformatics plasmid-search engine to identify genomic attributes contained on Cronobacter plasmids. A database containing 32 Cronobacter plasmid sequences from all seven Cronobacter species was developed. Another database containing 683 draft and closed plasmids and genomes was also developed. Each strain’s plasmid content was sorted into six different categories based on their genetic attributes: virulence, Type-IV, heavy-metal, cryptic, multi-drug resistant, or mobilization. An in-house BLAST+-python script was used to perform a Linux-BLAST analysis to create a formatted %ID output matrix of plasmid genes. This thesis represents the first bioinformatics plasmid-search engine developed for Cronobacter. Understanding the role of plasmids in virulence and persistence underpins future mitigation strategies to be developed for controlling this pathogen.Item Rich and Scalable Models for Text(2019) nguyen, thang dai; Boyd-Graber, Jordan; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Topic models have become essential tools for uncovering hidden structures in big data. However, the most popular topic model algorithm—Latent Dirichlet Allocation (LDA)— and its extensions suffer from sluggish performance on big datasets. Recently, the machine learning community has attacked this problem using spectral learning approaches such as the moment method with tensor decomposition or matrix factorization. The anchor word algorithm by Arora et al. [2013] has emerged as a more efficient approach to solve a large class of topic modeling problems. The anchor word algorithm is high-speed, and it has a provable theoretical guarantee: it will converge to a global solution given enough number of documents. In this thesis, we present a series of spectral models based on the anchor word algorithm to serve a broader class of datasets and to provide more abundant and more flexible modeling capacity. First, we improve the anchor word algorithm by incorporating various rich priors in the form of appropriate regularization terms. Our new regularized anchor word algorithms produce higher topic quality and provide flexibility to incorporate informed priors, creating the ability to discover topics more suited for external knowledge. Second, we enrich the anchor word algorithm with metadata-based word representation for labeled datasets. Our new supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA on three sentiment datasets. Also, sentiment anchor words, which play a vital role in generating sentiment topics, provide cues to understand sentiment datasets better than unsupervised topic models. Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new, dynamic, active learning framework that combines the concept of informativeness and representativeness of documents using dynamically updating topics from our fast supervised anchor word algorithm. Experiments using three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.Item Recognizing Visual Categories by Commonality and Diversity(2015) Choi, Jonghyun; Davis, Larry Steven; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Visual categories refer to categories of objects or scenes in the computer vision literature. Building a well-performing classifier for visual categories is challenging as it requires a high level of generalization as the categories have large within class variability. We present several methods to build generalizable classifiers for visual categories by exploiting commonality and diversity of labeled samples and the cat- egory definitions to improve category classification accuracy. First, we describe a method to discover and add unlabeled samples from auxil- iary sources to categories of interest for building better classifiers. In the literature, given a pool of unlabeled samples, the samples to be added are usually discovered based on low level visual signatures such as edge statistics or shape or color by an unsupervised or semi-supervised learning framework. This method is inexpensive as it does not require human intervention, but generally does not provide useful information for accuracy improvement as the selected samples are visually similar to the existing set of samples. The samples added by active learning, on the other hand, provide different visual aspects to categories and contribute to learning a better classifier, but are expensive as they need human labeling. To obtain high quality samples with less annotation cost, we present a method to discover and add samples from unlabeled image pools that are visually diverse but coherent to cat- egory definition by using higher level visual aspects, captured by a set of learned attributes. The method significantly improves the classification accuracy over the baselines without human intervention. Second, we describe now to learn an ensemble of classifiers that captures both commonly shared information and diversity among the training samples. To learn such ensemble classifiers, we first discover discriminative sub-categories of the la- beled samples for diversity. We then learn an ensemble of discriminative classifiers with a constraint that minimizes the rank of the stacked matrix of classifiers. The resulting set of classifiers both share the category-wide commonality and preserve diversity of subcategories. The proposed ensemble classifier improves recognition accuracy significantly over the baselines and state-of-the-art subcategory based en- semble classifiers, especially for the challenging categories. Third, we explore the commonality and diversity of semantic relationships of category definitions to improve classification accuracy in an efficient manner. Specif- ically, our classification model identifies the most helpful relational semantic queries to discriminatively refine the model by a small amount of semantic feedback in inter- active iterations. We improve the classification accuracy on challenging categories that have very small numbers of training samples via transferred knowledge from other related categories that have a lager number of training samples by solving a semantically constrained transfer learning optimization problem. Finally, we summarize ideas presented and discuss possible future work.Item FACE RECOGNITION AND VERIFICATION IN UNCONSTRAINED ENVIRIONMENTS(2012) Guo, Huimin; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Face recognition has been a long standing problem in computer vision. General face recognition is challenging because of large appearance variability due to factors including pose, ambient lighting, expression, size of the face, age, and distance from the camera, etc. There are very accurate techniques to perform face recognition in controlled environments, especially when large numbers of samples are available for each face (individual). However, face identification under uncontrolled( unconstrained) environments or with limited training data is still an unsolved problem. There are two face recognition tasks: face identification (who is who in a probe face set, given a gallery face set) and face verification (same or not, given two faces). In this work, we study both face identification and verification in unconstrained environments. Firstly, we propose a face verification framework that combines Partial Least Squares (PLS) and the One-Shot similarity model[1]. The idea is to describe a face with a large feature set combining shape, texture and color information. PLS regression is applied to perform multi-channel feature weighting on this large feature set. Finally the PLS regression is used to compute the similarity score of an image pair by One-Shot learning (using a fixed negative set). Secondly, we study face identification with image sets, where the gallery and probe are sets of face images of an individual. We model a face set by its covariance matrix (COV) which is a natural 2nd-order statistic of a sample set.By exploring an efficient metric for the SPD matrices, i.e., Log-Euclidean Distance (LED), we derive a kernel function that explicitly maps the covariance matrix from the Riemannian manifold to Euclidean space. Then, discriminative learning is performed on the COV manifold: the learning aims to maximize the between-class COV distance and minimize the within-class COV distance. Sparse representation and dictionary learning have been widely used in face recognition, especially when large numbers of samples are available for each face (individual). Sparse coding is promising since it provides a more stable and discriminative face representation. In the last part of our work, we explore sparse coding and dictionary learning for face verification application. More specifically, in one approach, we apply sparse representations to face verification in two ways via a fix reference set as dictionary. In the other approach, we propose a dictionary learning framework with explicit pairwise constraints, which unifies the discriminative dictionary learning for pair matching (face verification) and classification (face recognition) problems.Item Cost-sensitive Information Acquisition in Structured Domains(2010) Bilgic, Mustafa; Getoor, Lise C; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Many real-world prediction tasks require collecting information about the domain entities to achieve better predictive performance. Collecting the additional information is often a costly process that involves acquiring the features describing the entities and annotating the entities with target labels. For example, document collections need to be manually annotated for classification and lab tests need to be ordered for medical diagnosis. Annotating the whole document collection and ordering all possible lab tests might be infeasible due to limited resources. In this thesis, I explore effective and efficient ways of choosing the right features and labels to acquire under limited resources. For the problem of feature acquisition, we are given entities with missing features and the task is to classify them with minimum cost. The likelihood of misclassification can be reduced by acquiring features but acquiring features incurs costs as well. The objective is to acquire the right set of features that balance acquisition and misclassification cost. I introduce a technique that can reduce the space of possible sets of features to consider for acquisition by exploiting the conditional independence properties in the underlying probability distribution. For the problem of label acquisition, I consider two real-world scenarios. In the first one, we are given a previously trained model and a budget determining how many labels we can acquire, and the objective is to determine the right set of labels to acquire so that the accuracy on the remaining ones is maximized. I describe a system that can automatically learn and predict on which entities the underlying classifier is likely to make mistakes and it suggests acquiring the labels of the entities that lie in a high density potentially-misclassified region. In the second scenario, we are given a network of entities that are unlabeled and our objective is to learn a classification model that will have the least future expected error by acquiring minimum number of labels. I describe an active learning technique that can exploit the relationships in the network both to select informative entities to label and to learn a collective classifier that utilizes the label correlations in the network.Item Automated quantification and classification of human kidney microstructures obtained by optical coherence tomography(2009) Li, Qian; Chen, Yu; Bioengineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Optical coherence tomography (OCT) is a rapidly emerging imaging modality that can non-invasively provide cross-sectional, high-resolution images of tissue morphology such as kidney in situ and in real-time. Because the viability of a donor kidney is closely correlated with its tubular morphology, and a large amount of image datasets are expected when using OCT to scan the entire kidney, it is necessary to develop automated image analysis methods to quantify the spatially-resolved morphometric parameters such as tubular diameter, and to classify various microstructures. In this study, we imaged the human kidney in vitro, quantified the diameters of hollow structures such as blood vessels and uriniferous tubules, and classified those structures automatically. The quantification accuracy was validated. This work can enable studies to determine the clinical utility of OCT for kidney imaging, as well as studies to evaluate kidney morphology as a biomarker for assessing kidney's viability prior to transplantation.Item Classifying Attitude by Topic Aspect for English and Chinese Document Collections(2008-04-25) Wu, Yejun; Oard, Douglas W.; Library & Information Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by comparing attitudes on aspects of a topic in English and Chinese document collections. This involves two coupled challenges: topic aspect focus and attitude characterization. The topic aspect focus is specified by using information retrieval techniques to obtain documents on a topic that are of interest to a user and then allowing the user to designate a few segments of those documents to serve as examples for aspects that she wishes to see characterized. A novel feature of this work is that the examples can be drawn from documents in two languages (English and Chinese). A bilingual aspect classifier which applies monolingual and cross-language classification techniques is used to assemble automatically a large set of document segments on those same aspects. A test collection was designed for aspect classification by annotating consecutive sentences in documents from the Topic Detection and Tracking collections as aspect instances. Experiments show that classification effectiveness can often be increased by using training examples from both languages. Attitude characterization is achieved by classifiers which determine the subjectivity and polarity of document segments. Sentence attitude classification is the focus of the experiments in the dissertation because the best presently available test collection for Chinese attitude classification (the NTCIR-6 Chinese Opinion Analysis Pilot Task) is focused on sentence-level classification. A large Chinese sentiment lexicon was constructed by leveraging existing Chinese and English lexical resources, and an existing character-based approach for estimating the semantic orientation of other Chinese words was extended. A shallow linguistic analysis approach was adopted to classify the subjectivity and polarity of a sentence. Using the large sentiment lexicon with appropriate handling of negation, and leveraging sentence subjectivity density, sentence positivity and negativity, the resulting sentence attitude classifier was more effective than the best previously reported systems.Item Global Phenomena from Local Rules: Peer-to-Peer Networks and Crystal Steps(2007-12-04) Finkbiner, Amy; Yorke, James A; Margetis, Dionisios; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Even simple, deterministic rules can generate interesting behavior in dynamical systems. This dissertation examines some real world systems for which fairly simple, locally defined rules yield useful or interesting properties in the system as a whole. In particular, we study routing in peer-to-peer networks and the motion of crystal steps. Peers can vary by three orders of magnitude in their capacities to process network traffic. This heterogeneity inspires our use of "proportionate load balancing," where each peer provides resources in proportion to its individual capacity. We provide an implementation that employs small, local adjustments to bring the entire network into a global balance. Analytically and through simulations, we demonstrate the effectiveness of proportionate load balancing on two routing methods for de Bruijn graphs, introducing a new "reversed" routing method which performs better than standard forward routing in some cases. The prevalence of peer-to-peer applications prompts companies to locate the hosts participating in these networks. We explore the use of supervised machine learning to identify peer-to-peer hosts, without using application-specific information. We introduce a model for "triples," which exploits information about nearly contemporaneous flows to give a statistical picture of a host's activities. We find that triples, together with measurements of inbound vs. outbound traffic, can capture most of the behavior of peer-to-peer hosts. An understanding of crystal surface evolution is important for the development of modern nanoscale electronic devices. The most commonly studied surface features are steps, which form at low temperatures when the crystal is cut close to a plane of symmetry. Step bunching, when steps arrange into widely separated clusters of tightly packed steps, is one important step phenomenon. We analyze a discrete model for crystal steps, in which the motion of each step depends on the two steps on either side of it. We find an time-dependence term for the motion that does not appear in continuum models, and we determine an explicit dependence on step number.