UMD Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/3
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date, so a given thesis or dissertation may take up to four months to appear in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
11 results
Search Results
Item DEVELOPMENT AND APPLICATION OF PROPINQUITY MODELING FRAMEWORK FOR IDENTIFICATION AND ANALYSIS OF EXTREME EVENT PATTERNS (2024) Kholodovsky, Vitaly; Liang, Xin-Zhong; Atmospheric and Oceanic Sciences; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Extreme weather and climate events such as floods, droughts, and heat waves can cause extensive societal damage. While various statistical and climate models have been developed to simulate extremes, a consistent definition of extreme events is still lacking. Furthermore, to better assess the performance of climate models, a variety of spatial forecast verification measures have been developed. However, in most cases, the spatial verification measures that are widely used to compare mean states lack sufficient theoretical justification to benchmark extreme events. To alleviate inconsistencies in defining extreme events across scientific communities, we propose a new generalized Spatio-Temporal Threshold Clustering method for the identification of extreme event episodes, which uses machine learning techniques to couple existing pattern recognition indices with high or low threshold choices. The method consists of five main steps: construction of essential field quantities, dimension reduction, spatial domain mapping, time series clustering, and threshold selection. We develop and apply this method using a gridded daily precipitation dataset derived from rain gauge stations over the contiguous United States. We observe changes in the distribution of conditional frequency of extreme precipitation from large-scale, well-connected spatial patterns to smaller-scale, more isolated rainfall clusters, possibly leading to more localized droughts and heatwaves, especially during the summer months.
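The five main steps above can be illustrated with a toy pipeline. This is a hedged sketch, not the dissertation's code: the data, function names, and the simplistic treatment of steps 2–3 (collapsing each daily field to its area mean stands in for dimension reduction and spatial domain mapping) are all illustrative assumptions.

```python
# Minimal sketch of a spatio-temporal threshold clustering pipeline,
# loosely following the five steps named in the abstract.

def essential_quantity(fields):
    # Step 1 (and a trivial stand-in for steps 2-3): reduce each daily
    # 2-D precipitation field to its area-mean value.
    return [sum(sum(row) for row in day) / (len(day) * len(day[0]))
            for day in fields]

def select_threshold(series, q=0.8):
    # Step 5 (chosen up front here): an empirical quantile as the threshold.
    s = sorted(series)
    return s[int(q * (len(s) - 1))]

def cluster_episodes(series, thr):
    # Step 4: group consecutive exceedance days into extreme-event episodes.
    episodes, current = [], []
    for t, v in enumerate(series):
        if v > thr:
            current.append(t)
        elif current:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes

# Toy 2x2 daily precipitation fields for ten days.
fields = [[[x, x], [x, x]] for x in [1, 1, 9, 10, 1, 1, 8, 1, 1, 1]]
series = essential_quantity(fields)
thr = select_threshold(series, q=0.7)
episodes = cluster_episodes(series, thr)
```

Here the pipeline identifies two episodes: a two-day event (days 2–3) and a one-day event (day 6).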
Additionally, we compare empirical and statistical probabilities and intensities obtained through Conventional Location Specific methods, which lack geometric interconnectivity between individual spatial pixels and assume independence in time, with a new Propinquity modeling framework. We integrate the Spatio-Temporal Threshold Clustering algorithm and the conditional semi-parametric Heffernan and Tawn (2004) model into the Propinquity modeling framework to separate classes of models that can calculate process-level dependence of large-scale extreme processes, primarily through the overall extreme spatial field. Our findings reveal significant differences between the Propinquity and Conventional Location Specific methods, in both the empirical and statistical approaches, in shape and trend direction. We also find that aggregating model results without considering interconnectivity between individual grid cells for trend construction can lead to significant variations in the overall trend pattern and direction compared with models that do account for interconnectivity. Based on these results, we recommend avoiding such practices and instead adopting the Propinquity modeling framework or other spatial EVA models that take into account the interconnectivity between individual grid cells. Our aim for the final application is to establish a connection between extreme essential field quantity intensity fields and large-scale circulation patterns. However, the Conventional Location Specific Threshold methods are not appropriate for this purpose, as they are memoryless in time and unable to identify individual extreme episodes. To overcome this, we developed the Feature Finding Decomposition algorithm and used it in combination with the Propinquity modeling framework. The algorithm consists of three steps: feature finding, image decomposition, and connection to large-scale circulation patterns.
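The "feature finding" step above can be sketched as connected-component extraction on a 2-D intensity field. The function, the 4-neighbour connectivity choice, and the toy field below are illustrative assumptions, not the dissertation's implementation.

```python
# Illustrative sketch of feature finding: extract connected exceedance
# regions ("features") from a 2-D intensity field via flood fill.

def find_features(field, thr):
    """Return 4-connected components of cells with value > thr,
    largest feature first."""
    rows, cols = len(field), len(field[0])
    seen, features = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or field[r][c] <= thr:
                continue
            stack, comp = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols \
                            and (ny, nx) not in seen and field[ny][nx] > thr:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            features.append(sorted(comp))
    # Largest feature first, since the abstract tracks the largest
    # feature's intensity over time.
    return sorted(features, key=len, reverse=True)

field = [
    [0, 5, 5, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 7],
]
feats = find_features(field, thr=1)
```

On this toy field the routine finds one three-cell feature and one isolated single-cell feature.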
Our findings suggest that the Western Pacific Index, particularly its 5th percentile and 5th mode of decomposition, is the most significant teleconnection pattern explaining the variation in the trend pattern of the largest feature intensity.

Item AN INTEGER PROGRAMMING MODEL FOR DYNAMIC TAXI-SHARING CONSIDERING PROVIDER PROFIT (2018) Hao, Yeming; Haghani, Ali; Civil Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis proposes an integer programming model for Dynamic Taxi-Sharing (DTS), which allows two groups of taxi users to ride in the same taxi together. The model matches taxi drivers and user pairs in certain sequences with the goal of maximizing taxi providers’ profit. We also develop a DTS fare calculation scheme which automatically calculates the fare for each DTS user and self-adjusts to balance the taxi occupancy rate in real time. A customized spectral clustering approach for preselecting DTS trips is also designed to narrow the model’s search space. Real-world taxi trip data is used to demonstrate that the DTS system is beneficial to providers, taxi users, and taxi drivers.

Item MACHINERY ANOMALY DETECTION UNDER INDETERMINATE OPERATING CONDITIONS (2018) Tian, Jing; Pecht, Michael; Mechanical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Anomaly detection is a critical task in system health monitoring. Current practice of anomaly detection in machinery systems is still unsatisfactory. One issue is with the use of features. Some features are insensitive to changes in health, and some are redundant with each other. These insensitive and redundant features in the data mislead the detection. Another issue is the influence of operating conditions, where a change in operating conditions can be mistakenly detected as an anomalous state of the system.
Operating conditions are usually changing, and they may not be readily identified. They contribute to false positive detections either through non-predictive features driven by operating conditions, or by influencing predictive features. This dissertation contributes to the reduction of false detections by developing methods to select predictive features and use them to span a space for anomaly detection under indeterminate operating conditions. Available feature selection methods fail to provide consistent results when some features are correlated. A method was developed in this dissertation to explore the correlation structure of features and group correlated features into the same clusters. A representative feature from each cluster is selected to form a non-correlated set of features, from which an optimized subset of predictive features is selected. After feature selection, the influence of operating conditions through non-predictive variables is removed. To remove the influence on predictive features, a clustering-based anomaly detection method is developed. Observations are collected when the system is healthy, and these observations are grouped into clusters corresponding to the states of operating conditions, with automatic estimation of clustering parameters. Anomalies are detected if the test data are not members of the clusters. Correct partitioning of clusters is an open challenge due to the lack of research on the clustering of machinery health monitoring data. This dissertation uses unimodality of the data as a criterion for clustering validation, and a unimodality-based clustering method is developed. The methods of this dissertation were evaluated on simulated data, benchmark data, an experimental study, and field data. They provide consistent results and outperform representative existing methods.
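The grouping of correlated features with one representative per cluster, as described above, can be sketched as follows. The greedy single-pass grouping, the cutoff value, and the toy features are illustrative assumptions standing in for the dissertation's method.

```python
# Minimal sketch: cluster features whose pairwise correlation exceeds a
# cutoff, then keep one representative feature per cluster.

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def group_features(features, cutoff=0.9):
    """Greedy grouping: a feature joins the first cluster whose
    representative it is strongly correlated with."""
    clusters = []  # each cluster is a list of feature names
    for name, series in features.items():
        for cluster in clusters:
            rep = features[cluster[0]]
            if abs(pearson(series, rep)) >= cutoff:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # perfectly correlated with f1
    "f3": [5, 1, 4, 2, 3],    # roughly uncorrelated with f1
}
clusters = group_features(features)
representatives = [c[0] for c in clusters]
```

The redundant pair (f1, f2) collapses into one cluster represented by f1, while f3 survives as its own representative.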
Although the focus of this dissertation is on machinery systems, the methods developed here can be adapted to other application scenarios for anomaly detection, feature selection, and clustering.

Item Spatial and temporal modeling of large-scale brain networks (2017) Najafi, Mahshid; Pessoa, Luiz; Simon, Jonathan Z.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The human brain is the most fascinating and complex organ. It directs all our actions and thoughts. Despite the large body of brain studies, little is known about the neural basis of its large-scale structure. In this dissertation, I take advantage of several network-based and statistical techniques to investigate the spatial and temporal aspects of large-scale functional networks of the human brain during "rest" and "task" conditions using functional MRI data. Large-scale analysis of human brain function has revealed that brain regions can be grouped into networks or communities. Most studies adopt a framework in which brain regions belong to only one community. Yet studies in general fields of knowledge suggest that most complex networks consist of interwoven sets of overlapping communities. A mixed-membership framework can better characterize such complex networks. In this dissertation, I employed a mixed-membership Bayesian model to characterize the overlapping community structure of the brain in both "rest" and "task" conditions. The approach allowed us to quantify how task performance reconfigures brain communities at rest, and to determine the relationship between functional diversity (how diverse is a region's functional activation repertoire) and membership diversity (how diverse is a region's affiliation to communities). Furthermore, I could study the distribution of key regions, named "bridges", in transferring information across brain communities.
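One simple way to quantify "membership diversity" as described above is the normalized entropy of a region's membership vector over communities: a region affiliated with a single community scores 0, and a region affiliated equally with all communities scores 1. The membership vectors below are invented for illustration; this is a sketch of the concept, not the dissertation's estimator.

```python
import math

def membership_diversity(weights):
    """Normalized Shannon entropy of a mixed-membership vector."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) <= 1:
        return 0.0  # all mass on one community: no diversity
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(weights))  # 1.0 = uniform over all communities

# Hypothetical membership vectors for two brain regions over 3 communities.
regions = {"A": [1.0, 0.0, 0.0], "B": [0.4, 0.3, 0.3]}
diversity = {r: membership_diversity(w) for r, w in regions.items()}
```

Region A belongs wholly to one community (diversity 0), while region B spreads its affiliation almost evenly and scores near 1 — the kind of region the abstract calls a "bridge".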
Our findings revealed that the overlapping framework described the brain in ways not captured by disjoint clustering, and thus provided a richer landscape of large-scale brain networks. Overall, I suggest that overlapping networks are better suited to capture the flexible and task-dependent mapping between brain regions and their functions. Finally, I developed a dynamic intersubject network analysis technique to study the temporal changes of the emotional brain at the level of large-scale brain networks, by formulating a manipulation in which threat levels varied continuously during the experiment. Our results illustrate that cohesion within and between networks changed dynamically with threat level. Together, our findings reveal that emotional processing should be characterized at the level of distributed networks, and not simply at the level of evoked responses in specific brain regions.

Item EVALUATING CLUSTERING ALGORITHMS TO IDENTIFY SUBPROBLEMS IN DESIGN PROCESSES (2017) Morency, Michael John; Herrmann, Jeffrey W; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Design problems are inherently intricate and require multiple dependent decisions. Because of these characteristics, design teams generally choose to decompose the main problem into manageable subproblems. This thesis describes the results of a study designed to (a) explore clustering algorithms as a new and repeatable way to identify subproblems in recorded design team discussions, (b) assess the quality of the identified subproblems, and (c) examine any relationships between the subproblems and the final design or team experience level. We observed five teams of public health professionals and four teams of undergraduate students and applied four clustering algorithms to identify each team’s subproblems and achieve the aforementioned research goals.
The use of clustering algorithms to identify subproblems has not been documented before, and clustering presents a repeatable and objective method for determining a team’s subproblems. The results from these algorithms, as well as metrics noting each result’s quality, were captured for all teams. We learned that each clustering algorithm has strengths and weaknesses depending on how the team discussed the problem, but the algorithms always accurately identify at least some of the discussed subproblems. Studying these identified subproblems reveals a team’s design process and provides insight into its final design choices.

Item Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput mRNA-Seq (2015) Gibbons, Theodore Robert; Delwiche, Charles F; Cell Biology & Molecular Genetics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Dinoflagellates are a diverse and ancient lineage of globally abundant algae that have adapted to fill a diverse array of important ecological roles. Despite their importance, dinoflagellate genomes remain relatively poorly understood because of their enormous size. It is suspected that dinoflagellate genomes have expanded through rampant gene duplication, possibly using a lineage-specific mechanism that involves reinsertion of mature transcripts back into the genome, and that may rely on spliced leader trans-splicing for reactivation and processing of recycled transcripts. Draft genomes have recently been published for two extremely small endosymbiotic species. These genomes confirm expansion of nearly 10,000 gene families relative to other eukaryotes. In the more complete genome, evidence for transcript recycling based on relict spliced leader sequences was found in over 5,500 genes.
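Searching genes for relict spliced-leader (SL) evidence, as described above, amounts to checking whether a terminal suffix of the SL sequence appears at the 5' end of a gene model. The SL sequence, gene sequences, and minimum suffix length below are made up for illustration; this is a sketch of the idea, not the dissertation's pipeline.

```python
# Hedged sketch: scan the 5' end of each gene for the longest suffix of a
# (hypothetical) spliced-leader sequence, down to a minimum length.

SL = "CCGTAGCCATTTTGGCTCAAG"  # illustrative SL sequence, not the real one

def sl_suffix_at_start(seq, min_len=6):
    """Return the longest SL suffix found at the start of seq, or None."""
    for k in range(len(SL), min_len - 1, -1):  # try longest suffix first
        if seq.startswith(SL[-k:]):
            return SL[-k:]
    return None

genes = {
    "g1": "TTTTGGCTCAAGATGGCTAAA",  # begins with a 12-nt relict SL suffix
    "g2": "ATGGCTAAACCC",           # no SL evidence
}
hits = {g: sl_suffix_at_start(s) for g, s in genes.items()}
```

Gene g1 is flagged with a 12-nucleotide relict SL suffix while g2 yields no hit, mirroring how relict SL evidence separates recycled from ordinary gene models.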
Genomic efforts in larger dinoflagellates have focused instead on transcriptome sequencing, but transcriptomes assembled from short-read HTS data contain very little evidence for rampant gene duplication, or for trans-splicing. I have shown that the apparent disagreement with hypotheses of ubiquitous trans-splicing and widespread gene duplication is the result of technological limitations. By leveraging the statistical power of high-throughput sequencing, I found that spliced leader suffixes as short as six nucleotides are sufficient for positive identification. I also found that isoform sequences from families of conserved paralogs are systematically collapsed during assembly, but that many of these consensus sequences can be identified using a custom SNP-calling procedure, which can be combined with traditional clustering based on pairwise sequence alignment to obtain a more complete picture of gene duplication in dinoflagellates. Efficient, automated homology detection based on pairwise sequence alignment is an equally challenging problem with much room for improvement. I explored alternative metrics for scoring alignments between sequences using a popular procedure based on BLAST and Markov clustering, and showed that simplified metrics perform as well as or better than more popular alternatives. I also found that Markov clustering of protein sequences suffers from a serious false positive problem when compared against manual curation, suggesting that it is more appropriate for pre-clustering of very large data sets than as a complete clustering solution.

Item Data Representation for Learning and Information Fusion in Bioinformatics (2013) Rajapakse, Vinodh Nalin; Czaja, Wojciech; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis.
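The Markov clustering (MCL) procedure referenced in the preceding abstract alternates "expansion" (matrix squaring, which spreads flow along paths) with "inflation" (entrywise powering plus renormalization, which sharpens strong connections) on a column-stochastic similarity matrix. A toy pure-Python sketch; the graph, parameters, and the simple attractor-based cluster readout are illustrative, not the dissertation's implementation.

```python
# Toy Markov clustering (MCL) on a small similarity graph.

def normalize(m):
    """Column-normalize so each column is a probability distribution."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= s
    return m

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adj, inflation=2.0, iters=50):
    n = len(adj)
    # Add self-loops, then make the matrix column-stochastic.
    m = normalize([[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                   for i in range(n)])
    for _ in range(iters):
        m = matmul(m, m)                                              # expansion
        m = normalize([[v ** inflation for v in row] for row in m])   # inflation
    # Read clusters from rows that retain mass on the diagonal.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)

# Two disjoint triangles: a similarity graph with two obvious families.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adj)
```

On this graph MCL recovers the two families exactly; the abstract's point is that on real protein similarity graphs, which are far noisier, this procedure tends to over-merge and is better treated as a pre-clustering step.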
The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While the asymptotic manifold recovery behavior of such methods has been well characterized, the clustering properties of Laplacian embeddings with finite data are largely supported by heuristic arguments. We develop a precise bound characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data.

Item Searching, clustering and evaluating biological sequences (2012) Ghodsi, Mohammadreza; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The latest generation of biological sequencing technologies has made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponential and has so far outpaced the rate of improvement of computer speed and capacity. This rate of growth, however, makes analysis of new datasets increasingly difficult, and highlights the need for efficient, scalable, and modular software tools. Fortunately, most types of analysis of sequence data involve a few fundamental operations. Here we study three such problems: searching for local alignments between two sets of sequences, clustering sequences, and evaluating the assemblies made from sequence fragments. We present simple and efficient heuristic algorithms for these problems, as well as open source software tools which implement them. First, we present approximate seeds, a new type of seed for local alignment search.
Approximate seeds are a generalization of exact seeds and spaced seeds, in that they allow for insertions and deletions within the seed. We prove that approximate seeds are completely sensitive. We also show how to efficiently find approximate seeds using a suffix array index of the sequences. Next, we present DNACLUST, a tool for clustering millions of DNA sequence fragments. Although DNACLUST was primarily designed for clustering 16S ribosomal RNA sequences, it can be used for other tasks, such as removing duplicate or near-duplicate sequences from a dataset. Finally, we present a framework for comparing (two or more) assemblies built from the same set of reads. Our evaluation requires only the set of reads and the assemblies, and does not require the true genome sequence. Therefore our method can be used in de novo assembly projects, where the true genome is not known. Our score is based on probability theory, and the true genome is expected to obtain the maximum score.

Item Location Choice, Product Choice, and the Human Resource Practices of Firms (2007-05-10) Freedman, Matthew L.; Haltiwanger, John C; Economics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis is comprised of three chapters. The first investigates the implications of industrial clustering for labor mobility and earnings dynamics. Motivated by a theoretical model in which geographically clustered firms compete for workers, I exploit establishment-level variation in agglomeration to explore the impact of clustering in the software publishing industry on labor market outcomes. The results show that clustering makes it easier for workers to job hop among establishments within the sector. Further, workers in clusters have relatively steep earnings-tenure profiles, accepting lower wages early in their careers in exchange for stronger earnings growth and higher wages later.
These findings underscore the importance of geography in understanding labor market dynamics within industries. While the first chapter reveals striking relationships between the human resource practices and location decisions of high-technology establishments, the second chapter (joint with F. Andersson, J. Haltiwanger, J. Lane, and K. Shaw) draws key connections between the hiring and compensation policies of innovative firms and the nature of their product markets. We show that software firms that operate in product markets with highly skewed returns to innovation pay a premium to attract talented workers. Yet these same firms also reward loyalty; that is, highly skilled workers faithful to their employers enjoy higher earnings in firms with a greater variance in potential payoffs from innovation. These results not only contribute to our knowledge of firm human resource practices and product market strategies, but also shed light on patterns of income inequality within and between industries. Building on this final idea, the last chapter (joint with F. Andersson, E. Davis, J. Lane, B. McCall, and L. Sandusky) examines the contribution of worker and firm reallocation to within-industry changes in earnings inequality. We find that the entry and exit of firms and the sorting of workers and firms based on worker skills are key determinants of changes in industry earnings distributions over time. However, the importance of these and other factors in driving observed dynamics in earnings inequality varies across sectors, with aggregate shifts often disguising fundamental differences in the underlying forces effecting change.

Item Collective Entity Resolution In Relational Data (2006-12-11) Bhattacharya, Indrajit; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people.
But different people can go by the same name, and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities. Traditional entity resolution approaches consider approximate matches between the attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as `collective entity resolution'. First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references while taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines on multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution, and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query time, while preserving the performance benefits of collective resolution.
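The core relational-clustering idea above can be sketched as a similarity that combines attribute agreement with overlap between the co-occurrence neighborhoods of clusters. The greedy merge rule, the weights, and the toy references below are invented for illustration; the dissertation's algorithm is iterative and more sophisticated.

```python
# Toy sketch of collective entity resolution: attribute similarity plus
# relational (co-occurrence) similarity drives cluster assignment.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def resolve(references, cooccur, alpha=0.5, threshold=0.6):
    """Greedy resolution: a reference joins the first cluster whose combined
    attribute + relational similarity clears the threshold."""
    clusters = []  # each: {"refs": set, "names": set, "neighbors": set}
    for ref, name in references.items():
        target = None
        for cl in clusters:
            attr = 1.0 if name in cl["names"] else 0.0
            rel = jaccard(cooccur[ref], cl["neighbors"])
            if alpha * attr + (1 - alpha) * rel >= threshold:
                target = cl
                break
        if target is None:
            target = {"refs": set(), "names": set(), "neighbors": set()}
            clusters.append(target)
        target["refs"].add(ref)
        target["names"].add(name)
        target["neighbors"] |= cooccur[ref]
    return [sorted(cl["refs"]) for cl in clusters]

# Three references sharing one name; r1 and r2 co-occur with the same
# collaborators, r3 with entirely different ones.
references = {"r1": "J. Smith", "r2": "J. Smith", "r3": "J. Smith"}
cooccur = {"r1": {"a", "b"}, "r2": {"a", "b"}, "r3": {"x", "y"}}
clusters = resolve(references, cooccur)
```

A name-only matcher would merge all three references; here the co-occurrence evidence keeps r3 separate, illustrating why collective resolution outperforms attribute-only baselines.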
Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation that use multiple languages and outperform other unsupervised approaches.