Computer Science Theses and Dissertations
Permanent URI for this collectionhttp://hdl.handle.net/1903/2756
Browse
4 results
Search Results
Item EVALUATING CLUSTERING ALGORITHMS TO IDENTIFY SUBPROBLEMS IN DESIGN PROCESSES(2017) Morency, Michael John; Herrmann, Jeffrey W; Systems Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Design problems are inherently intricate and require multiple dependent decisions. Because of these characteristics, design teams generally choose to decompose the main problem into manageable subproblems. This thesis describes the results of a study designed to (a) explore clustering algorithms as a new and repeatable way to identify subproblems in recorded design team discussions, (b) assess the quality of the identified subproblems, and (c) examine any relationships between the subproblems and final design or team experience level. We observed five teams of public health professionals and four teams of undergraduate students and applied four clustering algorithms to identify the team’s subproblems and achieve the aforementioned research goals. The use of clustering algorithms to identify subproblems has not been documented before, and clustering presents a repeatable and objective method for determining a team’s subproblems. The results from these algorithms as well as metrics noting the each result’s quality were captured for all teams. We learned that each clustering algorithm has strengths and weaknesses depending on how the team discussed the problem, but the algorithms always accurately identify at least some of the discussed subproblems. Studying these identified subproblems reveals a team’s design process and provides insight into their final design choices.Item Searching, clustering and evaluating biological sequences(2012) Ghodsi, Mohammadreza; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)The latest generation of biological sequencing technologies have made it possible to generate sequence data faster and cheaper than ever before. The growth of sequence data has been exponential, and so far, has outpaced the rate of improvement of computer speed and capacity. This rate of growth, however, makes analysis of new datasets increasingly difficult, and highlights the need for efficient, scalable and modular software tools. Fortunately most types of analysis of sequence data involve a few fundamental operations. Here we study three such problems, namely searching for local alignments between two sets of sequences, clustering sequences, and evaluating the assemblies made from sequence fragments. We present simple and efficient heuristic algorithms for these problems, as well as open source software tools which implement these algorithms. First, we present approximate seeds; a new type of seed for local alignment search. Approximate seeds are a generalization of exact seeds and spaced seeds, in that they allow for insertions and deletions within the seed. We prove that approximate seeds are completely sensitive. We also show how to efficiently find approximate seeds using a suffix array index of the sequences. Next, we present DNACLUST; a tool for clustering millions of DNA sequence fragments. Although DNACLUST has been primarily made for clustering 16S ribosomal RNA sequences, it can be used for other tasks, such as removing duplicate or near duplicate sequences from a dataset. Finally, we present a framework for comparing (two or more) assemblies built from the same set of reads. Our evaluation requires the set of reads and the assemblies only, and does not require the true genome sequence. Therefore our method can be used in de novo assembly projects, where the true genome is not known. Our score is based on probability theory, and the true genome is expected to obtain the maximum score.Item Collective Entity Resolution In Relational Data(2006-12-11) Bhattacharya, Indrajit; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities. Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as `collective entity resolution'. First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines in multiple real world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query-time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.Item SALAM: A SCALABLE ANCHOR-FREE LOCALIZATION ALGORITHM FOR WIRELESS SENSOR NETWORKS(2006-04-26) Youssef, Adel Amin Abdel Azim; Agrawala, Ashok K; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)In this dissertation, we present SALAM, a scalable anchor-free protocol for localization in wireless sensor networks. SALAM can determine the positions of sensor nodes without any infrastructure support. We assume that each node has the capability to estimate distances to its corresponding neighbors, that are within its transmission range. SALAM allows to trade the accuracy of the estimated position against node transmission range and/or computational power. The application layer can choose from a whole range of different options, to estimate the sensor node's positions with different accuracy while conserving battery power. Scalability is achieved by dividing the network into overlapping multi-hop clusters each with its own cluster head node. Each cluster head is responsible for building a local relative map corresponding to its cluster using intra-cluster node's range measurements. To obtain the global relative topology of the network, the cluster head nodes collaboratively combine their local maps using simple matrix transformations. In order for two cluster heads to perform a matrix transformation, there must be at least three boundary nodes that belongs to both clusters (i.e. the two clusters are overlapping with degree 3). We formulate the overlapping multi-hop clustering problem and present a randomized distributed heuristic algorithm for solving the problem. We evaluate the performance of the proposed algorithm through analytical analysis and simulation. A major problem with multi-hop relative location estimation is the error accumulated in the node position as it becomes multi-hop away from the cluster head node. We analyze different sources of error and develop techniques to avoid these errors. We also show how the local coordinate system (LCS) affects the accuracy and propose different heuristics to select the LCS. Simulation results show that SALAM achieves precise localization of sensors. We show that our approach is scalable in terms of communication overhead regardless of the network size. In addition, we capture the impact of different parameters on the accuracy of the estimated node's positions. The results also show that SALAM is able to achieve accuracy better than the current ad-hoc localization algorithms.