Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
4 results
Search Results
Item: Change Detection: Theoretical and Applied Approaches for Providing Updates Related to a Topic of Interest (2024)
Rogers, Kristine M.; Oard, Douglas; Library & Information Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The type of user studied in this dissertation has built up expertise on a topic of interest to them and regularly invests time to find updates on that topic. This research area, referred to within this dissertation as "change detection," includes the user's process of identifying what has changed as well as internalizing the changes into their mental model. For these users who follow a specific topic over time, how might a system organize information to enable them to update their mental model quickly? Current information retrieval systems are largely not optimized for the long-term change detection needs of such users. This dissertation focuses on approaches for enhancing the change detection process for both short documents (e.g., social media posts) and longer documents (e.g., news articles). This mixed-methods exploration of change detection consists of four parts.

First, the dissertation introduces a new theory: the Group-Pile-Arrange (GPA) Change Detection Theory. The theory concerns organizing documents relevant to a topic of interest so as to accelerate an individual's ability to identify changes and update their mental model. Its three components are: 1. Group the documents by theme; 2. Pile the grouped documents into an order; and 3. Arrange the piles in a meaningful way for the user. These steps could be applied in a range of ways, including approaches driven by people (e.g., a research librarian providing information), by computers (e.g., an information retrieval system), or by a hybrid of the two.
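To make the three steps concrete, here is a minimal Python sketch of one possible machine-driven instantiation of GPA; the document fields, the pre-assigned theme labels, and the relevance-based arrangement heuristic are all assumptions made for this illustration (the dissertation itself tested rarity as the arrangement proxy).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical document representation: each doc carries a theme label
# (assumed here to be pre-assigned, e.g., by a clustering step),
# a timestamp, and a relevance score.
docs = [
    {"id": 1, "theme": "funding", "time": datetime(2024, 5, 1), "score": 0.9},
    {"id": 2, "theme": "staffing", "time": datetime(2024, 5, 3), "score": 0.7},
    {"id": 3, "theme": "funding", "time": datetime(2024, 5, 2), "score": 0.6},
]

# 1. Group: partition the documents by theme.
groups = defaultdict(list)
for doc in docs:
    groups[doc["theme"]].append(doc)

# 2. Pile: order the documents within each group
# (reverse chronological, matching the survey preference reported below).
piles = {theme: sorted(g, key=lambda d: d["time"], reverse=True)
         for theme, g in groups.items()}

# 3. Arrange: order the piles for the user; this sketch uses each
# pile's best relevance score as the importance proxy (an assumption,
# not the rarity proxy the dissertation actually evaluated).
arranged = sorted(piles.items(),
                  key=lambda kv: max(d["score"] for d in kv[1]),
                  reverse=True)

for theme, pile in arranged:
    print(theme, [d["id"] for d in pile])
```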
The second part presents the results of a survey on users' sort-order preferences in social media. In this study, change detection was compared with three other use cases: following an event as it happens (experiential), running a search within social media, and browsing social media posts. Respondents recognized the change detection use case, with 66% indicating that they perform change detection tasks on social media sites. When engaged in change detection tasks, these respondents showed a strong preference for posts to be clustered and presented in reverse chronological order, in alignment with the "group" and "pile" components of the GPA Change Detection Theory. These organization preferences were distinct from those for the other studied use cases.

To further understand users' goals and preferences related to change detection, the third part covers the design and prototype implementation of a change detection system called Daybreak. Daybreak presents news articles relevant to a user's topic of interest and allows the user to tag articles and apply tag labels. Based on these tags and tag labels, the system retrieves new results, groups them into subtopic clusters, enables generation of chronological or relevance-based piles of documents, and arranges the piles by subtopic importance; for this study, rarity was used as a proxy for subtopic importance. Daybreak was used in a qualitative user study, analyzed and interpreted with the framework method, in which fifteen participants engaged in a change detection scenario across five simulated "days." The participants leaned heavily on Daybreak's clustering function when viewing results, and showed a weak preference for chronological sorting of documents over relevance ranking. They did not view rarity as an effective proxy for subtopic importance; instead, they preferred approaches that let them indicate which subtopics were of greatest interest, such as pinning certain subtopics.

The fourth and final part describes an evaluation approach for comparing arrangements of subtopic clusters (piles). The approach uses Spearman's rank correlation coefficient to compare a user's ideal subtopic ordering with a variety of system-generated orderings, and includes a sample evaluation using data from the Daybreak user study to demonstrate how a formal evaluation would work.
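As a sketch of how such an ordering comparison can be scored, the following Python snippet computes Spearman's rank correlation between a hypothetical user's ideal pile ordering and a system-generated one; the orderings are invented for illustration and do not come from the Daybreak study data.

```python
from scipy.stats import spearmanr

# Positions of four subtopic piles in the user's ideal arrangement
# versus a system-generated arrangement (toy values).
user_ranks = [1, 2, 3, 4]    # user's ideal order of the piles
system_ranks = [2, 1, 3, 4]  # order produced by the system

# Spearman's rho ranges from -1 (exactly reversed order) to 1
# (identical order); higher means closer agreement with the user.
rho, p_value = spearmanr(user_ranks, system_ranks)
print(f"rho = {rho:.2f}")  # rho = 0.80 for this toy example
```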
Based on the results of these four research components, the GPA Change Detection Theory appears to provide a useful framework for organizing information for individuals engaged in change detection tasks. This research provides insights into users' change detection needs and behaviors that could help in building or extending systems that address this use case.

Item: Multi-Stage Search Architectures for Streaming Documents (2013)
Asadi, Nima; Lin, Jimmy; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The web is becoming more dynamic due to the increasing engagement and contributions of Internet users in the age of social media. A more dynamic web presents new challenges for web search, an important application of information retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content, and in many cases documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service: immediate availability of content for search, and fast ranking, which in turn requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, namely as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Academic IR research therefore largely provides no mechanism for efficiently handling a high-velocity stream of documents, nor does it facilitate real-time ranking.

This dissertation addresses these shortcomings. We present an efficient mechanism for indexing a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism for controlling inverted-list contiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "learning to rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture are candidate generation (top-k retrieval), feature extraction, and document re-ranking. We compare this architecture with a traditional monolithic architecture in which candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top-k retrieval algorithm, document vectors for feature extraction, architecture-conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms for training tree-based LTR models that are fast to evaluate. We also study the efficiency-effectiveness tradeoffs of these techniques and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality.
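To make the staged design concrete, here is a minimal Python sketch of a three-stage pipeline in the spirit of the architecture described above; the toy inverted index, the hand-picked features, and the linear re-ranking model are stand-ins invented for illustration (the dissertation uses optimized top-k retrieval and tree-ensemble LTR models).

```python
import heapq

# Stage 1: candidate generation -- cheap top-k retrieval over an
# inverted index (here a toy dict of term -> [(doc_id, weight)]).
def candidates(index, query_terms, k=10):
    scores = {}
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Stage 2: feature extraction -- computed only for the k candidates,
# which is what separates this design from a monolithic architecture.
def extract_features(doc_id, base_score, doc_store):
    doc = doc_store[doc_id]
    return [base_score, len(doc), doc.count("breaking")]

# Stage 3: re-ranking with a learned model (a stand-in linear model
# here; the dissertation uses tree ensembles trained via LTR).
def rerank(feature_rows, weights=(1.0, -0.01, 2.0)):
    def score(row):
        _, feats = row
        return sum(w * f for w, f in zip(weights, feats))
    return sorted(feature_rows, key=score, reverse=True)

index = {"storm": [(1, 2.0), (2, 0.5)], "power": [(2, 1.5), (3, 1.0)]}
doc_store = {1: "storm breaking news", 2: "storm power outage", 3: "power grid"}

cands = candidates(index, ["storm", "power"])
rows = [(doc_id, extract_features(doc_id, s, doc_store)) for doc_id, s in cands]
print([doc_id for doc_id, _ in rerank(rows)])
```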
Item: Improving Statistical Machine Translation Using Comparable Corpora (2010)
Snover, Matthew Garvey; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by computers, machine translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on learned statistical models that translate a source language into a target language. These models are typically generated from large parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language into the other. Monolingual corpora, containing text in only one language (primarily the target language), are not used to model the translation process, but rather to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available.

Topics and events similar to those in a source document being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, an MT system may be able to use these relevant documents from comparable corpora to guide translation, by biasing the translation system to produce output more similar to the relevant documents. This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages needed during the application of these techniques?

To answer these questions, this thesis describes a method for generating new translation rules from monolingual data specifically targeted at the document being translated. Rule generation leverages the existing translation system and the topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation it does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to the relevant documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process results in new translation rules in which the source side is taken from the document to be translated and the target side is fluent target-language text taken from the monolingual data. The use of these rules yields improvements over a state-of-the-art statistical translation system. These techniques are most effective when there is a high degree of similarity between the source and relevant passages, such as when they report on the same news stories, but some benefit (approximately half) can be achieved when the passages are only historically or topically related. The demonstrated feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation of problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, thus allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.

Item: IDENTITY RESOLUTION IN EMAIL COLLECTIONS (2009)
Elsayed, Tamer Mohamed; Oard, Douglas W.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Access to historically significant email collections poses challenges that arise less often in personal collections. Most notably, people exploring a large collection of emails that they did not send or receive may not be very familiar with the discussions in that collection. They not only need to understand the topical content of those discussions, but would also find it useful to understand who the people sending, receiving, or mentioned in these discussions were. This dissertation tackles the problem of resolving personal identity in the context of large email collections. In such collections, a common name (e.g., John) might easily refer to any one of several hundred people; when one of these people is mentioned in an email, the question then arises: "who is that John?" Resolving the identity of people in an email collection requires solving two problems: (1) modeling the identity of the participants in the collection, and (2) resolving name mentions (appearing in the bodies of messages) to those identities. For the first problem, the dissertation presents a simple computational model of identity, built by extracting unambiguous references to people (e.g., full names from headers, or nicknames from free-text signatures) from the whole collection. For the second problem, it presents a generative probabilistic approach that leverages the model of identity to resolve mentions. The approach is motivated by intuitions about the way people refer to others in email; it expands the context surrounding a mention in four directions: the message in which the mention was observed, the thread that includes that message, topically related messages, and messages sent or received by the original communicating parties. It relies on less ambiguous references (e.g., email addresses or full names) observed in some context of a given mention to rank the potential referents of that mention.
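The following Python sketch illustrates the general shape of this context-expansion approach to ranking referents; the identity model, context types, and additive weighting are simplifications invented for this example rather than the dissertation's generative probabilistic model.

```python
from collections import Counter

# Hypothetical identity model: each identity with the unambiguous
# references (addresses, full names, nicknames) known for it.
identities = {
    "john.smith@example.com": {"John Smith", "john.smith@example.com"},
    "john.doe@example.com": {"John Doe", "jdoe", "john.doe@example.com"},
}

# Four context types, widening around the mention; the unambiguous
# references found in each context and the weights are toy values
# standing in for the learned mixture in the dissertation.
contexts = {
    "message": (["John Smith"], 0.4),
    "thread": (["John Smith", "jdoe"], 0.3),
    "topical": (["John Doe"], 0.2),
    "social": (["John Smith"], 0.1),
}

def resolve(mention="John"):
    scores = Counter()
    for _, (references, weight) in contexts.items():
        for ref in references:
            for identity, known_refs in identities.items():
                # Credit an identity when an unambiguous reference to
                # it co-occurs with the mention in this context.
                if ref in known_refs:
                    scores[identity] += weight
    return scores.most_common()

print(resolve("John"))  # john.smith@example.com ranks first here
```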
To jointly resolve all mentions in the collection, a parallel implementation is presented using the MapReduce distributed-programming framework. The implementation decomposes the resolution process into subcomponents that fit the MapReduce task model well. At the heart of that implementation, a parallel algorithm for efficient computation of pairwise document similarity in large collections is proposed as a general solution that can be used for scalable context expansion of all mentions, as well as for other applications.

The resolution approach compares favorably with previously reported techniques on the small test collections (sets of mention-queries that were manually resolved beforehand) that have been used to evaluate the task in the literature. However, the mention-queries in those collections, besides being relatively few in number, all refer to people for whom a substantial amount of evidence would be expected to be available in the collection, omitting the "long tail" of the identity distribution for which less evidence is available. This motivated the development of a new test collection that is now the largest and best-balanced test collection available for the task. To build this collection, a user study was conducted that also provided insight into the difficulty of the task, how time-consuming it is for humans to perform, and how reliably they perform it. The study revealed that at least 80% of the 584 annotated mentions were resolvable to people who had sent or received email within the same collection.

The new test collection was used to experimentally evaluate the resolution system. The results highlight the importance of the social context (messages sent or received by the original communicating parties) when resolving mentions in email. Moreover, the results show that combining evidence from multiple types of contexts yields better resolution than any individual context can achieve. The one-best selection is correct 74% of the time when tested on the full set of mention-queries, and 51% of the time when tested on the mention-queries labeled as "hard" by the annotators. Experiments with iterative reformulation of the resolution algorithm produced modest gains only for the second iteration of the social context expansion.
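As an illustration of the pairwise document similarity computation described above, here is a minimal in-memory Python simulation of the postings-based MapReduce pattern; the single-process driver and toy documents stand in for a real distributed framework, and term frequency stands in for a real term-weighting scheme.

```python
from collections import defaultdict
from itertools import combinations

docs = {
    "d1": "email archive search",
    "d2": "email identity resolution",
    "d3": "archive search tools",
}

# Phase 1 (indexing): map each document to (term, (doc_id, weight))
# pairs; grouping by term yields an inverted index.
index = defaultdict(list)
for doc_id, text in docs.items():
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, weight in tf.items():
        index[term].append((doc_id, weight))

# Phase 2 (pairwise similarity): the mapper processes one postings
# list at a time, emitting a partial score for every pair of
# documents that share the term; the reducer sums partial scores.
similarity = defaultdict(float)
for term, postings in index.items():
    for (a, wa), (b, wb) in combinations(postings, 2):
        pair = tuple(sorted((a, b)))
        similarity[pair] += wa * wb  # reducer-side sum

for pair, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(pair, score)
```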