Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4-month delay in the appearance of a given thesis/dissertation in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 8 of 8
  • Temporal Context Modeling for Text Streams
    (2018) Rao, Jinfeng; Lin, Jimmy; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    There is increasing recognition that time plays an essential role in many information seeking tasks. This dissertation explores temporal models on evolving streams of text and the role that such models play in improving information access. I consider two cases: a stream of social media posts by many users for tweet search and a stream of queries by an individual user for voice search. My work explores the relationship between temporal models and context models: for tweet search, the evolution of an event serves as the context for clustering relevant tweets; for voice search, the user's history of queries provides the context for helping understand her true information need. First, I tackle the tweet search problem by modeling the temporal contexts of the underlying collection. The intuition is that an information need in Twitter usually correlates with a breaking news event, so tweets posted during that event are more likely to be relevant. I explore techniques to model two different types of temporal signals: pseudo trend and query trend. The pseudo trend is estimated through the distribution of timestamps from an initial list of retrieved documents given a query, which I model through a continuous hidden Markov approach as well as neural network-based methods for relevance ranking and sequence modeling. As an alternative, the query trend is directly estimated from the temporal statistics of query terms, obviating the need for an initial retrieval. I propose two different approaches to exploit query trends: a linear feature-based ranking model and a regression-based model that recovers the distribution of relevant documents directly from query trends. Extensive experiments on standard Twitter collections demonstrate the superior effectiveness of my proposed techniques. Second, I introduce the novel problem of voice search on an entertainment platform, where users interact with a voice-enabled remote controller through voice requests to search for TV programs. Such queries range from specific program navigation (i.e., watch a movie) to requests with vague intents and even queries that have nothing to do with watching TV. I present successively richer neural network architectures to tackle this challenge based on two key insights: The first is that session context can be exploited to disambiguate queries and recover from ASR errors, which I operationalize with hierarchical recurrent neural networks. The second insight is that query understanding requires evidence integration across multiple related tasks, which I identify as program prediction, intent classification, and query tagging. I present a novel multi-task neural architecture that jointly learns to accomplish all three tasks. The first model, already deployed in production, serves millions of queries daily with an improved customer experience. The multi-task learning model is evaluated in carefully controlled laboratory experiments, which demonstrate further gains in effectiveness and increased system capabilities. This work now serves as the core technology in the Comcast Xfinity X1 entertainment platform, which won an Emmy award in 2017 for the technical contribution in advancing television technologies. This dissertation presents families of techniques for modeling temporal information as contexts to assist applications with streaming inputs, such as tweet search and voice search. My models not only establish state-of-the-art effectiveness on many related tasks, but also reveal insights into how various temporal patterns could impact real information-seeking processes.
  • Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation
    (2013) Ture, Ferhan; Lin, Jimmy; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR in which the search takes place in a multilingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language. Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
  • Semantic integration of geospatial concepts - a study on land use land cover classification systems
    (2011) Wei, Hua; Townshend, John; Geography; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In GI Science, one of the most important needs for interoperability is in land use and land cover (LULC) data, because it is key to the evaluation of LULC's many environmental impacts throughout the globe (Foley et al. 2005). Accordingly, this research aims to address the interoperability of LULC information derived by different authorities using different classificatory approaches. LULC data are described by LULC classification systems. The interoperability of LULC data hinges on the semantic integration of LULC classification systems. Existing work on semantically integrating LULC classification systems has a major drawback in finding comparable semantic representations from textual descriptions. To tackle this problem, we borrowed the method of comparing documents in information retrieval, and applied it to comparing LULC category names and descriptions. The results showed significant improvement compared with previous work. However, lexical semantic methods are not able to resolve the semantic heterogeneities in LULC classification systems: the confounding conflict - LULC categories under similar labels and descriptions have different LULC status in reality, and the naming conflict - LULC categories under different labels represent similar LULC types. Without confirmation of their actual land cover status from remote sensing, lexical semantic methods cannot achieve reliable matching. To discover confounding conflicts and reconcile naming conflicts, we developed an innovative method of applying remote sensing to the integration of LULC classification systems. Remote sensing is a means of observing the actual LULC status of individual parcels. We calculated parcel-level statistics from spectral and textural data, and used these statistics to calculate category similarity. The matching results showed this approach fulfilled its goal - to overcome semantic heterogeneities and achieve more reliable and accurate matching between LULC classifications in the majority of cases. To overcome the limitations of either method, we combined the two by aggregating their output similarities, and achieved better integration. LULC categories that show noticeable differences between lexical semantics and remote sensing once again remind us of semantic heterogeneities in LULC classification systems that must be overcome before LULC data from different sources become interoperable and serve as the key to understanding our highly interrelated Earth system.
  • Long-term Information Preservation and Access
    (2010) Song, Sang Chul; JaJa, Joseph F; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    An unprecedented amount of information encompassing almost every facet of human activities across the world is generated daily in the form of zeros and ones, and that is often the only form in which such information is recorded. A good fraction of this information needs to be preserved for periods of time ranging from a few years to centuries. Consequently, the problem of preserving digital information over the long term has attracted the attention of many organizations, including libraries, government agencies, scientific communities, and individual researchers. In this dissertation, we address three issues that are critical to ensure long-term information preservation and access. The first concerns the core requirement of how to guarantee the integrity of preserved contents. Digital information is in general very fragile because of the many ways errors can be introduced, such as hardware and media degradation, hardware and software malfunction, operational errors, security breaches, and malicious alterations. To address this problem, we develop a new approach based on efficient and rigorous cryptographic techniques, which will guarantee the integrity of preserved contents with extremely high probability even in the presence of malicious attacks. Our prototype implementation of this approach has been deployed and actively used in recent years in several organizations, including the San Diego Supercomputer Center, the Chronopolis Consortium, North Carolina State University, and more recently the Government Printing Office. Second, we consider another crucial component in any preservation system - searching and locating information. The ever-growing size of a long-term archive and the temporality of each preserved item introduce a new set of challenges to providing fast retrieval of content based on a temporal query. The widely-used cataloguing scheme has serious scalability problems. The standard full-text search approach has serious limitations since it does not deal appropriately with the temporal dimension, and, in particular, is incapable of performing relevancy scoring according to the temporal context. To address these problems, we introduce two types of indexing schemes - a location indexing scheme, and a full-text search indexing scheme. Our location indexing scheme provides optimal operations for inserting and locating a specific version of a preserved item given an item ID and a time point, and our full-text search indexing scheme efficiently handles the scalability problem, supporting relevancy scoring within the temporal context at the same time. Finally, we address the problem of organizing inter-related data, so that future accesses and data exploration can be quickly performed. We, in particular, consider web contents, where we combine a link-analysis scheme with a graph partitioning scheme to put together more closely related contents in the same standard web archive container. We conduct experiments that simulate random browsing of preserved contents, and show that our data organization scheme greatly reduces the number of containers that need to be accessed for a random browsing session. Our schemes have been tested against real-world data of significant scale, and validated through extensive empirical evaluations.
  • Relevance, Rhetoric, and Argumentation: A Cross-Disciplinary Inquiry into Patterns of Thinking and Information Structuring
    (2009) Huang, Xiaoli; Soergel, Dagobert; Library & Information Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation research is a multidisciplinary inquiry into topicality, involving an in-depth examination of literatures and empirical data and an inductive development of a faceted typology (containing 227 fine-grained topical relevance relationships and 33 types of presentation relationships). This inquiry investigates a large variety of topical connections beyond topic matching, renders a closer look into the structure of a topic, achieves an enriched understanding of topicality and relevance, and induces a cohesive topic-oriented information architecture that is meaningful across topics and domains. The findings from the analysis contribute to the foundation work of information organization, intellectual access / information retrieval, and knowledge discovery. Using qualitative content analysis, the inquiry focuses on meaning and deep structure: Phase 1: develop a unified theory-grounded typology of topical relevance relationships through close reading of literature and synthesis of thinking from communication, rhetoric, cognitive psychology, education, information science, argumentation, logic, law, medicine, and art history; Phase 2: in-depth qualitative analysis of empirical relevance datasets in oral history, clinical question answering, and art image tagging, to examine manifestations of the theory-grounded typology in various contexts and to further refine the typology; the three relevance datasets were used for analysis to achieve variation in form, domain, and context. The typology of topical relevance relationships is structured with three major facets: Functional role: The role a piece of information plays in the overall structure of a topic or an argument; Mode of reasoning: How information contributes to the user's reasoning about a topic; Semantic relationship: How information connects to a topic semantically. This inquiry demonstrated that topical relevance with its close linkage to thinking and reasoning is central to many disciplines. The multidisciplinary approach allows synthesis and examination from new angles, leading to an integrated scheme of relevance relationships or a system of thinking that informs each individual discipline. The scheme resulting from the synthesis can be used to improve text and image understanding, knowledge organization and retrieval, reasoning, argumentation, and thinking in general, by people and machines.
  • Combining Evidence from Unconstrained Spoken Term Frequency Estimation for Improved Speech Retrieval
    (2008-11-21) Olsson, James Scott; Oard, Douglas W; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation considers the problem of information retrieval in speech. Today's speech retrieval systems generally use a large vocabulary continuous speech recognition system to first hypothesize the words which were spoken. Because these systems have a predefined lexicon, words which fall outside of the lexicon can significantly reduce search quality, as measured by Mean Average Precision (MAP). This is particularly important because these Out-Of-Vocabulary (OOV) words are often rare and therefore good discriminators for topically relevant speech segments. The focus of this dissertation is on handling these out-of-vocabulary query words. The approach is to combine results from a word-based speech retrieval system with those from vocabulary-independent ranked utterance retrieval. The goal of ranked utterance retrieval is to rank speech utterances by the system's confidence that they contain a particular spoken word, which is accomplished by ranking the utterances by the estimated frequency of the word in the utterance. Several new approaches for estimating this frequency are considered, which are motivated by the disparity between reference and errorfully hypothesized phoneme sequences. The first method learns alternate pronunciations or degradations from actual recognition hypotheses and incorporates these variants into a new generative estimator for term frequency. A second method learns transformations of several easily computed features in a discriminative model for the same task. Both methods significantly improved ranked utterance retrieval in an experimental validation on new speech. The best of these ranked utterance retrieval methods is then combined with a word-based speech retrieval system. The combination approach uses a normalization learned in an additive model, which maps the retrieval status values from each system into estimated probabilities of relevance that are easily combined. Using this combination, much of the MAP lost because of OOV words is recovered. Evaluated on a collection of spontaneous, conversational speech, the system recovers 57.5% of the MAP lost on short (title-only) queries and 41.3% on longer (title plus description) queries.
  • Supporting Exploratory Web Search With Meaningful and Stable Categorized Overviews
    (2006-04-28) Kules, Bill; Shneiderman, Ben; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation investigates the use of categorized overviews of web search results, based on meaningful and stable categories, to support exploratory search. When searching in digital libraries and on the Web, users are challenged by the lack of effective overviews. Adding categorized overviews to search results can provide substantial benefits when searchers need to explore, understand, and assess their results. When information needs are evolving or imprecise, categorized overviews can stimulate relevant ideas, provoke illuminating questions, and guide searchers to useful information they might not otherwise find. When searchers need to gather information from multiple perspectives or sources, categorized overviews can make those aspects visible for interactive filtering and exploration. However, they add visual complexity to the interface and increase the number of tactical decisions to be made while examining search results. Two formative studies (N=18 and N=12) investigated how searchers use categorized overviews in the domain of U.S. government web search. A third study (N=24) evaluated categorized overviews of general web search results based on thematic, geographic, and government categories. Participants conducted four exploratory searches during a two-hour session to generate ideas for newspaper articles about specified topics. Results confirmed positive findings from the formative studies, showing that subjects explored deeper while feeling more organized and satisfied, but did not find objective differences in the outcomes of the search task. Results indicated that searchers use categorized overviews based on thematic, geographic, and organizational categories to guide the next steps in their searches. This dissertation identifies lightweight search actions and tactics made possible by adding a categorized overview to a list of web search results. It describes a design space for categorized overviews of search results, and presents a novel application of the brushing and linking technique to enrich search result interfaces with lightweight interactions. It proposes a set of principles, refined by the studies, for the design of exploratory search interfaces, including "Organize overviews around meaningful categories," "Clarify and visualize category structure," and "Tightly couple category labels to search result list." These contributions will be useful to web search researchers and designers, information architects and web developers.
  • Matching Meaning for Cross-Language Information Retrieval
    (2005-12-06) Wang, Jianqiang; Oard, Douglas W; Library & Information Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cross-language information retrieval concerns the problem of finding information in one language in response to search requests expressed in another language. The explosive growth of the World Wide Web, with access to information in many languages, has provided a substantial impetus for research on this important problem. In recent years, significant advances in cross-language retrieval effectiveness have resulted from the application of statistical techniques to estimate accurate translation probabilities for individual terms from automated analysis of human-prepared translations. With few exceptions, however, those results have been obtained by applying evidence about the meaning of terms to translation in one direction at a time (e.g., by translating the queries into the document language). This dissertation introduces a more general framework for the use of translation probability in cross-language information retrieval based on the notion that information retrieval is dependent fundamentally upon matching what the searcher means with what the document author meant. This perspective yields a simple computational formulation that provides a natural way of combining what have been known traditionally as query and document translation. When combined with the use of synonym sets as a computational model of meaning, cross-language search results are obtained using English queries that approximate a strong monolingual baseline for both French and Chinese documents. Two well-known techniques (structured queries and probabilistic structured queries) are also shown to be special cases of this model under restrictive assumptions.