Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 8 of 8
  • Evaluating Machine Intelligence with Question Answering
    (2021) Rodriguez, Pedro; Boyd-Graber, Jordan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Humans ask questions to learn about the world and to test knowledge understanding. The ability to ask questions combines aspects of intelligence unique to humans: language understanding, knowledge representation, and reasoning. Thus, building systems capable of intelligent question answering (QA) is a grand goal of natural language processing (NLP). To measure progress in NLP, we create "exams" for computer systems and compare their effectiveness against a reference point---often based on humans. How precisely we measure progress depends on whether we are building computer systems that optimize human satisfaction in information-seeking tasks or that measure progress towards intelligent QA. In the first part of this dissertation, we explore each goal in turn, how they differ, and describe their relationship to QA formats. As an example of an information-seeking evaluation, we introduce a new dialog QA task paired with a new evaluation method. Afterward, we turn our attention to using QA to evaluate machine intelligence.

    A good evaluation should be able to discriminate between lesser and more capable QA models. This dissertation explores three ways to improve the discriminative power of QA evaluations: (1) dynamic weighting of test questions, (2) a format that by construction tests multiple levels of knowledge, and (3) evaluation data that is created through human-computer collaboration. By dynamically weighting test questions, we challenge a foundational assumption of the de facto standard in QA evaluation---the leaderboard. Namely, we contend that, contrary to nearly all QA and NLP evaluations, which implicitly assign equal weights to examples by averaging scores, examples are not equally useful for estimating machine (or human) QA ability. As any student may tell you, not all questions on an exam are equally difficult, and in the worst case questions are unsolvable. Drawing on decades of research in educational testing, we propose adopting an alternative evaluation methodology---Item Response Theory---that is widely used to score human exams (e.g., the SAT). We show that dynamically weighting questions improves the reliability of leaderboards in discriminating between models of differing QA ability while also being helpful in the construction of new evaluation datasets.

    Having improved the scoring of models, we next turn to improving the format and data in QA evaluations. Our idea is simple. In most QA tasks (e.g., Jeopardy!), each question tests a single level of knowledge; in our task (the trivia game Quizbowl), we test multiple levels of knowledge with each question. Since each question tests multiple levels of knowledge, this decreases the likelihood that we learn nothing about the difference between two models (i.e., they are both correct or both wrong), which substantially increases discriminative power. Despite the improved format, we next show that while our QA models defeat accomplished trivia players, they are overly reliant on brittle pattern matching, which indicates a failure to intelligently answer questions. To mitigate this problem, we introduce a new framework for building evaluation data where humans and machines cooperatively craft trivia questions that are difficult to answer through clever pattern-matching tricks alone---while being no harder for humans. We conclude by sketching a broader vision for QA evaluation that combines the three components of evaluation we improve---scoring, format, and data---to create living evaluations and re-imagine the role of leaderboards.
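
    The Item Response Theory approach mentioned above can be illustrated with a minimal two-parameter logistic (2PL) sketch. The function names, item parameters, and response patterns below are hypothetical, chosen only to show why IRT-style scoring can rank two systems with identical raw accuracy differently; the dissertation's actual estimation procedure may differ.

        # Minimal 2PL Item Response Theory sketch (illustrative; parameters and data
        # are hypothetical, not from the dissertation). Each question i has a
        # difficulty b_i and a discrimination a_i; a model with ability theta answers
        # it correctly with probability sigmoid(a_i * (theta - b_i)).
        import math

        def p_correct(theta, a, b):
            """Probability that a model of ability theta answers an item correctly."""
            return 1.0 / (1.0 + math.exp(-a * (theta - b)))

        def log_likelihood(theta, items, responses):
            """Log-likelihood of a response pattern given ability theta."""
            ll = 0.0
            for (a, b), correct in zip(items, responses):
                p = p_correct(theta, a, b)
                ll += math.log(p) if correct else math.log(1.0 - p)
            return ll

        def estimate_ability(items, responses, grid=None):
            """Grid-search maximum-likelihood estimate of ability."""
            grid = grid or [x / 10.0 for x in range(-40, 41)]
            return max(grid, key=lambda t: log_likelihood(t, items, responses))

        # Two models with the same raw accuracy need not receive the same ability
        # estimate, because harder, more discriminative items carry more weight.
        items = [(1.2, -1.0), (0.9, 0.0), (1.5, 1.0), (2.0, 2.0)]  # (a_i, b_i), toy values
        model_a = [1, 1, 1, 0]  # misses only the hardest item
        model_b = [0, 1, 1, 1]  # misses the easiest item
        print(estimate_ability(items, model_a), estimate_ability(items, model_b))
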
  • On the Social Consequences of the Desire for Motion
    (2016) Chernikova, Marina; Kruglanski, Arie; Psychology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Three studies investigated the effects of locomotion regulatory mode on individuals’ evaluations of social partners who disrupt the smooth forward motion of a social interaction. Locomotion was expected to increase individuals’ preference for smooth motion in social interactions. In turn, that preference was expected to lead to less positive evaluations of listeners who disrupted the “flow” of a social interaction. The results generally did not confirm the predictions. Theoretical and practical implications of the studies, as well as future directions for the research, are discussed.
  • A COMPARISON OF EX-ANTE, LABORATORY, AND FIELD METHODS FOR EVALUATING SURVEY QUESTIONS
    (2014) Maitland, Aaron; Presser, Stanley; Survey Methodology; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A diverse range of evaluation methods is available for detecting measurement error in survey questions. Ex-ante question evaluation methods are relatively inexpensive, because they do not require data collection from survey respondents. Other methods require data collection from respondents either in the laboratory or in the field setting. Research has explored how effective some of these methods are at identifying problems with respect to one another. However, a weakness of most of these studies is that they do not compare the range of question evaluation methods that are currently available to researchers. The purpose of this dissertation is to understand how the methods researchers use to evaluate survey questions influence the conclusions they draw about the questions. In addition, the dissertation seeks to identify more effective ways to use the methods together. It consists of three studies. The first study examines the extent of agreement between ex-ante and laboratory methods in identifying problems and compares the methods in how well they predict differences between questions whose validity has been estimated in record-check studies. The second study evaluates the extent to which ex-ante and laboratory methods predict the performance of questions in the field as measured by indirect assessments of data quality such as behavior coding, response latency and item nonresponse. The third study evaluates the extent to which ex-ante, laboratory, and field methods predict the reliability of answers to survey questions as measured by stability over time. The findings suggest (1) that a multiple method approach to question evaluation is the best strategy given differences in the ability to detect different types of problems between the methods and (2) how to combine methods more effectively in the future.
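
    As a rough illustration of the stability criterion used in the third study, test-retest reliability can be summarized by correlating the answers the same respondents give to a question at two time points. The toy data and the use of a simple Pearson correlation below are assumptions for illustration, not details taken from the dissertation.

        # Hypothetical sketch: test-retest stability as a simple reliability criterion.
        # wave1 and wave2 hold the same respondents' answers to one survey question.
        from statistics import correlation  # available in Python 3.10+

        wave1 = [3, 5, 2, 4, 4, 1, 5, 3]   # hypothetical answers at time 1
        wave2 = [3, 4, 2, 4, 5, 1, 5, 2]   # hypothetical answers at time 2

        stability = correlation(wave1, wave2)
        print(f"test-retest stability: {stability:.2f}")
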
  • The Effectiveness of School Based Intensive Probation for Reducing Recidivism: An Evaluation of Maryland's Spotlight on Schools Program
    (2011) Frederique, Nadine P.; Gottfredson, Denise C; Criminology and Criminal Justice; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    School Based Probation programs provide intensive supervision for juvenile probationers by placing probation officers in high schools. However, they have yet to undergo rigorous evaluation. Previous evaluations suffered from methodological flaws and have presented inconsistent findings. The state of Maryland began its SBP program, called Spotlight on Schools (SOS), in the 1990s. It is now used in many schools throughout the state. SOS has never been formally assessed. This dissertation presents results from a quasi-experimental non-equivalent group study examining the recidivism rates of students in schools with and without this probation program. I address the flaws of previous evaluations by using two statistical methods. First, I use multi-level modeling to predict school-level recidivism while controlling for statistically relevant individual-level and school-level characteristics. Second, I use survival analysis to determine if juveniles on SBP experience a longer time in the community before recidivism. These analyses are supplemented with interviews of school principals and probation officers. Results from the multi-level modeling and survival analysis indicate that school participation in the SOS program is not significantly related to likelihood of recidivism or the seriousness of recidivism. Seven of the eight outcome variables assessed in this evaluation are not significantly related to participation in the SOS program. This study joins a long list of intensive supervision evaluations that suggest that these programs have no significant impact on juvenile recidivism.
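
    A minimal sketch of the survival-analysis step described above is shown here, assuming the Python lifelines library; the column names and toy records are hypothetical and are not drawn from the evaluation data.

        # Hypothetical sketch: time to recidivism for juveniles in SOS schools versus
        # comparison schools, summarized with Kaplan-Meier curves and a log-rank test.
        import pandas as pd
        from lifelines import KaplanMeierFitter
        from lifelines.statistics import logrank_test

        df = pd.DataFrame({
            "days_to_recidivism": [120, 400, 365, 90, 300, 365, 45, 365],
            "recidivated":        [1,   0,   0,   1,  1,   0,   1,  0],   # 0 = censored
            "sos_school":         [1,   1,   1,   1,  0,   0,   0,  0],
        })

        kmf = KaplanMeierFitter()
        for group, label in [(1, "SOS"), (0, "comparison")]:
            sub = df[df["sos_school"] == group]
            kmf.fit(sub["days_to_recidivism"], event_observed=sub["recidivated"], label=label)
            print(label, "median time to recidivism:", kmf.median_survival_time_)

        result = logrank_test(
            df.loc[df.sos_school == 1, "days_to_recidivism"],
            df.loc[df.sos_school == 0, "days_to_recidivism"],
            event_observed_A=df.loc[df.sos_school == 1, "recidivated"],
            event_observed_B=df.loc[df.sos_school == 0, "recidivated"],
        )
        print("log-rank p-value:", result.p_value)
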
  • Outcomes of an elementary grades social competence experiment according to student self-report
    (2008-06-30) Harak, Elise Touris; Gottfredson, Gary D; Counseling and Personnel Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Problem behaviors that emerge in early childhood often persist through adolescence. Evaluations provide evidence that social skills programs in elementary schools can reduce student aggression. There is some evidence that social skills programs also increase social skills, academic commitment, and achievement. Outcome evaluations have more often focused on aggression than on social skills and academics, however. The present study is a randomized, controlled trial evaluating the effects of one popular social skills instructional program, Second Step, in six treatment and six control schools after two years of implementation. Despite the widespread use of Second Step, few evaluations have assessed its effects. The existing evaluations have either (a) lacked randomization, (b) had small samples, (c) not measured implementation, or (d) been implemented for one year or less. In the present evaluation, implementation data were collected from all teachers as each lesson was completed. Overall implementation was high across two years. Treatment effects were assessed on nine self-report measures, including Engagement in Learning, prosocial behaviors (Altruism, Empathy, and Self-Restraint), and problem behaviors and attitudes (Rebellious Behavior, Aggression, Victimization, Acceptability of Aggression, and Hostile Attribution Bias). Analyses conducted using hierarchical linear modeling (HLM) indicated that treatment did not statistically significantly affect individual student self-reports, net of individual characteristics. In almost all cases, the non-significant estimates of treatment effects were in the desired direction but mirrored non-significant pre-intervention differences.
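
    The hierarchical-linear-model step can be sketched as a random-intercept model with students nested within schools; the statsmodels call, variable names, and toy data below are illustrative assumptions, not the study's actual specification.

        # Hypothetical sketch: students (level 1) nested in schools (level 2), with a
        # school-level treatment indicator and a pretest covariate. The toy data are
        # far too small for a real analysis and may trigger convergence warnings.
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.DataFrame({
            "aggression_post": [2.1, 1.8, 2.0, 2.5, 2.2, 2.4, 2.6, 2.4, 2.3, 2.2, 2.5, 2.1],
            "aggression_pre":  [2.0, 1.9, 2.1, 2.4, 2.1, 2.3, 2.5, 2.3, 2.2, 2.1, 2.4, 2.2],
            "treatment":       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],   # Second Step school?
            "school":          ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"],
        })

        # Random intercept for school; fixed effects for treatment and the pretest score.
        model = smf.mixedlm("aggression_post ~ treatment + aggression_pre", df, groups=df["school"])
        result = model.fit()
        print(result.summary())
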
  • Evaluating Host Intrusion Detection Systems
    (2007-11-28) Molina, Jesus; Cukier, Michel; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Host Intrusion Detection Systems (HIDSs) are critical tools needed to provide in-depth security to computer systems. Quantitative metrics for HIDSs are necessary for comparing HIDSs or determining the optimal operational point of a HIDS. While HIDSs and Network Intrusion Detection Systems (NIDSs) greatly differ, similar evaluations have been performed on both types of IDSs by assessing metrics associated with the classification algorithm (e.g., true positives, false positives). This dissertation motivates the necessity of additional characteristics to better describe the performance and effectiveness of HIDSs. The proposed additional characteristics are the ability to collect data where an attack manifests (visibility), the ability of the HIDS to resist attacks in the event of an intrusion (attack resiliency), the ability to detect attacks in a timely manner (efficiency), and the ability of the HIDS to avoid interfering with the normal functioning of the system under supervision (transparency). For each characteristic, we propose corresponding quantitative evaluation metrics. To measure the effect of visibility on the detection of attacks, we introduce the probability of attack manifestation and metrics related to data quality (i.e., relevance of the data regarding the attack to be detected). The metrics were applied empirically to evaluate filesystem data, which is the data source for many HIDSs. To evaluate attack resiliency, we introduce the probability of subversion, which we estimate by measuring the isolation between the HIDS and the system under supervision. Additionally, we provide methods to evaluate time delays for efficiency and performance overhead for transparency. The proposed evaluation methods are then applied to compare two HIDSs. Finally, we show how to integrate the proposed measurements into a cost framework. First, mapping functions are established to link operational costs of the HIDS with the metrics proposed for efficiency and transparency. Then we show how the number of attacks detected by the HIDS depends not only on detection accuracy but also on the evaluation results of visibility and attack resiliency.
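
    The closing point, that detections depend on visibility and attack resiliency as well as classifier accuracy, can be sketched with a toy expected-cost calculation. The formulas, parameter names, and numbers below are a simplified illustration under independence assumptions, not the dissertation's actual cost framework or mapping functions.

        # Illustrative sketch: the number of attacks a HIDS detects depends not only on
        # classifier accuracy but also on whether the attack manifests in the monitored
        # data (visibility) and whether the HIDS survives the intrusion (resiliency).
        def expected_detections(n_attacks, p_manifest, p_subversion, true_positive_rate):
            """Expected number of detected attacks, assuming the factors are independent."""
            return n_attacks * p_manifest * (1.0 - p_subversion) * true_positive_rate

        def expected_cost(n_attacks, n_benign, p_manifest, p_subversion,
                          tpr, fpr, cost_miss, cost_false_alarm, cost_overhead):
            """Toy cost model combining missed attacks, false alarms, and overhead."""
            detected = expected_detections(n_attacks, p_manifest, p_subversion, tpr)
            missed = n_attacks - detected
            false_alarms = n_benign * fpr
            return missed * cost_miss + false_alarms * cost_false_alarm + cost_overhead

        # Hypothetical comparison of two HIDS configurations: the second classifier is
        # more accurate but sees fewer attacks manifest and is easier to subvert.
        print(expected_cost(n_attacks=50, n_benign=10_000, p_manifest=0.8, p_subversion=0.1,
                            tpr=0.90, fpr=0.010, cost_miss=1_000, cost_false_alarm=5, cost_overhead=200))
        print(expected_cost(n_attacks=50, n_benign=10_000, p_manifest=0.6, p_subversion=0.3,
                            tpr=0.95, fpr=0.005, cost_miss=1_000, cost_false_alarm=5, cost_overhead=400))
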
  • The effect of life-cycle cost disclosure on consumer behavior
    (2007-04-25) Deutsch, Matthias; Ruth, Matthias; Public Policy; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    For more than 20 years, analysts have reported on the so-called "energy paradox" or the "energy efficiency gap", referring to the fact that economic agents could in principle lower their total cost at current prices by using more energy-efficient technology but, nevertheless, often decide not to do so. Theory suggests that providing information in a simplified way could potentially reduce this "efficiency gap". Such simplification may be achieved by providing the estimated monetary operating cost and life-cycle cost (LCC) of a given appliance--which has been a recurring theme within the energy policy and efficiency labeling community. Yet, little is known so far about the causal effects of LCC disclosure on consumer action because of the gap between the acquisition of efficiency information and consumer purchasing behavior in the real marketplace. This dissertation bridges the gap by experimentally integrating LCC disclosure into two major German commercial websites--a price comparison engine for cooling appliances, and an online shop for washing machines. Internet users arriving on these websites were randomly assigned to two experimental groups, and the groups were exposed to different visual stimuli. The control group received regular product price information, whereas the treatment group was, in addition, offered information about operating cost and total LCC. Click-stream data of consumers' shopping behavior was evaluated with multiple regression analysis by controlling for several product characteristics. This dissertation finds that LCC disclosure reduces the mean energy use of chosen cooling appliances by 2.5% (p<0.01), and the energy use of chosen washing machines by 0.8% (p<0.001). For the latter, it also reduces the mean water use by 0.7% (p<0.05). These effects suggest a potential role for public policy in promoting LCC disclosure. While I do not attempt to estimate the costs of such a policy, a simple quantification shows that the benefits amount to 100 to 200 thousand Euros per year for Germany, given current predictions regarding the price of tradable permits for CO2, and not counting other potential benefits. Future research should strive for increasing external validity, using better instruments, and evaluating the effectiveness of different information formats for LCC disclosure.
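
    The life-cycle cost figure disclosed to shoppers can be sketched as the purchase price plus the discounted stream of annual operating costs; the function below and its example prices, energy use, lifetime, and discount rate are illustrative assumptions, not values from the experiment.

        # Sketch of a life-cycle cost (LCC) calculation: purchase price plus the
        # present value of annual operating costs over the appliance's lifetime.
        def life_cycle_cost(price, annual_energy_kwh, price_per_kwh, lifetime_years, discount_rate):
            """Purchase price plus present value of operating costs."""
            annual_operating = annual_energy_kwh * price_per_kwh
            present_value = sum(annual_operating / (1.0 + discount_rate) ** t
                                for t in range(1, lifetime_years + 1))
            return price + present_value

        # A cheaper but less efficient fridge can have the higher life-cycle cost.
        print(life_cycle_cost(price=450, annual_energy_kwh=300, price_per_kwh=0.25,
                              lifetime_years=15, discount_rate=0.03))
        print(life_cycle_cost(price=550, annual_energy_kwh=180, price_per_kwh=0.25,
                              lifetime_years=15, discount_rate=0.03))
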
  • An analytic case study of the evaluation reports of a comprehensive community initiative
    (2004-10-05) Frusciante, Angela Katherine; Mawhinney, Hanne B; Education Policy and Leadership; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This study is a case study of the evaluation reports of the Neighborhood and Family Initiative (NFI). NFI was a ten-year, Ford Foundation-sponsored comprehensive community initiative (CCI) in four low-income neighborhoods in four United States cities. The NFI evaluation was longitudinal, interdisciplinary, and multi-tiered. Through this study of the eleven publicly released evaluation reports, I found that the evaluators not only wrote about CCIs and evaluation but also evidenced evaluation as part of a loosely linked network supporting urban community development. The knowledge community addressed in the study is the Aspen Roundtable on Comprehensive Community Initiatives, a national coalition supporting the discussion of evaluation appropriate to community initiatives. The study involved the identification of reporting dimensions from descriptive analysis, evaluation lessons from the documented evaluators' interpretations, and change constructs from my theoretical concerns. The study resulted in a discussion of issue areas to be addressed in understanding evaluation reporting of complex social and policy initiatives. These issue areas included: community organization building versus coalition formation, comprehensiveness as a lens for change, audience, institutional distancing, and learning, knowledge development and education. With the study, I also provide an innovative methodological approach to analyzing change through the language evaluators put to initiative reporting. The qualitative approach involved devising a process for analyzing not only description and evaluators' written reflections but also change in the evaluators' interpretations. Unlike qualitative approaches that emphasize only themes as recurrences over time, the approach in this study centered on ideas as clusters that changed in configuration over time.