Evaluating Machine Intelligence with Question Answering

Thumbnail Image
Publication or External Link
rodriguez, pedro
Boyd-Graber, Jordan
Humans ask questions to learn about the world and to test knowledge understanding. The ability to ask questions combines aspects of intelligence unique to humans: language understanding, knowledge representation, and reasoning. Thus, building systems capable of intelligent question answering (QA) is a grand goal of natural language processing (NLP). To measure progress in NLP, we create "exams" for computer systems and compare their effectiveness against a reference point---often based on humans. How precisely we measure progress depends on whether we are building computer systems that optimize human satisfaction in information-seeking tasks or that measure progress towards intelligent QA. In the first part of this dissertation, we explore each goal in turn, how they differ, and describe their relationship to QA formats. As an example of an information-seeking evaluation, we introduce a new dialog QA task paired with a new evaluation method. Afterward, we turn our attention to using QA to evaluate machine intelligence. A good evaluation should be able to discriminate between lesser and more capable QA models. This dissertation explores three ways to improve the discriminative power of QA evaluations: (1) dynamic weighting of test questions, (2) a format that by construction tests multiple levels of knowledge, and (3) evaluation data that is created through human-computer collaboration. By dynamically weighting test questions, we challenge a foundational assumption of the de facto standard in QA evaluation---the leaderboard. Namely, we contend that contrary to nearly all QA and NLP evaluations which implicitly assign equal weights to examples by averaging scores, that examples are not equally useful for estimating machine (or human) QA ability. As any student may tell you, not all questions on an exam are equally difficult and in the worst-case questions are unsolvable. Drawing on decades of research in educational testing, we propose adopting an alternative evaluation methodology---Item Response Theory---that is widely used to score human exams (e.g., the SAT). By dynamically weighting questions, we show that this improves the reliability of leaderboards in discriminating between models of differing QA ability while also being helpful in the construction of new evaluation datasets. Having improved the scoring of models, we next turn to improving the format and data in QA evaluations. Our idea is simple. In most QA tasks (e.g., Jeopardy!), each question tests a single level of knowledge; in our task (the trivia game Quizbowl), we test multiple levels of knowledge with each question. Since each question tests multiple levels of knowledge, this decreases the likelihood that we learn nothing about the difference between two models (i.e., they are both correct or both wrong), which substantially increases discriminative power. Despite the improved format, we next show that while our QA models defeat accomplished trivia players, that they are overly reliant on brittle pattern matching, which indicates a failure to intelligently answer questions. To mitigate this problem, we introduce a new framework for building evaluation data where humans and machines cooperatively craft trivia questions that are difficult to answer through clever pattern matching tricks alone---while being no harder for humans. We conclude by sketching a broader vision for QA evaluation that combines the three components of evaluation we improve---scoring, format, and data---to create living evaluations and re-imagine the role of leaderboards.