Gathering Natural Language Processing Data Using Experts

dc.contributor.advisor: Boyd-Graber, Jordan
dc.contributor.author: Peskov, Denis
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2022-06-20T05:33:19Z
dc.date.available: 2022-06-20T05:33:19Z
dc.date.issued: 2021
dc.description.abstract: Natural language processing needs substantial data to make robust predictions. Automatic methods, unspecialized crowds, and domain experts can all be used to collect conversational and question answering NLP datasets. A hybrid solution that combines domain experts with the crowd generates large-scale, free-form language data.

A low-cost, high-output approach to data creation is automation. We create and analyze a large-scale audio question answering dataset through text-to-speech technology. Additionally, we create synthetic data from templates to identify limitations in machine translation. However, in Quizbowl, questions are read at an unusually fast pace and involve highly technical and multicultural words, causing a disparity between automation and reality. We conclude that the savings and scalability of automation come at the cost of data quality and naturalness.

Human input can provide this degree of naturalness but is limited in scale. Hence, large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitations of this methodology for generating data. We automatically prevent unsatisfactory submissions with an interface, but the quality control process still requires manually reviewing 5,000 questions. Standard inter-annotator agreement metrics, while useful for annotation, cannot easily evaluate generated data, causing a quality control issue.

Therefore, we posit that using domain experts for data generation can create novel and reliable NLP datasets. First, we introduce computational adaptation, which adapts, rather than translates, entities across cultures. We work with native speakers in two countries to generate the data, since the gold labels for this task are subjective and paramount. Furthermore, we hire professional translators to assess our data. Last, in a study on the game of Diplomacy, community members generate a corpus of 17,000 messages that they self-annotate while playing a game about trust and deception. The language varies in length, tone, vocabulary, punctuation, and even emoji use. Additionally, we create a real-time self-annotation system that annotates deception in a manner not possible through crowd-sourced or automatic methods. The extra effort in data collection will hopefully ensure the longevity of these datasets and galvanize other novel NLP ideas.

However, experts are expensive and limited in number. Hybrid solutions pair potentially unreliable and unverified users in the crowd with experts. We work with Amazon customer service agents to generate and annotate 81,000 goal-oriented conversations across six domains. Grounding the conversation with a reliable conversationalist, the Amazon agent, creates free-form conversations; using the crowd scales these to the size needed for neural networks.
dc.identifier: https://doi.org/10.13016/hwza-vyz4
dc.identifier.uri: http://hdl.handle.net/1903/28882
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: data
dc.subject.pquncontrolled: natural language processing
dc.title: Gathering Natural Language Processing Data Using Experts
dc.type: Dissertation

Files

Original bundle
Name: Peskov_umd_0117E_22227.pdf
Size: 3.89 MB
Format: Adobe Portable Document Format