Gathering Natural Language Processing Data Using Experts

Natural language processing needs substantial data to make robust predictions. Automatic methods, unspecialized crowds, and domain experts can all be used to collect conversational and question answering NLP datasets. A hybrid solution that combines domain experts with the crowd generates large-scale, free-form language data.

A low-cost, high-output approach to data creation is automation. We create and analyze a large-scale audio question answering dataset through text-to-speech technology. Additionally, we create synthetic data from templates to identify limitations in machine translation. However, in Quizbowl, questions are read at an unusually fast pace and involve highly technical and multi-cultural words, causing a disparity between the automated audio and real recordings. We conclude that the cost savings and scalability of automation come at the cost of data quality and naturalness.
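Template-based generation of the kind described above can be sketched as filling slotted sentence frames with controlled vocabulary. The templates and slot fillers below are hypothetical illustrations, not the actual ones used in this work:

```python
# Minimal sketch of template-based synthetic data generation.
# TEMPLATES and SLOTS here are hypothetical examples, not the thesis's data.
import itertools

# Each template contains named slots; filling them with controlled
# vocabulary lets us probe one phenomenon at a time.
TEMPLATES = ["The {occupation} said that {pronoun} would arrive soon."]
SLOTS = {
    "occupation": ["doctor", "nurse", "teacher"],
    "pronoun": ["he", "she"],
}

def generate(templates, slots):
    """Yield every sentence produced by filling each template's slots."""
    for template in templates:
        keys = sorted(slots)
        for values in itertools.product(*(slots[k] for k in keys)):
            yield template.format(**dict(zip(keys, values)))

sentences = list(generate(TEMPLATES, SLOTS))
# One template with 3 x 2 slot fillers yields 6 sentences.
```

Because every sentence differs from its neighbors in exactly one controlled slot, any change in a translation system's output can be attributed to that slot, which is what makes such probes useful for isolating failure modes.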

Human input can provide this degree of naturalness but is limited in scale, so large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitations of this methodology for generating data. Our interface automatically rejects unsatisfactory submissions, but the quality control process still requires manually reviewing 5,000 questions. Standard inter-annotator agreement metrics, while useful for annotation, cannot easily evaluate generated data, creating a quality control problem.
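The limitation noted above stems from how agreement metrics are defined: a statistic such as Cohen's kappa presumes that annotators choose from a fixed, categorical label set, whereas free-form generated questions have no such labels to compare. A minimal sketch of the standard two-annotator kappa computation (the example label sequences are invented for illustration):

```python
# Minimal sketch of Cohen's kappa for two annotators over a fixed label set.
# The label sequences below are invented examples.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same six items from {yes, no}:
kappa = cohens_kappa(["yes", "no", "yes", "yes", "no", "no"],
                     ["yes", "no", "no", "yes", "no", "yes"])
```

Nothing in this computation applies to a freshly written question: there is no shared item for two annotators to label identically, so "agreement" is undefined and quality must instead be judged by manual review.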

Therefore, we posit that using domain experts for data generation can create novel and reliable NLP datasets. First, we introduce computational adaptation, which adapts, rather than translates, entities across cultures. Since the gold label for this task is subjective and paramount, we work with native speakers in two countries to generate the data. Furthermore, we hire professional translators to assess our data. Finally, in a study on the game of Diplomacy, community members generate a corpus of 17,000 messages, self-annotated while playing a game of trust and deception. The language is varied in length, tone, vocabulary, punctuation, and even emojis. Additionally, we create a real-time self-annotation system that annotates deception in a manner not possible through crowd-sourced or automatic methods. The extra effort in data collection will hopefully ensure the longevity of these datasets and galvanize other novel NLP ideas.

However, experts are expensive and limited in number. Hybrid solutions pair potentially unreliable, unverified users in the crowd with experts. We work with Amazon customer service agents to generate and annotate 81,000 goal-oriented conversations across six domains. Grounding each conversation with a reliable conversationalist, the Amazon agent, creates free-form conversations; using the crowd scales these to the size needed for neural networks.