Workshop on Human Evaluation of NLP Systems, 27 May, ACL 2022

Programme

All timings are in UTC+1 (Dublin, Ireland).

Time Event
09:00–10:00 Invited Talk: Experts, errors, and context: A large-scale study of human evaluation for machine translation
by Markus Freitag, Google
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We further discuss the impact of this study on both the WMT metric task and the general MT task. We will close the talk by showcasing research that benefits from the new evaluation methodology: Minimum Bayes Risk Decoding with neural metrics significantly outperforms beam search decoding in expert-based human evaluations, while the previous human evaluation standards using crowd workers set both decoding strategies on par with each other.
10:00–10:30 Oral Session 1
Chair: Tom Kocmi
10:00–10:10 A Methodology for the Comparison of Human Judgments With Metrics for Coreference Resolution
Mariya Borovikova, Loïc Grobol, Anaïs Lefeuvre Halftermeyer and Sylvie Billot
10:10–10:20 Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Vivien Macketanz, Babak Naderi, Steven Schmidt and Sebastian Möller
10:20–10:30 Towards Human Evaluation of Mutual Understanding in Human-Computer Spontaneous Conversation: An Empirical Study of Word Sense Disambiguation for Naturalistic Social Dialogs in American English
Alex Lưu
10:30–11:00 Break
11:00–12:20 Oral Session 2
Chair: Elizabeth Clark
11:00–11:20 A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification
Varvara Logacheva, Daryna Dementieva, Irina Krotova, Alena Fenogenova, Irina Nikishina, Tatiana Shavrina and Alexander Panchenko
11:20–11:40 Beyond calories: evaluating how tailored communication reduces emotional load in diet-coaching
Simone Balloccu and Ehud Reiter
11:40–12:00 Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
Huiyuan Lai, Jiali Mao, Antonio Toral and Malvina Nissim
12:00–12:20 The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP
Anastasia Shimorina and Anya Belz
12:20–14:00 Lunch
14:00–15:00 Oral Session 3
Chair: Carolina Scarton
14:00–14:20 Human evaluation of web-crawled parallel corpora for machine translation
Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu and Sergio Ortiz Rojas
14:20–14:40 Toward More Effective Human Evaluation for Machine Translation
Belén C Saldías Fuentes, George Foster, Markus Freitag and Qijun Tan
14:40–15:00 Vacillating Human Correlation of SacreBLEU in Unprotected Languages
Ahrii Kim and Jinhyeon Kim
15:00–15:30 Break
15:30–16:30 Invited Talk: Cognitive Biases in Human Evaluation of NLG
by Samira Shaikh, University of North Carolina at Charlotte / Ally
Humans quite frequently interact with conversational agents. The rapid advancement in generative language modeling through neural networks has helped advance the creation of intelligent conversational agents. Researchers typically evaluate the output of their models through crowdsourced judgments, but there are no established best practices for conducting such studies. We look closely at the practices of evaluation of NLG output, and discuss implications of human cognitive biases on experiment design and the resulting data.
16:30–17:00 Closing