Workshop on Human Evaluation of NLP Systems
Short Papers
- Trading Off Diversity and Quality in Natural Language Generation
  Hugh Zhang, Daniel Duckworth, Daphne Ippolito and Arvind Neelakantan
- Towards Objectively Evaluating the Quality of Generated Medical Summaries
  Francesco Moramarco, Damir Juric, Aleksandar Savkov and Ehud Reiter
- A Preliminary Study on Evaluating Consultation Notes With Post-Editing
  Francesco Moramarco, Alex Papadopoulos Korfiatis, Aleksandar Savkov and Ehud Reiter
- The Great Misalignment Problem in Human Evaluation of NLP Methods
  Mika Hämäläinen and Khalid Alnajjar
- Eliciting Explicit Knowledge From Domain Experts in Direct Intrinsic Evaluation of Word Embeddings for Specialized Domains
  Goya van Boven and Jelke Bloem
- Detecting Post-Edited References and Their Effect on Human Evaluation
  Věra Kloudová, Ondřej Bojar and Martin Popel
Long Papers
- It’s Commonsense, isn’t it? Demystifying Human Evaluations in Commonsense-Enhanced NLG systems
  Miruna-Adriana Clinciu, Dimitra Gkatzia and Saad Mahamood
- Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation
  Jakob Nyberg, Maike Paetzel and Ramesh Manuvinakurike
- Towards Document-Level Human MT Evaluation: On the Issues of Annotator Agreement, Effort and Misevaluation
  Sheila Castilho
- Is This Translation Error Critical?: Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
  Katsuhito Sudoh, Kosuke Takahashi and Satoshi Nakamura
- A View From The Crowd: Evaluation Challenges for Time-Offset Interaction Applications
  Alberto Chierici and Nizar Habash
- Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
  Neslihan Iskender, Tim Polzehl and Sebastian Möller
- On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs
  Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann and Tom Kocmi
- A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
  Shaily Bhatt, Rahul Jain, Sandipan Dandapat and Sunayana Sitaram
- Interrater Disagreement Resolution: A Systematic Procedure to Reach Consensus in Annotation Tasks
  Yvette Oortwijn, Thijs Ossenkoppele and Arianna Betti