HumEval 2023 Archive
The 3rd Workshop on Human Evaluation of NLP Systems (HumEval’23)
News
1 March: The third edition of HumEval will be held at RANLP 2023!
1 May: Call for Papers published
15 June: Elizabeth Clark, Google Research, confirmed as keynote speaker
10 July: Workshop submission deadline extended to 20 July 2023
12 August: Invited Talk details available
4 September: Workshop schedule available!
Workshop Schedule (all times are local time in Varna)
Please note that there may be very minor differences between the paper versions linked here and the final versions in the proceedings.
0930-0940: Welcome [chair: Maja Popovic]
0940-1040: Evaluation Methods [chair: Irina Temnikova]
Linda Wiechetek, Flammie Pirinen and Per Kummervold: A Manual Evaluation Method of Neural MT for Indigenous Languages
Iva Bojic, Jessica Chen, Si Yuan Chang, Qi Chwen Ong, Shafiq Joty and Josip Car: Hierarchical Evaluation Framework: Best Practices for Human Evaluation
Tomono Honda, Atsushi Fujita, Mayuka Yamamoto and Kyo Kageura: Designing a Metalanguage of Differences Between Translations: A Case Study for English-to-Japanese Translation
1040-1100: ReproNLP Summary [chair/presenter: Anya Belz]
1100-1130: BREAK
1130-1300: ReproNLP Session 1 [chair: Anya Belz]
Javier González Corbelle, Jose Alonso and Alberto Bugarín-Diz: Some lessons learned reproducing human evaluation of a data-to-text system
Lewis Watson and Dimitra Gkatzia: Unveiling NLG Human-Evaluation Reproducibility: Lessons Learned and Key Insights from Participating in the ReproNLP Challenge
Combined Q&A for reproduction studies of “Data-to-text Generation with Macro Planning” (Part A)
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Debby Damen, Martijn Goudbeek, Chris van der Lee, Frédéric Tomas and Emiel Krahmer: How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’
Mohammad Arvan and Natalie Parde: Human Evaluation Reproduction Report for Data-to-text Generation with Macro Planning
Combined Q&A for reproduction studies of “Data-to-text Generation with Macro Planning” (Part B)
Takumi Ito, Qixiang Fang, Pablo Mosteiro, Albert Gatt and Kees van Deemter: Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization
Mingqi Gao, Jie Ruan and Xiaojun Wan: A Reproduction Study of the Human Evaluation of Role-Oriented Dialogue Summarization Models
Combined Q&A for reproduction studies of “Other Roles Matter! Enhancing Role-oriented Dialogue Summarization via Role Interactions”
1300-1430: LUNCH
1430-1600: ReproNLP Session 2 [chair: Ehud Reiter]
Margot Mieskes and Jacob Georg Benz: h_da@ReproHum – Reproduction of Human Evaluation and Technical Pipeline
Manuela Hürlimann and Mark Cieliebak: Reproducing a Comparative Evaluation of German Text-to-Speech Systems
Combined Q&A for reproduction studies of “Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features”
Ondrej Platek, Mateusz Lango and Ondrej Dusek: With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
Filip Klubička and John D. Kelleher: HumEval’23 Reproduction Report for Paper 0040: Human Evaluation of Automatically Detected Over- and Undertranslations
Combined Q&A for reproduction studies of “As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning”
Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim: Same Trends, Different Answers: Insights from a Replication Study of Human Plausibility Judgments on Narrative Continuations
Saad Mahamood: Reproduction of Human Evaluations in: “It’s not Rocket Science: Interpreting Figurative Language in Narratives”
Combined Q&A for reproduction studies of “It’s not Rocket Science: Interpreting Figurative Language in Narratives”
1600-1630: BREAK
1630-1730: Invited Talk by Elizabeth Clark [chair: Joao Sedoc]
1730: Closing Session [Ehud Reiter]
Invited Speaker: Elizabeth Clark, Google Research
Title: The importance (and challenges) of collecting human evaluations for better NLG metrics
Abstract: Human evaluations can be used to develop better automatic metrics, both as training data and as meta-evaluation benchmarks for proposed metrics. To support the development of summarization metrics, we released Seahorse, a dataset of 96K multilingual, multifaceted human ratings of summaries. I will describe the dataset and demonstrate how it can be used to train and evaluate summarization metrics. I will then discuss challenges in collecting human evaluations and suggest directions for improving them, especially given the capabilities of today’s NLG models.
Bio: Elizabeth Clark is a research scientist at Google DeepMind in New York. She works on problems in natural language generation and automatic and human evaluation. She received her PhD from the University of Washington, where she worked on models and evaluation for human-machine collaborative writing.
Workshop Topic and Content
The HumEval workshops (previously held at EACL 2021 and ACL 2022) aim to create a forum for current human evaluation research and future directions: a space where researchers working with human evaluations can exchange ideas and begin to address the many issues that human evaluation in NLP faces, including experimental design, meta-evaluation and reproducibility. We invite papers on topics including, but not limited to, the following, as addressed in any subfield of NLP:
- Experimental design and methods for human evaluations
- Reproducibility of human evaluations
- Inter-evaluator and intra-evaluator agreement
- Ethical considerations in human evaluation of computational systems
- Quality assurance for human evaluation
- Crowdsourcing for human evaluation
- Issues in meta-evaluation of automatic metrics by correlation with human evaluations
- Alternative forms of meta-evaluation and validation of human evaluations
- Comparability of different human evaluations
- Methods for assessing the quality and the reliability of human evaluations
- Role of human evaluation in the context of Responsible and Accountable AI
We welcome work from any subfield of NLP (and ML/AI more generally), with a particular focus on evaluation of systems that produce language as output.