Fourth Workshop on Human Evaluation of NLP Systems (HumEval 2024)

Call for Papers

Background and Context

Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out e.g. by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP has embraced a number of automatic evaluation metrics, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018; Mathur et al., 2020a), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Gatt and Belz, 2008; Popović and Ney, 2011; Shimorina, 2018; Mille et al., 2019; Dušek et al., 2020; Mathur et al., 2020b).

Yet there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less than perfect experimental and reporting standards that prevail (van der Lee et al., 2019; Gehrmann et al., 2023), and that low-quality evaluations with crowdworkers may not correlate well with high-quality evaluations with domain experts (Freitag et al., 2021). Only a small proportion of papers provide enough detail for reproduction of human evaluations, and in many cases the information provided is not even enough to support the conclusions drawn (Belz et al., 2023).

We have found that more than 200 different quality criteria (such as Fluency, Accuracy, Readability, etc.) have been used in NLP, and that different papers use the same quality criterion name with different definitions, and the same definition with different names (Howcroft et al., 2020). Furthermore, many papers do not use a named criterion, asking the evaluators only to assess ‘how good’ the output is. Inter- and intra-annotator agreement are usually given only in the form of an overall number, without analysing the reasons for disagreement or the potential to reduce it. A small number of papers have aimed to address this from different perspectives, e.g. comparing agreement for different evaluation methods (Belz and Kow, 2010), or analysing errors and linguistic phenomena related to disagreement (Pavlick and Kwiatkowski, 2019; Oortwijn et al., 2021; Thomson and Reiter, 2020; Popović, 2021). The context beyond the sentence level that is needed for reliable evaluation has also started to be investigated (e.g. Castilho et al., 2020).

The above aspects all interact in different ways with the reliability and reproducibility of human evaluation measures. While reproducibility of automatically computed evaluation measures has attracted attention for a number of years (e.g. Pineau et al., 2018; Branco et al., 2020), research on reproducibility of measures involving human evaluations is a more recent addition (Cooper and Shardlow, 2020; Belz et al., 2023).


HumEval’21 at EACL 2021:
HumEval’22 at ACL 2022:
HumEval’23 at RANLP 2023:

Workshop Topic and Content

The HumEval workshops (previously at EACL 2021, ACL 2022, and RANLP 2023) aim to create a forum for current human evaluation research and future directions: a space for researchers working with human evaluations to exchange ideas and begin to address the issues that human evaluation in NLP faces in many respects, including experimental design, meta-evaluation and reproducibility. We invite papers on topics including, but not limited to, the following, as addressed in any subfield of NLP:

  • Experimental design and methods for human evaluations
  • Reproducibility of human evaluations
  • Inter-evaluator and intra-evaluator agreement
  • Ethical considerations in human evaluation of computational systems
  • Quality assurance for human evaluation
  • Crowdsourcing for human evaluation
  • Issues in meta-evaluation of automatic metrics by correlation with human evaluations
  • Alternative forms of meta-evaluation and validation of human evaluations
  • Comparability of different human evaluations
  • Methods for assessing the quality and the reliability of human evaluations
  • Role of human evaluation in the context of Responsible and Accountable AI

Invited Speakers

Sheila Castilho, Dublin City University and ADAPT Research Centre, Ireland
Mark Diaz, Google Research, US

ReproNLP Shared Task

The third ReproNLP Shared Task on Reproduction of Automatic and Human Evaluations of NLP Systems will be part of HumEval, offering (A) an Open Track for any reproduction studies involving human evaluation of NLP systems; and (B) the ReproHum Track where participants will reproduce the papers currently being reproduced by partner labs in the EPSRC ReproHum project. A separate call will be issued for ReproNLP 2024.

Important dates

  • Workshop paper submission deadline: 11 March 2024
  • Workshop paper acceptance notification: 4 April 2024
  • Workshop paper camera-ready versions: 19 April 2024
  • HumEval 2024: 21 May 2024
  • LREC-COLING 2024 conference: 20–25 May 2024

All deadlines are 23:59 UTC-12.


Long papers

Long papers must describe substantial, original, completed and unpublished work. Wherever appropriate, concrete evaluation and analysis should be included. Long papers may consist of up to eight (8) pages of content, plus unlimited pages of references. Final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers’ comments can be taken into account. Long papers will be presented orally or as posters as determined by the programme committee. Decisions as to which papers will be presented orally and which as posters will be based on the nature rather than the quality of the work. There will be no distinction in the proceedings between long papers presented orally and as posters.

Short papers

Short paper submissions must describe original and unpublished work. Examples of short papers are a focused contribution, a negative result, an opinion piece, an interesting application nugget, or a small set of interesting results. Short papers may consist of up to four (4) pages of content, plus unlimited pages of references. Final versions of short papers will be given one additional page of content (up to 5 pages) so that reviewers’ comments can be taken into account. Short papers will be presented orally or as posters as determined by the programme committee. While short papers will be distinguished from long papers in the proceedings, there will be no distinction in the proceedings between short papers presented orally and as posters.

Multiple submission policy

HumEval’24 allows multiple submissions. However, if a submission has already been, or is planned to be, submitted to another event, this must be clearly stated in the submission form.

Submission procedure and templates

Please submit short and long papers directly via START by the submission deadline.
Please follow the submission guidelines issued by LREC-COLING 2024.

Optional Supplementary Materials: Appendices, Software and Data

Supplementary materials can be added in an appendix. If you wish to make software and data available to accompany the paper, please indicate this in the paper, but fully anonymise all links for the submission.


Organisers

Simone Balloccu, Charles University, CZ
Anya Belz, ADAPT Centre, Dublin City University, Ireland
Rudali Huidrom, Dublin City University, Ireland
Ehud Reiter, University of Aberdeen, UK
João Sedoc, New York University, US
Craig Thomson, University of Aberdeen, UK

For questions and comments regarding the workshop, please contact the organisers at

Programme committee

Albert Gatt, Utrecht University, NL
Leo Wanner, Universitat Pompeu Fabra, ES
Alberto José Bugarín Diz, University of Santiago de Compostela, ES
Jose Alonso, University of Santiago de Compostela, ES
Antonio Toral, Groningen University, NL
Malik Altakrori, McGill University, CA
Aoife Cahill, Dataminr, US
Malvina Nissim, Groningen University, NL
Dimitra Gkatzia, Edinburgh Napier University, UK
Yiru Li, Groningen University, NL
Margot Mieskes, University of Applied Sciences, Darmstadt, DE
Mark Cieliebak, Zurich University of Applied Sciences, CH
Diyi Yang, Georgia Tech, US
Mingqi Gao, Peking University, CN
Elizabeth Clark, Google Research, US
Mohammad Arvan, University of Illinois, Chicago, US
Filip Klubicka, Technological University Dublin, IE
Mohit Bansal, UNC Chapel Hill, US
Gavin Abercrombie, Heriot-Watt University, UK
Natalie Parde, University of Illinois, Chicago, US
Huiyuan Lai, Groningen University, NL
Ondřej Dušek, Charles University, CZ
Ondřej Plátek, Charles University, CZ
Takumi Ito, Utrecht University, NL
Rudali Huidrom, ADAPT/DCU, IE
Jackie Cheung, McGill University, CA
Saad Mahamood, Trivago N.V., DE
Pablo Mosteiro, Utrecht University, NL
Steffen Eger, University of Mannheim, DE
Jie Ruan, Peking University, CN
Tanvi Dinkar, Heriot-Watt University, UK
Joel Tetreault, Dataminr, US
John Kelleher, Technological University Dublin, IE
Xiaojun Wan, Peking University, CN
Kees van Deemter, Utrecht University, NL
Yanran Chen, Bielefeld University, DE
Lewis Watson, Edinburgh Napier University, UK
Dirk Hovy, Bocconi University, IT
Manuela Hürlimann, Zurich University of Applied Sciences, CH
Javier González Corbelle, University of Santiago de Compostela, ES
Gonzalo Méndez, Universidad Complutense de Madrid, ES
Raquel Hervas, Universidad Complutense de Madrid, ES
Marzena Karpinska, University of Massachusetts Amherst, US
Vivian Fresen, University of Mannheim, DE
Mariet Theune, University of Twente, NL
Daniel Braun, University of Twente, NL
Maria Keet, University of Cape Town, SA
Zola Mahlaza, University of Cape Town, SA
Toky Raboanary, University of Cape Town, SA
Mateusz Lango, Charles University, CZ
Patricia Schmidtova, Charles University, CZ
Anouck Braggaar, Tilburg University, NL