HumEval 2021

HumEval 2021 Archive

Invited Speakers


Margaret Mitchell
Lucia Specia, Imperial College London

Important Dates:

Dec 2, 2020: First Call for Workshop Papers
Jan 4, 2021: Second Call for Workshop Papers (with new dates)
Feb 1, 2021: Third Call for Workshop Papers
Feb 15, 2021: Workshop Papers Due Date (11:59 pm UTC-12)
Mar 22, 2021: Notification of Acceptance
Apr 1, 2021: Camera-ready papers due
Apr 7, 2021: EACL early registration deadline
Apr 19, 2021: HumEval Workshop

Workshop Topic and Content

Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out e.g. by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP embraced automatic evaluation metrics from BLEU (Papineni et al., 2001) onwards, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Over et al., 2007; Gatt and Belz, 2008; Bojar et al., 2016; Shimorina, 2018; Ma et al., 2019; Mille et al., 2019; Dušek et al., 2020).

Yet there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less-than-perfect experimental and reporting standards that prevail (van der Lee et al., 2019). Only a small proportion of papers provide enough detail for reproduction of human evaluations, and in many cases the information provided is not even enough to support the conclusions drawn. We have found that more than 200 different quality criteria (Fluency, Grammaticality, etc.) have been used in NLP. Different papers use the same quality criterion name with different definitions, and the same definition with different names. As a result, we currently do not have a way of determining whether two evaluations assess the same thing, which poses problems for both meta-evaluation and reproducibility assessments.

Reproducibility in the context of automatically computed system scores has recently attracted a lot of attention, against the background of a troubling history (Pedersen, 2008; Mieskes et al., 2019), where reproduction fails in 24.9% of cases for a team’s own results, and in 56.7% for another team’s (Mieskes et al., 2019). Initiatives have included the Reproducibility Challenge (Pineau et al., 2019; Sinha et al., 2020); the Reproduction Paper special category at COLING’18; the reproducibility programme at NeurIPS’19 comprising code submission, a reproducibility challenge, and the ML Reproducibility checklist, also adopted by EMNLP’20 and AAAI’21; and the REPROLANG shared task at LREC’20 (Branco et al., 2020).

However, reproducibility in the context of system scores obtained via human evaluations has barely been addressed at all, with a tiny number of papers (e.g. Belz & Kow, 2010; Cooper & Shardlow, 2020) reporting attempted reproductions of results. The developments in reproducibility of automatically computed scores listed above are important, but it is concerning that not a single one of the initiatives and events above addresses human evaluations. For example, even if a paper fully complies with all of the NeurIPS’19/EMNLP’20 reproducibility criteria, the human evaluation results it reports may not be reproducible to any degree, simply because the criteria do not address human evaluation in any way.

With this workshop we wish to create a forum for current human evaluation research and future directions, a space for researchers working with human evaluations to exchange ideas and begin to address the issues human evaluation in NLP currently faces, including experimental design, reporting standards, meta-evaluation and reproducibility. We will invite papers on topics including, but not limited to, the following:

Experimental design for human evaluations
Reproducibility of human evaluations
Ethical considerations in human evaluation of computational systems
Quality assurance for human evaluation
Crowdsourcing for human evaluation
Issues in meta-evaluation of automatic metrics by correlation with human evaluations
Alternative forms of meta-evaluation and validation of human evaluations
Comparability of different human evaluations
Methods for assessing the quality of human evaluations
Methods for assessing the reliability of human evaluations
Work on measuring inter-evaluator and intra-evaluator agreement
Frameworks, model cards and checklists for human evaluation
Explorations of the role of human evaluation in the context of Responsible AI and Accountable AI
Protocols for human evaluation experiments in NLP

We welcome work on the above topics and more from any subfield of NLP (and ML/AI more generally), with a particular focus on evaluation of systems that produce language as output.

Related Workshops

2nd Workshop on “Evaluation & Comparison of NLP Systems” Eval4NLP 2021 co-located at EMNLP 2021.