HumEval 2022 Archive
Markus Freitag
Samira Shaikh (University of North Carolina at Charlotte / Ally)
Workshop Topic and Content
Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out, e.g., by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP has embraced a number of automatic evaluation metrics, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018; Mathur et al., 2020a), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Gatt and Belz, 2008; Popović and Ney, 2011; Shimorina, 2018; Mille et al., 2019; Dušek et al., 2020; Mathur et al., 2020b).
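The correlation-based meta-evaluation mentioned above can be made concrete with a minimal sketch. The scores below are purely illustrative (not from any of the cited studies), and Pearson and Spearman coefficients are implemented from their textbook definitions rather than taken from a statistics library:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks, averaging ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson computed over the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical system-level scores: an automatic metric vs. averaged human ratings.
metric_scores = [0.31, 0.42, 0.28, 0.55, 0.47]
human_scores = [3.1, 3.8, 2.9, 4.4, 4.0]
print(round(pearson(metric_scores, human_scores), 3))   # strength of linear fit
print(round(spearman(metric_scores, human_scores), 3))  # agreement on system ranking
```

In practice, WMT-style metric meta-evaluation reports exactly such coefficients at the system or segment level; how stable they are across evaluation designs is one of the open questions this workshop targets.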
Yet, there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less-than-perfect experimental and reporting standards that prevail (van der Lee et al., 2019). Only a small proportion of papers provide enough detail for their human evaluations to be reproduced, and in many cases the information provided is not even sufficient to support the conclusions drawn.
For example, we have found that more than 200 different quality criteria (such as Fluency, Accuracy, and Readability) have been used in NLP (Howcroft et al., 2020). Different papers use the same quality criterion name with different definitions, and the same definition under different names. Furthermore, many papers do not specify any particular criterion at all, asking evaluators only to assess “how good” the output is. Inter- and intra-annotator agreement is usually reported only as a single overall number, without analysing the causes of disagreements or the potential to reduce them. A small number of papers have begun to address this from different angles, such as comparing agreement across evaluation methods (Belz & Kow, 2010), or analysing errors and linguistic phenomena associated with disagreements (Pavlick and Kwiatkowski, 2019; Oortwijn et al., 2021; Thomson and Reiter, 2020; Popović, 2021). Likewise, the context beyond isolated sentences that is needed for reliable evaluation has only recently begun to be investigated (e.g. Castilho et al., 2020).
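The single overall agreement number that most papers report is typically a chance-corrected statistic such as Cohen's kappa. A minimal sketch, using invented fluency judgements from two hypothetical annotators (not data from any cited study):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance given each annotator's label frequencies."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary fluency judgements on 10 system outputs.
ann1 = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
ann2 = ["good", "bad", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
print(round(cohen_kappa(ann1, ann2), 3))  # 0.583: raw agreement is 0.8, chance is 0.52
```

The point made above is that this single figure hides where the two annotators disagreed; inspecting the disagreeing items (here, items 2 and 7) and the phenomena behind them is exactly the kind of analysis the cited papers call for.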
All these aspects are important for the reliability and reproducibility of human evaluations. While the reproducibility of automatically computed evaluation scores has attracted attention in recent years (e.g. Pineau et al., 2019; Branco et al., 2020), the reproducibility of scores obtained from human evaluations has so far barely been addressed (Belz & Kow, 2010; Cooper & Shardlow, 2020). This year, the ReproGen shared task (Belz et al., 2021) aimed to shed more light on the reproducibility of human evaluations, and the first results indicate that even reproducing one’s own evaluation is not a simple task (e.g. Popović and Belz, 2021), so much more work is needed in this direction.
With this workshop, we wish to create a forum for current human evaluation research and future directions: a space for researchers working with human evaluations to exchange ideas and begin to address the issues that human evaluation in NLP faces from many points of view, including experimental design, meta-evaluation and reproducibility. We invite papers on topics including, but not limited to, the following:
- Experimental design and methods for human evaluations
- Reproducibility of human evaluations
- Work on inter-evaluator and intra-evaluator agreement
- Ethical considerations in human evaluation of computational systems
- Quality assurance for human evaluation
- Crowdsourcing for human evaluation
- Issues in meta-evaluation of automatic metrics by correlation with human evaluations
- Alternative forms of meta-evaluation and validation of human evaluations
- Comparability of different human evaluations
- Methods for assessing the quality and the reliability of human evaluations
- Role of human evaluation in the context of Responsible and Accountable AI
We will welcome work from any subfield of NLP (and ML/AI more generally), with a particular focus on evaluation of systems that produce language as output.