Workshop on Human Evaluation of NLP Systems, 21 May, LREC-COLING 2024

Programme

All timings are in UTC+2 (Turin, Italy).

Time Event  
09:00–09:10 Workshop Introduction  
09:10–10:30 Oral Session 1  
09:10–09:30 Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar and Ondřej Bojar
 
09:30–09:50 Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn and Miryam de Lhoneux
 
09:50–10:10 Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet and David Elsweiler
 
10:10–10:30 A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić and Filip Miletić
 
10:30–11:00 Coffee Break  
11:00–11:45 Invited Talk 1

Title: Beyond Performance: The Evolving Landscape of Human Evaluation
by Sheila Castilho, ADAPT/DCU

In the field of MT/NLP evaluation, the focus has traditionally been on system performance metrics. However, as language technology continues to evolve, the methodologies for evaluation must evolve as well. One of the main concerns with the rise of LLMs is the data used for training these systems. The use of vast amounts of synthetic and recycled data poses inherent risks of perpetuating biases and harms in the systems, while also contributing to the stagnation of language diversity. These consequences extend beyond mere performance metrics, impacting language vitality and cultural representation. Human evaluation emerges as a critical tool in identifying and mitigating these risks, promoting fairness, inclusivity, and linguistic diversity in system outputs. This talk will delve into how human evaluation is, more than ever, playing a crucial role in addressing these issues and informing the development of more expressive and contextually relevant translations.
11:45–13:00 ReproNLP Session 1  
11:45–11:55 Introduction  
11:55–12:05 ReproHum #0927-03 (paired):
DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar, Gavin Abercrombie and Verena Rieser

Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz
 
12:05–12:15 ReproHum #0087-01 (paired):
(virtual) Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan and Natalie Parde

(virtual) Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson and Dimitra Gkatzia
 
12:15–12:25 ReproHum #0043-04 (paired):
Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek

Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger
 
12:25–12:35 ReproHum #0712-01 (paired):
(virtual) Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman and Chenghua Lin

(virtual) A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations
Mingqi Gao, Jie Ruan and Xiaojun Wan
 
12:35–12:45 ReproHum #0033-03 (paired):
How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas

(virtual) Comparable Relative Results with Lower Absolute Values in a Reproduction Study
Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim
 
12:45–12:51 (virtual) ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation
Saad Mahamood
 
13:00–14:00 Lunch  
14:00–14:45 Oral Session 2  
14:00–14:15 Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White
 
14:15–14:30 Extrinsic evaluation of question generation methods with user journey logs
Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie
 
14:30–14:45 Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling
 
14:45–16:00 ReproNLP Session 2  
14:45–14:51 ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička and John D. Kelleher
 
14:51–14:57 (virtual) ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
Irene Mondella, Huiyuan Lai and Malvina Nissim
 
14:57–15:03 ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines
Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet
 
15:03–15:25 ReproNLP Results Report  
15:25–16:00 ReproNLP and HumEval poster session  
16:00–16:30 Coffee Break  
16:30–17:15 Invited Talk 2

Title: All That Agrees Is Not Gold: Evaluating Ground Truth and Conversational Safety
by Mark J Díaz, Google Research

Conversational safety as a task is complex, in part because ‘safety’ entails a broad range of topics and concerns, such as toxicity, harm, and legal issues. Yet, machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This requirement overly simplifies the natural subjectivity present in many tasks and obscures the inherent diversity in human perceptions and opinions about many content items. This talk will demonstrate the inherent sociocultural challenges of conversational AI safety and highlight the DICES (Diversity In Conversational AI Evaluation for Safety) dataset. The dataset contains fine-grained annotator demographic information, high replication of ratings per item, and encodes annotator votes as distributions across different demographics to allow for in-depth explorations of label aggregation strategies. In addition to describing the development of the dataset, this talk will cover past and ongoing mixed-methods analyses of annotator variance, safety ambiguity, and annotator diversity in the context of safety for conversational AI.

NOTE: Due to unforeseen circumstances, Mark J Díaz will not be able to deliver this talk. Instead, it will be given by Ding Wang.
17:15–17:35 Oral Session 3  
17:15–17:35 Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman
 
18:00 Closing  
19:00 Dinner