Workshop on Human Evaluation of NLP Systems, 21 May, LREC-COLING 2024
Programme
All times are in UTC+2 (Turin, Italy).
Time | Event
---|---
09:00–09:10 | Workshop Introduction
09:10–10:30 | Oral Session 1
09:10–09:30 | *Quality and Quantity of Machine Translation References for Automatic Metrics* (Vilém Zouhar and Ondřej Bojar)
09:30–09:50 | *Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French* (Ayla Rigouts Terryn and Miryam de Lhoneux)
09:50–10:10 | *Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks* (Alexander Frummet and David Elsweiler)
10:10–10:30 | *A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian* (Aleksandra Miletić and Filip Miletić)
10:30–11:00 | Coffee Break
11:00–11:45 | Invited Talk 1: *Beyond Performance: The Evolving Landscape of Human Evaluation*, Sheila Castilho (ADAPT/DCU). Abstract: In the field of MT/NLP evaluation, the focus has traditionally been on system performance metrics. However, as language technology continues to evolve, the methodologies for evaluation must evolve as well. One of the main concerns with the rise of LLMs is the data used for training these systems. The use of vast amounts of synthetic and recycled data poses inherent risks of perpetuating biases and harms in the systems, while also contributing to the stagnation of language diversity. These consequences extend beyond mere performance metrics, impacting language vitality and cultural representation. Human evaluation emerges as a critical tool in identifying and mitigating these risks, promoting fairness, inclusivity, and linguistic diversity in system outputs. This talk will delve into how human evaluation is, more than ever, playing a crucial role in addressing these issues and informing the development of more expressive and contextually relevant translations.
11:45–13:00 | ReproNLP Session 1
11:45–11:55 | Introduction
11:55–12:05 | ReproHum #0927-03 (paired): *DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text* (Tanvi Dinkar, Gavin Abercrombie and Verena Rieser) and *Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method* (Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz)
12:05–12:15 | ReproHum #0087-01 (paired): (virtual) *Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”* (Mohammad Arvan and Natalie Parde) and (virtual) *Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation* (Lewis N. Watson and Dimitra Gkatzia)
12:15–12:25 | ReproHum #0043-04 (paired): *Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility* (Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek) and *Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset* (Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger)
12:25–12:35 | ReproHum #0087-01 (paired): (virtual) *Human Evaluation Reproduction Report for Generating Fact Checking Explanations* (Tyler Loakman and Chenghua Lin) and (virtual) *A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations* (Mingqi Gao, Jie Ruan and Xiaojun Wan)
12:35–12:45 | ReproHum #0033-03 (paired): *How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022* (Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas) and (virtual) *Comparable Relative Results with Lower Absolute Values in a Reproduction Study* (Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim)
12:45–12:51 | (virtual) ReproHum #0124-03: *Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation* (Saad Mahamood)
13:00–14:00 | Lunch
14:00–14:45 | Oral Session 2
14:00–14:15 | *Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging* (Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White)
14:15–14:30 | *Extrinsic evaluation of question generation methods with user journey logs* (Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie)
14:30–14:45 | *Towards Holistic Human Evaluation of Automatic Text Simplification* (Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling)
14:45–16:00 | ReproNLP Session 2
14:45–14:51 | ReproHum #1018-09: *Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems* (Filip Klubička and John D. Kelleher)
14:51–14:57 | (virtual) ReproHum #0892-01: *The painful route to consistent results: A reproduction study of human evaluation in NLG* (Irene Mondella, Huiyuan Lai and Malvina Nissim)
14:57–15:03 | ReproHum #0866-04: *Another Evaluation of Readers’ Reactions to News Headlines* (Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet)
15:03–15:25 | ReproNLP Results Report
15:25–16:00 | ReproNLP and HumEval Poster Session
16:00–16:30 | Coffee Break
16:30–17:15 | Invited Talk 2: *All That Agrees Is Not Gold: Evaluating Ground Truth and Conversational Safety*, Mark J Díaz (Google Research). Abstract: Conversational safety as a task is complex, in part because ‘safety’ covers a broad range of concerns, such as toxicity, harm, and legal issues. Yet, machine learning approaches often require training and evaluation datasets with a clear separation between positive and negative examples. This requirement oversimplifies the natural subjectivity present in many tasks and obscures the inherent diversity in human perceptions and opinions about many content items. This talk will demonstrate the inherent sociocultural challenges of conversational AI safety and highlight the DICES (Diversity In Conversational AI Evaluation for Safety) dataset. The dataset contains fine-grained annotator demographic information and high replication of ratings per item, and encodes annotator votes as distributions across different demographics to allow for in-depth explorations of label aggregation strategies. In addition to describing the development of the dataset, this talk will cover past and ongoing mixed-methods analyses of annotator variance, safety ambiguity, and annotator diversity in the context of safety for conversational AI. NOTE: Due to unforeseen circumstances, Mark J Díaz will not be able to deliver this talk. Instead, it will be given by Ding Wang.
17:15–17:35 | Oral Session 3
17:15–17:35 | *Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions* (Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman)
18:00 | Closing
19:00 | Dinner |