A Brave New World
Every once in a while, a new technology with a significant societal impact comes along. To many, it seems that ChatGPT fits this criterion. Scrolling through Twitter and LinkedIn, one might get the impression that NLP is solved or that we are all doomed (depending on whom you follow). Although ChatGPT is awe-inspiring at times, we know that there are many tasks at which it fails. Thus, one crucial task for our field is to tackle the issue of evaluation. Or, put differently: how do we know whether ChatGPT (or any of its descendants) can solve a task or not? Looking at how ChatGPT is currently evaluated, we see many cherry-picked (or lemon-picked) examples on Twitter that showcase how well (or poorly) ChatGPT performs on a particular task.
Unfortunately, a closer look at the current literature on the evaluation of text generation systems reveals no conclusive answer. To this day, evaluating a text generation model is more art than science. Although it might be fun to design a custom evaluation, there is no standard protocol or set of guidelines that defines a framework within which to move freely. Simply running a human evaluation is not the easy answer one might hope for: human evaluations are hard to set up and reproduce, must be handled with great care, and are time- and cost-intensive. Recent advances in automated metrics do not solve the problem either. The metrics are still too brittle, disagree too often with human ratings, and are hard to interpret.
As deep learning started to gain traction (this was ages ago, in 2016, when we were young and naive), we at the Centre of Artificial Intelligence at the ZHAW thought it would be a good idea to work on text-generation tasks. When we tried to reproduce the Seq2Seq architecture for conversational dialogue systems or built our first NLG model, it became apparent that evaluating these generative models is a real challenge. Over the years, we delved deeper into the problem of evaluating text generation systems. One thing that never happened, though, was gaining the confidence that we knew how to properly evaluate our models. Even worse, the more we worked on this topic, the stronger our impression became that we knew less and less (according to Plato’s version of Socrates, this might be a sign of growing wisdom).
Recently, we took a step back and tried to understand what makes this problem so hard. While we still do not have a conclusive answer (and maybe never will), we came to the understanding that most evaluations involve a multitude of sources of uncertainty, which are rarely accounted for. Thus, we started a research program in which we aim to develop a statistical model that captures these sources of uncertainty, yielding more interpretable and useful evaluations.
The aim of this blog series is twofold: first, we want to argue that this is generally a good idea, and second, we want to take you on this journey and hope to spark some interesting discussions (maybe our approach is entirely misguided, and we’d like to know before it’s too late 😉 ).
We start the series with this post, in which we introduce a fictional showcase (based on our painful experience) to highlight the main issues with evaluation as it is done today. The aim is to motivate the need for more rigor in evaluation and a more formal approach.
Fictional Showcase: Summarizing Meeting Transcripts
We all have too many meetings, and more often than not, we either cannot attend or find it hard to keep track of all the outcomes and decisions (which happens when you spend the meeting answering emails). Thus, the vision is to have a system where you upload the meeting recording and receive a summary of the meeting in return. One issue is that summarization tools are prone to ignoring key elements in the text or hallucinating facts that do not appear in the original text.
In this scenario, we assume we have built a new transcript summarization model called SummPT. We are at the stage where we want to compare the performance of SummPT to a set of five state-of-the-art models. We are particularly interested in the frequency at which it hallucinates (this is quite important, as we want to avoid the summary stating that you are getting fired when, in fact, your boss said that your code is fya).
In what follows, we will introduce the problems with evaluating this system. In particular, we focus on the various sources of uncertainty that are introduced at different stages of the evaluation. We will look at both human and automated evaluation. Note that all the numbers are fictional; they are there to help highlight the problem.
From definition to rating – what can go wrong?
The first step in evaluating our text summarization system is to ask ourselves what we want to know. We need to state clearly which characteristic is to be evaluated. This is not straightforward, as it entails defining both the behavior we wish to see and the behavior we do not want to see (for instance, in our paper, we discussed various ideas of what one might be interested in when summarizing a dialogue). In most cases, there is a tradeoff between a definition that is too fuzzy, and thus too open to interpretation by human raters, and one that is too strict, rendering the result meaningless (more on this here).
Often, text summarization is evaluated for coherence, consistency, fluency, and relevance. In our case, however, we are interested in measuring the hallucination rate. Thus, we spend several hours crafting a precise definition of what we consider a “hallucination.” The result is a reasonable definition; at the same time, we are aware that we cannot cover all the edge cases, which we hope smart humans will handle.
Thus, the first source of uncertainty in the evaluation pipeline stems from the definition of what we want to measure, combined with the subjectivity of the human raters. We posit that it is infeasible to eradicate this problem completely. We can measure its effects by considering the inter-annotator agreement. However, we argue (as have others) that disagreement needs to be handled differently.
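As a concrete illustration, inter-annotator agreement is often quantified with a chance-corrected statistic such as Cohen’s kappa. A minimal sketch (plain Python; the ratings below are hypothetical) might look like this:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n  # raw agreement
    labels = set(a) | set(b)
    # agreement expected by chance, from each rater's label distribution
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical binary hallucination ratings from two raters
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(rater_1, rater_2)  # well below 1 despite 75% raw agreement
```

A kappa well below 1, even between diligent raters, is a first, crude signal of the definitional uncertainty described above.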
Who rates the summaries?
On top of the uncertainty introduced by the fuzziness of the definitions, we have to deal with more uncertainty/noise stemming from the raters themselves. This can have a variety of causes.
- The task is hard: rating summaries of transcripts is challenging (and mentally exhausting). One needs to check for every statement in the summary whether it appeared in the transcript, which at times is hard to read (as an exercise, we encourage the reader to read the MSTeams transcript of their next meeting).
- Domain expertise is needed. There are tasks where factual correctness can only be assessed by domain experts. This becomes more apparent as the performance of text generation systems keeps increasing, which demands ever more expertise to find mistakes (e.g., only a professional translator might be able to spot certain errors).
- Attention to detail is missing. In our task, attention to detail is paramount: to measure the hallucination rate, each summary must be compared to the transcript, and some hallucinations are quite subtle.
Thus, there are many different sources of uncertainty and noise, which differ from those above in that the uncertainty here stems from errors we do not want. While in the cases above, raters disagreed based on their subjective (but still truthful) interpretation, here the uncertainty stems from unintended behavior. In most cases, the cause lies in the design of the experiment, which elicits the errors made by the raters (although you are never safe from ill-intended and lazy workers – or you should be less stingy).
How much to annotate – get statistical significance at any cost?
Of course, we avoided all the previous pitfalls (and many more, which you can read about here, and also here), and we decided to annotate 30 summaries per summarization system (following some misinterpreted heuristics found on the internet). Each summary is annotated by three humans. We perform a binary annotation (0 -> hallucination present, 1 -> no hallucination) and use a simple aggregation rule: if any of the three ratings flags a hallucination, the final label is set to hallucination.
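This aggregation rule can be sketched in a few lines (plain Python; the ratings follow the convention from the text, 0 = hallucination present, and the example data is hypothetical):

```python
def aggregate(ratings):
    """Combine binary ratings (0 = hallucination present, 1 = none).
    If any rater flags a hallucination, the final label is 0."""
    return min(ratings)

# Hypothetical ratings for three summaries, three raters each
summaries = [[1, 1, 1], [1, 0, 1], [0, 0, 1]]
labels = [aggregate(r) for r in summaries]          # [1, 0, 0]
hallucination_rate = labels.count(0) / len(labels)  # 2 of 3 summaries flagged
```

Note that this "any rater" rule is maximally sensitive: a single erroneous rating is enough to flag a clean summary, so rater noise feeds directly into the estimated rate.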
In our hypothetical scenario, let’s assume that none of the six systems (SummPT and the other five SOTA systems) have significantly different hallucination rates. Thus, we spend more of our grant on annotating more and more samples until we reach statistical significance. After annotating 300 summaries per system, we find that SummPT has a hallucination rate of 6%, while the second-best SOTA system has a rate of 6.5% (thus, we write a paper stating our results and hope to get accepted at a top-tier conference).
Here, the uncertainty stems from the sample size used to estimate the hallucination rate: the more samples we use, the smaller the uncertainty gets. However, the trend of simply accumulating more ratings until we reach statistical significance might not be the best route to take (even though much of machine learning does exactly this). The difference between SummPT and the other systems might be statistically significant, but its magnitude is small. Furthermore, the threshold at which we declare statistical significance is an arbitrary choice of how to interpret the underlying probability distribution.
Finally, and most importantly, when considering all the previous sources of uncertainty, which are often not modeled when computing statistical significance, we should have serious doubts about whether the difference is really significant.
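To make this concrete, here is a minimal sketch (plain Python; the counts of 18/300 and 20/300 are hypothetical, chosen to match rates close to 6% and 6.5%) of a two-proportion z-test, one way to compare two hallucination rates:

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for the difference between two proportions (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value via the standard normal CDF (normal approximation)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 18/300 (6%) vs. 20/300 (~6.7%) summaries with hallucinations
z, p = two_proportion_ztest(18, 300, 20, 300)
```

With rates this close, the p-value at n = 300 is far above any conventional threshold: the margin of error is simply much larger than a half-percentage-point difference, which illustrates why such small gaps should be reported with caution.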
Automating the evaluation – what is the best metric?
Since we are not quite convinced that the difference is statistically significant, we spend more time improving SummPT. Furthermore, although SummPT is “better” than the other systems, it is still far from good, with hallucinations present in 6% of the summaries. However, we do not have the money to run another extensive and expensive human evaluation. Thus, we resort to automated evaluation, since it is cheaper and easily reproducible.
There are already many different metrics to choose from. Thus, we aim to choose the metric that correlates best with our human judgments. Of course, the first reflex is to use the ROUGE score, used in all the summarization papers. However, since we are skeptical about ROUGE’s ability to detect hallucinations, we use HALUC, a recently developed (fictive) metric (with a terrible acronym), to measure the hallucination rate of a summarization system.
We use the expert-rated summaries (from the human evaluation above) to select the best metric and compute the correlation with human judgments. Our (fictional) evaluation yields a correlation of 0.45 between HALUC and human judgments. Now the question is: what does this number tell us?
In fact, this number is of little use, as it does not tell us how to interpret an evaluation run with the metric. If system A gets a HALUC score of 55.3, system B gets a HALUC score of 55.9, and the difference is statistically significant, we still cannot state which system is better, since we do not know the impact of the errors made by HALUC.
Here we uncover two sources of uncertainty. First, the obvious problem: the metric correlates only to a certain degree with human judgments. Thus, when running an evaluation with a metric, the disagreement with humans must be factored in. The second, less obvious problem is that the error rate of the metric is measured by comparing its outputs to human outputs. However, as we have seen above, human ratings are themselves uncertain. Thus, the uncertainty of the human ratings is propagated into the error-rate computation for the automated metric.
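The second problem can be made tangible with a small Monte Carlo sketch (plain Python; the error rates of 10% for the raters and 20% for the metric are assumed purely for illustration): when noisy human labels serve as the “ground truth”, the measured metric-human agreement understates how well the metric tracks the latent truth:

```python
import random

random.seed(0)

def flip(label, prob):
    """Flip a binary label with the given probability (simulated annotation error)."""
    return 1 - label if random.random() < prob else label

n = 100_000
truth  = [int(random.random() < 0.06) for _ in range(n)]  # latent hallucination labels
humans = [flip(t, 0.10) for t in truth]                   # raters err on 10% of items
metric = [flip(t, 0.20) for t in truth]                   # metric errs on 20% of items

agree_truth = sum(m == t for m, t in zip(metric, truth)) / n   # ~0.80
agree_human = sum(m == h for m, h in zip(metric, humans)) / n  # ~0.74: rater noise deflates it
```

In other words, part of what looks like metric error is really propagated rater error, and a measured agreement (or correlation) conflates the two.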
The uncertainty goes off the charts
To summarize, the showcase highlights that the evaluation of text generation systems accumulates many sources of uncertainty that propagate through the pipeline (interesting side note: in experimental physics, this phenomenon is called propagation of uncertainty). This uncertainty lies in the nature of the task, and instead of trying to get rid of it, we propose to model it.
Although everything might seem very gloomy after reading this post, one must not forget that measurement of any kind is complex. It took almost 200 years to arrive at the current definition of the meter (I can recommend this read: History of the Meter). It also took a painstaking amount of time and effort to learn how to measure temperature (Timeline Temperature). Many disciplines struggle to find good ways of measuring effects.
In this series of blog posts, we will discuss the problem of evaluating systems designed for text generation, be it machine translation, conversational dialogue systems, automated text summarization, or classical NLG (data-to-text).
In the following posts, we will introduce our work that tackles or exposes these problems. In doing so, we introduce a statistical framework to resolve these issues (no worries, we will be gentle and won’t bother you with the mathematical details), and we will derive guidelines for running more robust evaluations.