How well do chatbots or AI translators perform?

We all make mistakes: we are human after all. But what about machines? Since they are programmed by humans, they are prone to errors too. For automatic text generation, researchers from the Center of Artificial Intelligence (CAI) have developed a new, fully automated way of checking the quality of such systems. Machines evaluating machines – can that work?

In their paper “On the Effectiveness of Automated Metrics for Text Generation Systems”, Pius von Däniken, Jan Deriu, Don Tuggener and Mark Cieliebak put forth a new automated approach to evaluating text generation systems. What is automatic text generation, you ask? You probably know and use some of these systems. Examples are machine translation (DeepL, Google Translate, etc.), dialogue systems (chatbots such as ChatGPT), automated summarization (for example the summary feature of Google Docs), paraphrasing, caption generation, or natural language generation. Broadly speaking, the field of Text Generation is a subfield of Natural Language Processing (Celikyilmaz et al., 2020). Let’s dive into it.

Jan, you recently published a paper on the evaluation of Text Generation Systems. Why do we need to evaluate these systems and what are the challenges?

Text generation systems such as DeepL need to be checked for errors to ensure the quality of their output – in this example, the translation of a word or text. This check can be done by humans and/or machines. The evaluation itself is an unsolved issue: human evaluation is more reliable, but it takes much more time and is more expensive than automated evaluation. Our paper takes a small step towards a solution.

You came up with a theoretical foundation. Can you tell us more?

We are developing a theoretical framework for the evaluation of text generation, from which guidelines for designing a quantitative evaluation of text generation systems can be derived. In this paper, we introduce the first results of this framework, which covers a simplified class of metrics, so-called binary metrics.

Binary metrics classify the output of a text generation system as either adequate or inadequate. This allows us to measure the performance of a text generation system as the ratio of adequate responses it generates. The main problem with these automated metrics is that they are error-prone: they rate inadequate responses as adequate. This leads to faulty evaluations, which in turn leads to low trust in these metrics. Our framework makes it possible to correct the evaluations of these metrics by leveraging human evaluations. That is, we can boost a human evaluation with metric evaluations, which in the end requires fewer human annotations.
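To make the correction idea concrete, here is a minimal sketch in Python. It is not the estimator from the paper, but the standard correction for a noisy binary classifier; the function name, the variable names and the assumption that a small human-annotated subset is available are all illustrative.

```python
import numpy as np

def corrected_adequacy_rate(metric_labels, human_idx, human_labels):
    """Correct the adequacy rate reported by an error-prone binary metric.

    metric_labels: 0/1 metric judgements for all system outputs
    human_idx:     indices of the outputs that were also judged by humans
    human_labels:  0/1 human judgements (ground truth) for that subset
    """
    metric_labels = np.asarray(metric_labels)
    human_labels = np.asarray(human_labels)
    metric_on_subset = metric_labels[np.asarray(human_idx)]

    # Estimate the metric's error behaviour on the human-annotated subset:
    # tpr = P(metric says adequate | humans say adequate)
    # fpr = P(metric says adequate | humans say inadequate)
    tpr = metric_on_subset[human_labels == 1].mean()
    fpr = metric_on_subset[human_labels == 0].mean()

    # Raw, error-prone estimate: fraction of all outputs the metric calls adequate.
    raw = metric_labels.mean()

    # Invert raw = tpr * true_rate + fpr * (1 - true_rate) to get true_rate.
    # Assumes the metric is better than chance, i.e. tpr > fpr.
    true_rate = (raw - fpr) / (tpr - fpr)
    return float(np.clip(true_rate, 0.0, 1.0))
```

In this spirit, a modest human-annotated subset anchors a metric that is then applied to the full set of outputs, so fewer human annotations are needed overall.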

Our framework also lets us establish how many samples are required to reliably distinguish text generation systems (meaning: when the difference between them is statistically significant). With the code we’ve written, we can evaluate a single system – we can compute error-free, error-prone and mixed metrics. In a second step, we can also compare the performance of different systems and check whether the application mirrors the theory.
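As an illustration of the sample-size question, here is a small sketch under textbook assumptions: a two-sided two-proportion z-test with conventional significance level and power. This is not necessarily the procedure used in the framework described here; it only shows the kind of answer such an analysis gives.

```python
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_system(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate number of rated outputs needed per system to detect the
    difference between two adequacy rates p_a and p_b with a two-sided
    two-proportion z-test (standard textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)

# Distinguishing systems with 70% vs. 75% adequate responses:
print(samples_per_system(0.70, 0.75))  # about 1250 rated outputs per system
```

Even a five-point difference in adequacy already calls for on the order of a thousand rated outputs per system, which is why combining a limited number of human annotations with a corrected automated metric is attractive.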

Are you planning to continue this research?

Yes, we will definitely stay on this path, as we are the first to take it. The main work for our paper was done in Pius’s Master’s thesis. The current theory is limited to binary metrics, but in future work we will extend it to other types of metrics, such as comparative or scalar metrics. Furthermore, we will apply the theory to a wider range of tasks and domains.

Who can use your theoretical foundation? Is the code open access?

Yes, you can find the code and visualization tool here. Practitioners can use it to enter their evaluation settings and receive an analysis of the measurable differences in their setting. We are also currently looking for collaboration partners and plan to work in interdisciplinary teams, for example with linguists.

Your team presented the paper at the GEM workshop 2022. What were the reactions?

Pius, a Master’s student in our group, presented our work as a poster at the GEM workshop. The reactions were positive. There is ever-growing interest in the evaluation of text generation systems as they become more and more powerful and, as a result, more and more widely used. The error-correction aspect of the theory in particular was well received.

Sources and further information:

