AutoML Summer School 2025 - Tübingen
I gave a lecture on the challenges of LLM evaluations at the AutoML School 2025.
It was a great pleasure to discuss with the folks participating in the summer school; thanks a lot to the organizers for inviting me!
In case you are interested, here are the slides.
The end of the talk discusses and advocates for fully open-source LLMs and describes OpenEuroLLM, a recent European project on this topic.
We still have open positions for OpenEuroLLM at the ELLIS Institute Tübingen (see the slides at the end of the talk). If you are interested, the platform to apply should be available next week!
Here is the full abstract of the talk:
Challenges in Evaluating Large Language Models

Large Language Models (LLMs) are increasingly being used in various applications, from basic information retrieval to coding and beyond. As a result, many actors share LLMs, either through black-box services or open-weight access. Benchmarking is therefore crucial to determine which model is most suitable for a specific use case.
The generative nature of LLMs poses a significant challenge for evaluating their performance: they can be prompted with a wide range of tasks, including translation, coding assistance, travel advice, and cooking recipe suggestions. This diversity makes it difficult to assess their effectiveness.
In this talk, I will explore various evaluation approaches that have been proposed and discuss their advantages and limitations. In particular, I will focus on LLM-judges, a family of methods that use an LLM to evaluate the free-form text generated by another LLM. I will also discuss how AutoML can significantly improve such systems and enable the use of open-weight models instead of closed services such as GPT-4.
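To make the LLM-judge idea concrete, here is a minimal sketch in Python: one model is asked to grade the free-form answer of another via a scoring prompt. The prompt wording and the `query_judge_model` function are hypothetical placeholders, not part of the talk; in practice the latter would wrap whatever judge model you use (e.g. an open-weight model served locally).

```python
# Minimal sketch of an LLM-judge: one model grades the free-form answer of another.
# `query_judge_model` is a hypothetical placeholder; plug in your own model client.

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale from 1 (poor) to 10 (excellent).

Question: {question}
Answer: {answer}

Reply with a line "Score: <number>" followed by a short justification."""


def query_judge_model(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to the judge LLM and return its reply."""
    raise NotImplementedError("plug in your model client here")


def judge(question: str, answer: str) -> int:
    """Ask the judge model to score an answer and parse the numeric score from its reply."""
    reply = query_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    for line in reply.splitlines():
        if line.lower().startswith("score:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("judge reply did not contain a score")
```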