Learn how to reliably evaluate LLM-powered systems in production. Understand how to turn expert knowledge into automated tests, use LLMs as calibrated judges, and combine metrics and human feedback to drive continuous system improvement.
As large language models become integral to real-world applications, robust and scalable evaluation methods are essential. However, evaluating free-form text generated by LLMs is inherently challenging: there is rarely a single correct answer, outputs vary widely in form, errors are often subtle, and answers may be only partially correct. Generic benchmarks are poorly suited to the domain-specific, multi-step reasoning found in LLM-powered systems. At the same time, manually inspecting every intermediate step does not scale, while fully automated LLM-based judges often lack the domain context and expertise required for reliable assessment.
In this talk, we present a practical framework for evaluating LLM-powered systems and enabling continuous improvement in production. In particular, we will show how to turn expert domain knowledge into automated tests, how to use LLMs as calibrated judges, and how to combine automated metrics with targeted human feedback to drive iterative system improvement.
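To give a flavor of what turning an expert rubric into an automated check can look like, here is a minimal sketch of an LLM-as-judge test. The `call_llm` helper and the rubric wording are illustrative assumptions, not part of any specific library or of the framework presented in the talk.

```python
# Minimal sketch: an expert rubric expressed as an automated LLM-as-judge test.
# `call_llm` is a hypothetical stand-in for whatever model client you use.
import json

JUDGE_PROMPT = """You are a strict domain expert reviewer.
Rubric:
1. The answer must rely only on facts present in the provided context.
2. The answer must address every part of the user question.
Return JSON: {{"score": <0-5>, "reasons": "<short explanation>"}}.

Question: {question}
Context: {context}
Answer: {answer}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; replace with your client of choice."""
    raise NotImplementedError

def judge(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)  # in practice, guard against malformed JSON

def test_answer_meets_rubric(example: dict) -> None:
    """Automated test: fail the run if the judge score drops below a threshold."""
    verdict = judge(example["question"], example["context"], example["answer"])
    assert verdict["score"] >= 4, verdict["reasons"]
```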
By the end of the talk, attendees will know how to design a human-in-the-loop evaluation process and integrate it into iterative improvement cycles that continuously refine LLM-based systems.
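As one concrete example of what "calibrating" a judge against human feedback can mean, the short sketch below measures how often the judge's pass/fail decision agrees with expert labels on a small golden set; the data structures and threshold are illustrative assumptions.

```python
# Sketch: check judge/expert agreement on a small labeled set before trusting
# the judge in CI. The `labeled_examples` structure and threshold are illustrative.
def judge_agreement(labeled_examples, judge_fn, threshold=4):
    """Fraction of examples where the judge's pass/fail matches the expert label."""
    matches = 0
    for ex in labeled_examples:
        verdict = judge_fn(ex["question"], ex["context"], ex["answer"])
        judge_pass = verdict["score"] >= threshold
        matches += judge_pass == ex["expert_pass"]  # expert_pass: bool from human review
    return matches / len(labeled_examples)

# Only promote the judge to a gating test once agreement on a held-out golden
# set is high enough, e.g. judge_agreement(golden_set, judge) >= 0.9.
```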
Target audience: ML engineers, data scientists, and practitioners building or deploying LLM-powered systems.
Attendees are assumed to have basic familiarity with LLMs and ML workflows.
Iryna is a data scientist and co-founder of DataForce Solutions GmbH, a company specializing in end-to-end data science and AI services. She contributes to several open-source libraries and strongly believes that open source fosters a more inclusive tech industry, equipping individuals and organizations with the tools they need to innovate and compete.