Learn how to reliably evaluate LLM-powered systems in production. Understand how to turn expert knowledge into automated tests, use LLMs as calibrated judges, and combine metrics and human feedback to drive continuous system improvement.
As large language models become integral to real-world applications, robust and scalable evaluation methods are essential. However, evaluating free-form text generated by LLMs is inherently challenging: there is rarely a single correct answer, outputs vary widely in form, errors are often subtle, and answers may be only partially correct. Generic benchmarks are poorly suited to the domain-specific, multi-step reasoning found in LLM-powered systems. At the same time, manually inspecting every intermediate step does not scale, while fully automated LLM-based judges often lack the domain context and expertise required for reliable assessment.
In this talk, we present a practical framework for evaluating LLM-powered systems and enabling continuous improvement in production. In particular, we will show how to turn expert domain knowledge into automated tests, how to use LLMs as calibrated judges, and how to combine automated metrics with targeted human feedback to drive iterative system improvement.
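To give a flavor of what turning an expert rubric into an automated check can look like, here is a minimal sketch of an LLM-as-judge test. The `call_llm` helper and the rubric wording are illustrative assumptions, not part of any specific library or of the framework presented in the talk.

```python
# Minimal sketch: an expert rubric expressed as an automated LLM-as-judge test.
# `call_llm` is a hypothetical stand-in for whatever model client you use.
import json

JUDGE_PROMPT = """You are a strict domain expert reviewer.
Rubric:
1. The answer must rely only on facts present in the provided context.
2. The answer must address every part of the user question.
Return JSON: {{"score": <0-5>, "reasons": "<short explanation>"}}.

Question: {question}
Context: {context}
Answer: {answer}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; replace with your client of choice."""
    raise NotImplementedError

def judge(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)  # in practice, guard against malformed JSON

def test_answer_meets_rubric(example: dict) -> None:
    """Automated test: fail the run if the judge score drops below a threshold."""
    verdict = judge(example["question"], example["context"], example["answer"])
    assert verdict["score"] >= 4, verdict["reasons"]
```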
By the end of the talk, attendees will know how to design a human-in-the-loop evaluation process and integrate it into iterative improvement cycles that continuously refine LLM-based systems.
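As one concrete example of what "calibrating" a judge against human feedback can mean, the short sketch below measures how often the judge's pass/fail decision agrees with expert labels on a small golden set; the data structures and threshold are illustrative assumptions.

```python
# Sketch: check judge/expert agreement on a small labeled set before trusting
# the judge in CI. The `labeled_examples` structure and threshold are illustrative.
def judge_agreement(labeled_examples, judge_fn, threshold=4):
    """Fraction of examples where the judge's pass/fail matches the expert label."""
    matches = 0
    for ex in labeled_examples:
        verdict = judge_fn(ex["question"], ex["context"], ex["answer"])
        judge_pass = verdict["score"] >= threshold
        matches += judge_pass == ex["expert_pass"]  # expert_pass: bool from human review
    return matches / len(labeled_examples)

# Only promote the judge to a gating test once agreement on a held-out golden
# set is high enough, e.g. judge_agreement(golden_set, judge) >= 0.9.
```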
Target audience: ML engineers, data scientists, and practitioners building or deploying LLM-powered systems.
Attendees are assumed to have basic familiarity with LLMs and ML workflows.
Iryna is a data scientist and co-founder of DataForce Solutions GmbH, a company specializing in end-to-end data science and AI services. She contributes to several open-source libraries and strongly believes that open source fosters a more inclusive tech industry, equipping individuals and organizations with the tools they need to innovate and compete.