Reproducibility in embedding benchmarks is challenging, especially as models and datasets grow ever larger. Learn how MTEB tackles prompt variability, scaling issues, and large datasets to ensure fair and consistent evaluations, setting a standard for embedding evaluation.
Reproducibility in embedding benchmarks is no small feat. Prompt variability, growing computational demands, and evolving tasks make fair comparisons a challenge. The need for robust benchmarking has never been greater.
The Massive Text Embedding Benchmark (MTEB) addresses these challenges with a standardized, open-source framework for evaluating text embedding models. Covering diverse tasks like clustering, retrieval, and classification, MTEB ensures consistent and reproducible results. Extensions like MMTEB (multilingual) and MIEB (image) further expand its capabilities.
In this talk, we’ll explore the quirks and complexities of benchmarking embedding models, such as prompt sensitivity, scaling issues, and emergent behaviors. We’ll show how MTEB simplifies reproducibility, making it easier for researchers and industry practitioners to measure progress, choose the right models, and push the boundaries of embedding performance.
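To give a sense of how low the barrier is, here is a minimal sketch of an evaluation run with the mteb package. This is illustrative only: exact function names and signatures vary between mteb versions, and the model and task below are arbitrary examples rather than recommendations.

```python
# Minimal sketch of an MTEB evaluation run (API details may differ by mteb version).
import mteb
from sentence_transformers import SentenceTransformer

# Any SentenceTransformer-compatible encoder; this model name is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Select one or more benchmark tasks; Banking77Classification is a small example task.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Scores and run metadata are written to disk, which makes runs easy to reproduce and compare.
results = evaluation.run(model, output_folder="results")
```

Because task definitions, prompts, and scoring live inside the benchmark rather than in ad-hoc scripts, two people running the same model on the same task should get the same numbers.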
My focus is on making AI systems scalable and maintainable. Currently I’m a Staff Machine Learning Scientist at Zendesk. My background is in Aerospace Engineering and Machine Learning, and I hold undergraduate (B.A.Sc in EngSci) and graduate (M.A.Sc) degrees from the University of Toronto. In my spare time, I try to contribute to open-source projects (e.g. MTEB), see the world, and stay active. These days I’m also into cycling, running, and hiking.