Why does your LLM give different answers at temperature zero? It’s not floating-point chaos, it’s… other users! Join me for a myth-busting journey into inference engines to uncover the real cause of nondeterminism.
Deterministic LLM inference sounds simple: set temperature to zero and get consistent answers. But reality is messier, and asking the same question a thousand times might yield dozens of different responses.
While GPU parallelism and floating-point arithmetic play a role, a post by Thinking Machines Lab [1] identifies the real culprit as… other users. Concurrent requests to the same server change the batch size, which silently alters the order of transformer computations, making the output look like uncontrollable randomness.
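To make the mechanism concrete, here is a minimal, self-contained Python sketch (an illustration, not material from the talk or the cited post): it sums the same float32 values strictly sequentially and then in chunks, roughly the way a kernel might regroup a reduction when the batch size changes, and the two orders of addition can give slightly different results.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

# Strictly sequential accumulation: one particular order of additions.
sequential = np.float32(0.0)
for v in x:
    sequential = sequential + v

# Chunked accumulation: the same values, grouped differently, as a kernel
# might split the reduction for a different batch size.
chunked = x.reshape(100, 100).sum(axis=1).sum()

print(sequential, chunked)    # slightly different float32 values
print(sequential == chunked)  # typically False

Inside an inference engine the same effect shows up in matrix multiplications and attention reductions: the inputs are identical, but the batch size determines how they are grouped, so the final logits, and eventually the generated tokens, can diverge.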
This talk will debunk the common myths about LLM nondeterminism, walk through the actual mechanics of inference engines, and explain what it would take to achieve true reproducibility.
Whether they call an external API or run a self-hosted inference stack, attendees will leave with a clear understanding of why this happens, which strategies can help address it, and how to think about reproducibility in AI systems.
Talk outline:
• 0-8 min: Introduction to the problem, common myths, and real-world examples
• 8-18 min: How concurrent user requests affect inference
• 18-25 min: Implications for production systems and remediation strategies
• 25-30 min: Conclusion and Q&A
[1] Thinking Machines Lab, “Defeating Nondeterminism in LLM Inference”: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/