Hacking an AI Brain: Frankenstein Experiments with Language Models

We dissect transformer models in code and run causal experiments on them: tracing activations, patching signals between layers, and deliberately breaking model behaviors to understand what changed and why. A hands-on tour of mechanistic interpretability in practice.

This talk treats transformer language models as experimental creatures. We open them up in code, expose attention heads, residual streams, and neuron activations, and ask precise questions by intervening directly in their internal machinery.
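As a rough illustration of what "exposing attention heads" and "intervening directly" can look like, here is a minimal sketch in plain PyTorch. It is not the talk's own code: the layer is a hypothetical toy with random weights, and the names `attention_layer` and `head_mask` are invented for this example. Each head writes its own contribution into the residual stream, so zeroing one head and comparing outputs gives a simple ablation measurement.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy sizes and random weights: an illustration, not a trained model.
n_heads, d_head, d_model, seq_len = 4, 4, 16, 5
W_q = torch.randn(n_heads, d_model, d_head) / d_model**0.5
W_k = torch.randn(n_heads, d_model, d_head) / d_model**0.5
W_v = torch.randn(n_heads, d_model, d_head) / d_model**0.5
W_o = torch.randn(n_heads, d_head, d_model) / d_head**0.5


def attention_layer(resid, head_mask=None):
    """Causal self-attention that adds each head's output to the residual stream.

    resid: (seq, d_model). head_mask: (n_heads,) of 0/1 values; 0 ablates a head.
    Returns the updated residual stream and the attention patterns.
    """
    seq = resid.shape[0]
    q = torch.einsum("sd,hde->hse", resid, W_q)
    k = torch.einsum("sd,hde->hse", resid, W_k)
    v = torch.einsum("sd,hde->hse", resid, W_v)
    scores = torch.einsum("hqe,hke->hqk", q, k) / d_head**0.5
    # Causal mask: a position may only attend to itself and earlier positions.
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    pattern = scores.softmax(dim=-1)                  # (head, query, key)
    z = torch.einsum("hqk,hke->hqe", pattern, v)
    per_head = torch.einsum("hqe,hed->hqd", z, W_o)   # each head's write to the stream
    if head_mask is not None:
        per_head = per_head * head_mask[:, None, None]
    return resid + per_head.sum(dim=0), pattern


resid = torch.randn(seq_len, d_model)
baseline, pattern = attention_layer(resid)

# Ablate one head at a time and measure how far the output moves.
head_effect = []
for h in range(n_heads):
    mask = torch.ones(n_heads)
    mask[h] = 0.0
    ablated, _ = attention_layer(resid, head_mask=mask)
    head_effect.append((baseline - ablated).norm().item())

for h, effect in enumerate(head_effect):
    print(f"head {h}: output change when ablated = {effect:.3f}")
```

Zero-ablation is only one possible choice here; replacing a head's output with its mean over a dataset is a common alternative when zero lies far from the values the model normally sees.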

Across a series of hands-on “Frankenstein experiments,” we trace activations through layers, patch signals between model states, and deliberately rewire internal components to observe how behavior mutates. These interventions reveal concrete circuits behind capabilities such as copying, induction, and factual recall, and they let us test hypotheses about those circuits causally rather than inferring them from correlations alone.
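The trace-then-patch loop described above can be sketched with PyTorch forward hooks. This is a minimal sketch under stated assumptions, not the experiments from the talk: `ToyBlock`, `ToyModel`, `run_with_patch`, and `recovery` are hypothetical names, and the model is a small random residual network standing in for a transformer. The idea is to cache activations on a "clean" input, rerun a "corrupted" input with one cached activation swapped in, and measure how much of the clean output comes back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyBlock(nn.Module):
    """Hypothetical residual block: the MLP output is added to the stream."""

    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        return x + self.mlp(x)


class ToyModel(nn.Module):
    def __init__(self, d_model=8, n_layers=3, n_out=2):
        super().__init__()
        self.blocks = nn.ModuleList([ToyBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, n_out)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)


model = ToyModel().eval()
clean = torch.randn(1, 8)       # stand-in for a prompt the model handles "correctly"
corrupted = torch.randn(1, 8)   # stand-in for a minimally changed prompt

# Step 1: trace. Cache every MLP output and block output on the clean run.
cache = {}


def make_cache_hook(key):
    def hook(module, inputs, output):
        cache[key] = output.detach().clone()
    return hook


handles = []
for i, block in enumerate(model.blocks):
    handles.append(block.mlp.register_forward_hook(make_cache_hook(("mlp", i))))
    handles.append(block.register_forward_hook(make_cache_hook(("resid", i))))
with torch.no_grad():
    clean_out = model(clean)
    for handle in handles:
        handle.remove()
    corrupted_out = model(corrupted)


# Step 2: patch. A forward hook that returns a value replaces the module's output.
def run_with_patch(module, value):
    handle = module.register_forward_hook(lambda m, inputs, output: value)
    try:
        with torch.no_grad():
            return model(corrupted)
    finally:
        handle.remove()


def recovery(patched_out):
    # 1.0: the patch fully restores the clean output; 0.0: no better than corrupted.
    # Values outside [0, 1] are possible.
    gap = (corrupted_out - clean_out).norm()
    return 1 - ((patched_out - clean_out).norm() / gap).item()


mlp_effect = [
    recovery(run_with_patch(block.mlp, cache[("mlp", i)]))
    for i, block in enumerate(model.blocks)
]
for i, score in enumerate(mlp_effect):
    print(f"patching clean MLP output into layer {i}: recovery = {score:.3f}")
```

In this single-position toy, patching a whole block output (the `("resid", i)` entries) trivially restores the clean output, which makes a useful sanity check. On a real transformer the interesting cases are partial patches: one position, one head, or one MLP at a time. Libraries such as TransformerLens provide hook-based utilities for running the same kind of experiment on actual language models.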

All experiments are implemented in Python using PyTorch, TransformerLens, and nnterp. The focus is on post-hoc investigation: reading internal representations, performing controlled manipulations, and measuring how specific internal changes reshape language generation.

The session presents mechanistic interpretability as an experimental science: models are dissected, modified, and reassembled to see what survives. Attendees will leave with a working mental model of transformer internals, practical tools for running their own interpretability experiments, and a sharper intuition for how reasoning emerges from neural machinery.

Friday, May 29

11:00 - 11:30

Giuseppe Birardi