Designing a Rule-Driven Transformation Engine in PySpark (Without Losing Your Sanity)

Config-driven transformations often turn PySpark notebooks into fragile monsters. This talk shows how to design a rule-based transformation architecture in Python that stays fast, testable, and maintainable—even as rules, formats, and edge cases keep growing.

Rule-based data transformations look simple—until they aren’t. My first attempt was exactly what many of us try: Python loops, conditional logic, and quick fixes inside a PySpark notebook. It worked… until performance dropped, rules multiplied, and testing became nearly impossible.

This talk tells the story of how that initial approach failed, and what changed when I stopped treating transformation rules as code and started treating them as data.

We’ll explore how to design a rule-driven transformation architecture in PySpark where behavior is defined by external configuration, but implemented with explicit, testable logic. Along the way, I’ll share the trade-offs I had to face: when Spark-native expressions are enough, when UDFs are unavoidable, and how small architectural decisions can make a notebook either evolvable—or brittle.

You’ll see practical patterns for:

Modeling transformation rules outside the core logic
Applying dynamic mappings and aggregations safely
Keeping performance under control as complexity grows
Structuring notebook code so it can actually be tested
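
A minimal sketch of the first and last patterns above, not the architecture presented in the talk: rules live as plain data, and each operation is a small explicit function that can be unit-tested without a Spark session. The rule fields (target, op, source, arg), the operation names, and the allowed cast types are all hypothetical choices made for this illustration. The compiled output is a list of Spark SQL expression strings of the kind that `DataFrame.selectExpr` accepts.

```python
from dataclasses import dataclass


# Hypothetical rule schema; the field names are illustrative, not from the talk.
@dataclass(frozen=True)
class Rule:
    target: str    # output column name
    op: str        # name of a registered operation
    source: str    # input column name
    arg: str = ""  # optional operation argument (e.g. a cast type)


# Assumed allow-list so a config file cannot inject arbitrary SQL via `arg`.
ALLOWED_CAST_TYPES = {"BIGINT", "DOUBLE", "STRING"}


def _rename(rule: Rule) -> str:
    return f"`{rule.source}`"


def _upper(rule: Rule) -> str:
    return f"UPPER(`{rule.source}`)"


def _cast(rule: Rule) -> str:
    if rule.arg not in ALLOWED_CAST_TYPES:
        raise ValueError(f"unsupported cast type {rule.arg!r}")
    return f"CAST(`{rule.source}` AS {rule.arg})"


# Explicit registry: adding a rule type means adding one tested function here.
OPERATIONS = {"rename": _rename, "upper": _upper, "cast": _cast}


def compile_rules(raw_rules):
    """Turn rule dicts (e.g. parsed from JSON) into Spark SQL expression strings."""
    expressions = []
    for raw in raw_rules:
        rule = Rule(**raw)
        if rule.op not in OPERATIONS:
            raise ValueError(f"unknown op {rule.op!r} for target {rule.target!r}")
        expressions.append(f"{OPERATIONS[rule.op](rule)} AS `{rule.target}`")
    return expressions


# Rules as data; in practice these would be loaded from an external config file.
RULES = [
    {"target": "customer_id", "op": "cast", "source": "id", "arg": "BIGINT"},
    {"target": "country", "op": "upper", "source": "country_code"},
]

EXPRESSIONS = compile_rules(RULES)
# With a DataFrame `df` at hand, one way to apply them: df.selectExpr(*EXPRESSIONS)
```

Because the compile step is pure Python, invalid rules fail before any Spark job starts, and the resulting expressions stay Spark-native rather than relying on UDFs.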

This is not a “do it my way” talk. It’s a reflection on mistakes, constraints, and design decisions that emerged from a real problem—and lessons that apply to many PySpark transformation workflows.

The talk is aimed at data engineers and developers with basic Spark experience who want to move beyond ad-hoc transformations toward more robust and maintainable designs.

Saturday, May 30

11:05 - 11:35

THOMAS SCARDONI