Workshop

Write Your First High-Performance GPU Kernel in Python!

Friday, May 29

11:00 - 13:00
Room: Tigelle
Language: English
Audience level: Intermediate
Elevator pitch

With the GPU boom and CUDA's high barrier to entry, Python has become the practical bridge between developer productivity and performance. This workshop teaches hardware-aware GPU programming for Python developers, with a focus on how performance is determined less by model code and more by data movement.

Abstract

This is a fully hands-on workshop focused on writing your first high-performance GPU kernel in Python. Instead of starting with APIs, participants will begin by benchmarking a naive Python GPU kernel and observing why it fails to scale. From there, each section introduces a single hardware concept (memory hierarchy, arithmetic intensity, tiling, or fusion), followed immediately by a coding exercise that applies it.

Participants will progressively transform slow Python kernels into efficient Triton implementations, learning how Python is lowered into PTX and how respecting GPU hardware constraints enables near-CUDA performance. Every concept is reinforced through code, measurement, and performance comparison.
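
To give a flavor of the baseline step, here is a minimal sketch of timing a naive elementwise kernel from Python and converting the wall-clock time into achieved bandwidth. The array size and variable names are illustrative, not the workshop's actual materials:

    import time
    import torch

    # Illustrative baseline: a naive elementwise op on the GPU.
    N = 1 << 24
    x = torch.randn(N, device="cuda")
    y = torch.randn(N, device="cuda")

    _ = x + y                      # warm-up launch (one-time setup costs)
    torch.cuda.synchronize()       # GPU work is async; sync before timing
    t0 = time.perf_counter()
    z = x + y
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    # Vector add moves 3 floats per element: two reads plus one write.
    bytes_moved = 3 * N * x.element_size()
    print(f"wall-clock: {elapsed * 1e3:.3f} ms, "
          f"achieved bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")

Comparing the achieved number against the device's peak HBM bandwidth is what exposes the kernel as memory-bound, long before a profiler is opened.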

Here is the workshop flow:

  1. Environment Warm-up & Baseline

    • Verify GPU, Triton, PyTorch setup
    • Run the baseline vector-add kernel
    • Record wall-clock time
  2. Why Your GPU Code Is Slow

    • GPUs are latency-hiding throughput machines
    • Why FLOPs don’t matter when memory stalls
    • Predict the bottleneck before profiling
  3. The Hardware You’re Actually Programming

    • SMs, warps, occupancy — only what matters
    • Memory hierarchy: registers → shared → L2 → HBM
    • Why global memory dominates everything
  4. Decide Before You Optimize: Roofline

    • Arithmetic intensity in practice
    • Compute-bound vs memory-bound
    • Classify the baseline kernel
  5. Hit the Memory Wall

    • Profile the baseline kernel
    • Measure achieved bandwidth
    • Explain failure using hardware constraints
  6. Hardware-Aware Patterns That Work

    • Thinking in blocks, not threads
    • Tiling and data reuse
    • Why fusion is the only real speedup
  7. Your First Fast Python GPU Kernel

    • Write a Triton kernel (see the sketch after this list)
    • Trace Python → Triton IR → PTX
    • Compare performance with baseline
  8. Kill the Memory Wall & Wrap-Up

    • Fuse two ops into one kernel
    • Final benchmark
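
To make steps 6 through 8 concrete, here is a minimal Triton sketch in the spirit of the workshop: it thinks in blocks rather than threads, and it fuses two ops (an add and a ReLU) into a single kernel. The kernel name, block size, and choice of fused pair are illustrative assumptions, not the workshop's actual exercise:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                              BLOCK_SIZE: tl.constexpr):
        # Each program instance handles BLOCK_SIZE contiguous elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        # Fusion: add + ReLU in one kernel means one round trip to HBM
        # instead of two kernel launches with an intermediate write.
        tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

    def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

For the Python → Triton IR → PTX trace, recent Triton releases expose the compiled artifacts on the handle returned by a kernel launch (an asm dict with entries such as "ttir" and "ptx"); the exact attributes vary between versions, so treat that as a pointer rather than a stable API.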

By the end of the workshop, participants will be able to:

  • Explain why most Python GPU code is memory-bound
  • Identify performance bottlenecks using the roofline model (a worked example follows this list)
  • Understand GPU memory hierarchies and latency hiding
  • Apply hardware-aware design patterns such as tiling and fusion
  • Write and benchmark custom GPU kernels in Python using Triton
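
As a worked example of the roofline outcome above, here is a back-of-the-envelope classification of the baseline vector-add kernel. The device numbers are hypothetical placeholders, not any particular GPU's spec sheet:

    # Roofline classification for a float32 vector add.
    flops_per_element = 1          # one addition
    bytes_per_element = 12         # two 4-byte loads + one 4-byte store
    ai = flops_per_element / bytes_per_element   # ~0.083 FLOP/byte

    peak_flops = 30e12             # hypothetical: 30 TFLOP/s fp32
    peak_bw = 1.0e12               # hypothetical: 1 TB/s HBM

    # Ridge point ~30 FLOP/byte; vector add sits far to its left, so the
    # attainable ceiling is bandwidth * intensity, not peak FLOPs.
    attainable = min(peak_flops, peak_bw * ai)
    print(f"AI = {ai:.3f} FLOP/B, attainable ~ {attainable / 1e9:.0f} GFLOP/s "
          f"(memory-bound)")
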
Tags: Hardware, Performance and scalability techniques, ML and AI
Participant

Abhik Sarkar

I am Abhik Sarkar, a machine learning engineer focused on building real-world computer vision systems that actually run at scale. My work lives at the intersection of software engineering, GPU hardware, and production reliability.

I currently lead machine learning at Cloudastructure, where I design end-to-end vision pipelines spanning high-throughput video ingestion, GPU-accelerated decoding, and low-latency inference. My daily toolset includes PyTorch, TensorRT, NumPy, OpenCV, CuPy, PyCUDA, and ONNX Runtime.

Outside of work, I actively seek technical discussions and regularly attend conferences to understand how engineers around the world approach hard problems. I care as much about learning as I do about sharing, and I make a deliberate effort to pass on whatever I know in a form others can actually use.

In my free time, I cook, and I make chocolate bars from raw cacao beans.