Workshop

Write Your First High-Performance GPU Kernel in Python!

Friday, May 29

11:00 - 13:00
Room: Tigelle
Language: English
Audience level: Intermediate
Elevator pitch

With the GPU boom and CUDA's high barrier to entry, Python has become the practical bridge between developer productivity and performance. This workshop teaches hardware-aware GPU programming for Python developers, with a focus on how performance is determined less by model code and more by data movement.

Abstract

This is a fully hands-on workshop focused on writing your first high-performance GPU kernel in Python. Instead of starting with APIs, participants will begin by benchmarking a naive Python GPU kernel and observing why it fails to scale. From there, each section introduces a single hardware concept (memory hierarchy, arithmetic intensity, tiling, or fusion), followed immediately by a coding exercise that applies it.

Participants will progressively transform slow Python kernels into efficient Triton implementations, learning how Python is lowered into PTX and how respecting GPU hardware constraints enables near-CUDA performance. Every concept is reinforced through code, measurement, and performance comparison.
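
To give a flavor of the baseline step, here is a minimal sketch of timing a naive elementwise kernel from Python and converting the wall-clock time into achieved bandwidth. The array size and variable names are illustrative, not the workshop's actual materials:

    import time
    import torch

    # Illustrative baseline: a naive elementwise op on the GPU.
    N = 1 << 24
    x = torch.randn(N, device="cuda")
    y = torch.randn(N, device="cuda")

    _ = x + y                      # warm-up launch (one-time setup costs)
    torch.cuda.synchronize()       # GPU work is async; sync before timing
    t0 = time.perf_counter()
    z = x + y
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    # Vector add moves 3 floats per element: two reads plus one write.
    bytes_moved = 3 * N * x.element_size()
    print(f"wall-clock: {elapsed * 1e3:.3f} ms, "
          f"achieved bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")

Comparing the achieved number against the device's peak HBM bandwidth is what exposes the kernel as memory-bound, long before a profiler is opened.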

Here is the workshop flow:

  1. Environment Warm-up & Baseline

    • Verify GPU, Triton, PyTorch setup
    • Run the baseline vector-add kernel
    • Record wall-clock time
  2. Why Your GPU Code Is Slow

    • GPUs are latency-hiding throughput machines
    • Why FLOPs don’t matter when memory stalls
    • Predict the bottleneck before profiling
  3. The Hardware You’re Actually Programming

    • SMs, warps, occupancy — only what matters
    • Memory hierarchy: registers → shared → L2 → HBM
    • Why global memory dominates everything
  4. Decide Before You Optimize: Roofline

    • Arithmetic intensity in practice
    • Compute-bound vs memory-bound
    • Classify the baseline kernel
  5. Hit the Memory Wall

    • Profile the baseline kernel
    • Measure achieved bandwidth
    • Explain failure using hardware constraints
  6. Hardware-Aware Patterns That Work

    • Thinking in blocks, not threads
    • Tiling and data reuse
    • Why fusion is the only real speedup
  7. Your First Fast Python GPU Kernel

    • Write a Triton kernel (see the sketch after this list)
    • Trace Python → Triton IR → PTX
    • Compare performance with baseline
  8. Kill the Memory Wall & Wrap-Up

    • Fuse two ops into one kernel
    • Final benchmark
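
To make steps 6 through 8 concrete, here is a minimal Triton sketch in the spirit of the workshop: it thinks in blocks rather than threads, and it fuses two ops (an add and a ReLU) into a single kernel. The kernel name, block size, and choice of fused pair are illustrative assumptions, not the workshop's actual exercise:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                              BLOCK_SIZE: tl.constexpr):
        # Each program instance handles BLOCK_SIZE contiguous elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        # Fusion: add + ReLU in one kernel means one round trip to HBM
        # instead of two kernel launches with an intermediate write.
        tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

    def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

For the Python → Triton IR → PTX trace, recent Triton releases expose the compiled artifacts on the handle returned by a kernel launch (an asm dict with entries such as "ttir" and "ptx"); the exact attributes vary between versions, so treat that as a pointer rather than a stable API.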

By the end of the workshop, participants will be able to:

  • Explain why most Python GPU code is memory-bound
  • Identify performance bottlenecks using the roofline model (a worked example follows this list)
  • Understand GPU memory hierarchies and latency hiding
  • Apply hardware-aware design patterns such as tiling and fusion
  • Write and benchmark custom GPU kernels in Python using Triton
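
As a worked example of the roofline outcome above, here is a back-of-the-envelope classification of the baseline vector-add kernel. The device numbers are hypothetical placeholders, not any particular GPU's spec sheet:

    # Roofline classification for a float32 vector add.
    flops_per_element = 1          # one addition
    bytes_per_element = 12         # two 4-byte loads + one 4-byte store
    ai = flops_per_element / bytes_per_element   # ~0.083 FLOP/byte

    peak_flops = 30e12             # hypothetical: 30 TFLOP/s fp32
    peak_bw = 1.0e12               # hypothetical: 1 TB/s HBM

    # Ridge point ~30 FLOP/byte; vector add sits far to its left, so the
    # attainable ceiling is bandwidth * intensity, not peak FLOPs.
    attainable = min(peak_flops, peak_bw * ai)
    print(f"AI = {ai:.3f} FLOP/B, attainable ~ {attainable / 1e9:.0f} GFLOP/s "
          f"(memory-bound)")
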
Tags: Hardware, Performance and scalability techniques, ML and AI
Participant

Abhik Sarkar

I am Abhik Sarkar, a machine learning engineer focused on building real-world computer vision systems that actually run at scale. My work lives at the intersection of software engineering, GPU hardware, and production reliability.

I currently lead machine learning at Cloudastructure, where I design end-to-end vision pipelines spanning high-throughput video ingestion, GPU-accelerated decoding, and low-latency inference. My daily toolset includes PyTorch, TensorRT, NumPy, OpenCV, CuPy, PyCUDA, and ONNX Runtime.

Outside of work, I actively seek technical discussions and regularly attend conferences to understand how engineers around the world approach hard problems. I care as much about learning as I do about sharing, and I make a deliberate effort to pass on whatever I know in a form others can actually use.

In my free time, I cook, and I make chocolate bars from raw cacao beans.