What happens when you give a hard scientific problem to 1,000 Python users? In this talk, I share lessons from designing and running a Kaggle challenge to detect weak, long-lasting gravitational-wave signals, showing how AI, feature engineering, and GPUs helped rethink a decades-old search problem.
Detecting weak, long-duration signals in a noisy background is a fundamental challenge across many scientific domains. Continuous gravitational waves, expected from rapidly spinning neutron stars, are a prime example: they are persistent signals that hide deep in the noise of current interferometric gravitational-wave detectors (LIGO, Virgo, and KAGRA) and remain undetected despite decades of dedicated effort by the community.
The search for these signals poses extreme data-analysis challenges, as it requires filtering massive datasets against billions of waveform templates while operating under strict false-alarm constraints. The resulting computational cost quickly becomes unaffordable, preventing the use of statistically optimal methods and forcing difficult trade-offs between sensitivity and feasibility.
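To see why the cost scales so badly, here is a toy sketch of template-bank filtering: the data are correlated against every template in a bank and the loudest statistic is kept, so the cost grows linearly with the bank size. All numbers, frequencies, and the sinusoidal waveform model are illustrative assumptions, not the actual search pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matched-filter search: a weak sinusoid buried in white noise,
# tested against a bank of sinusoidal templates.
fs = 1024                         # sampling rate [Hz] (assumed)
t = np.arange(0, 4, 1 / fs)       # 4 s of data
f_true = 123.4                    # hypothetical signal frequency [Hz]
data = 0.2 * np.sin(2 * np.pi * f_true * t) + rng.normal(size=t.size)

# The "template bank": 2001 trial frequencies. Real searches use
# billions of templates over many more parameters, hence the cost.
template_freqs = np.linspace(100.0, 150.0, 2001)
templates = np.sin(2 * np.pi * template_freqs[:, None] * t[None, :])
templates /= np.linalg.norm(templates, axis=1, keepdims=True)

# One inner product per template; keep the loudest statistic.
stats = templates @ data
best = template_freqs[np.argmax(np.abs(stats))]
print(f"loudest template: {best:.2f} Hz")
```

Even this toy version makes the trade-off visible: doubling the bank doubles the work, and realistic banks are many orders of magnitude larger.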
To explore alternative approaches and benefit from recent advances in GPU computing and artificial intelligence, we opened this problem to the wider data-science community by hosting a large, open Kaggle competition [0]. More than 1,000 participants used Python, machine learning, and data-driven techniques to rethink how such signals can be detected in real, noisy data.
In this talk, I will discuss the design, development, and outcomes of this challenge. I will cover the key design principles required to build a scientifically realistic yet accessible competition, including handling massive datasets, defining meaningful performance metrics, and providing sufficient physical and statistical context to enable participants to engage with a complex scientific problem.
I will then evaluate the top Kaggle solutions, showing how the best-performing approaches combined feature engineering, machine-learning models, and GPU acceleration to achieve reductions in computational cost of one to three orders of magnitude compared to standard workflows.
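A minimal sketch of the kind of restructuring that makes GPU acceleration pay off: replacing a Python-level loop over templates with a single batched array operation. This is an illustrative pattern, not code from any challenge solution; on a GPU the same batched formulation can be run, for instance, by swapping NumPy for an array library such as CuPy.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=4096)
templates = rng.normal(size=(2000, 4096))   # toy template bank

# Loop version: one Python iteration per template.
stats_loop = np.array([tpl @ data for tpl in templates])

# Batched version: a single matrix-vector product computing the
# same statistics, which maps naturally onto GPU hardware.
stats_batch = templates @ data

assert np.allclose(stats_loop, stats_batch)
```

The statistic is unchanged; only the formulation is, which is why such rewrites can yield large speedups without sacrificing sensitivity.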
Beyond its astrophysical motivation, this effort resulted in the first open, standardized benchmark for continuous gravitational-wave detection, released as a reusable dataset. The lessons learned extend well beyond gravitational-wave astronomy and are broadly applicable to Python developers working on noisy data, large-scale inference, reproducibility, and performance-critical applications.
After this talk, attendees will:
- Understand how to design and structure a large-scale data challenge using Python and open datasets
- Learn practical strategies for detecting weak signals in noisy, real-world data
- See how machine learning and feature engineering can complement traditional statistical methods
- Gain insight into when and how GPU acceleration in Python provides real performance benefits
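As a taste of the weak-signal strategies above, here is a sketch of the classic semi-coherent trick for persistent signals: split the data into short segments, Fourier-transform each one, and average the power per frequency bin, so that a line invisible in any single segment emerges in the stack. The durations, amplitudes, and segment counts are illustrative assumptions, not challenge parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# A weak persistent sinusoid buried in white noise.
fs, n_seg, seg_len = 256, 200, 1024      # toy numbers
t = np.arange(n_seg * seg_len) / fs
data = 0.08 * np.sin(2 * np.pi * 60.0 * t) + rng.normal(size=t.size)

# Semi-coherent stacking: per-segment power spectra, averaged.
segments = data.reshape(n_seg, seg_len)
power = np.abs(np.fft.rfft(segments, axis=1)) ** 2
stacked = power.mean(axis=0)             # incoherent sum over segments

freqs = np.fft.rfftfreq(seg_len, d=1 / fs)
print(f"loudest bin: {freqs[np.argmax(stacked)]:.2f} Hz")
```

Averaging across segments shrinks the noise fluctuations while the signal's contribution stays put, which is why persistence is the one property these searches can lean on.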
My name is Rodrigo Tenorio, and I am a postdoctoral researcher at the University of Milano-Bicocca. My work sits at the intersection of scientific computing, data analysis, and high-performance computing. I have over five years of experience using Python for anomaly detection, Bayesian inference, and machine-learning analysis of large time-series datasets, including data from gravitational-wave detectors.
I contribute to several open-source Python data-analysis packages, including PyFstat [1], which played a key role in the development of the Kaggle challenge discussed in this talk. In my free time I enjoy learning about multiple aspects of computer science, such as functional programming, GPU computing, distributed systems, and automatic differentiation. Outdoor sports are also a must.
[0] R. Tenorio et al., Machine Learning: Science and Technology 6, 040702 (2025)
[1] D. Keitel et al., Journal of Open Source Software 6, 3000 (2021)