Talk

Ducks to the rescue - ETL using Python and DuckDB

Saturday, May 30

11:05 - 11:35
Room: Tortellini
Language: English
Audience level: Intermediate
Elevator pitch

The “extract, transform, load” (ETL) pattern has often required managing complex systems in the cloud. The talk will demonstrate how all this can be greatly simplified by applying modern tools for the task: Python and DuckDB, both readily available to run on most systems - even your notebook.

Abstract

ETL stands for “extract, transform, load” and describes the process of moving data from one system to another.

Traditionally, ETL was done in exactly that order: first you extract the data you want to process, then you transform it, and then you load it into the target system. More modern approaches based on data lakes swap the T and the L (ELT), since transformation is done more efficiently in a database system, especially when it comes to large volumes of data.
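To make the difference between the two orderings concrete, here is a minimal sketch that runs the same job both ways. It is an illustration only, not material from the talk: the sample data, the table names and the helper functions are all hypothetical, and the standard library's sqlite3 module is used purely as a stand-in database so that the sketch runs without extra installs.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: amounts arrive as text, one row is malformed.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,n/a
3,alice,4.50
"""


def extract():
    """Extract: read raw rows from the source (here an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(RAW_CSV)))


def etl(con):
    """ETL: transform in Python first, then load only the clean result."""
    totals = {}
    for row in extract():
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + amount
    con.execute("CREATE TABLE totals_etl (customer TEXT, total REAL)")
    con.executemany("INSERT INTO totals_etl VALUES (?, ?)", totals.items())


def elt(con):
    """ELT: load the raw rows unchanged, then transform inside the database."""
    con.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    con.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)",
        extract(),
    )
    # The cleaning and aggregation now happen in SQL, on the loaded raw data.
    con.execute(
        """
        CREATE TABLE totals_elt AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'
        GROUP BY customer
        """
    )


con = sqlite3.connect(":memory:")
etl(con)
elt(con)
etl_rows = con.execute("SELECT * FROM totals_etl ORDER BY customer").fetchall()
elt_rows = con.execute("SELECT * FROM totals_elt ORDER BY customer").fetchall()
print(etl_rows == elt_rows)
```

Both variants end up with the same totals. What differs is where the work happens: in the ELT variant the raw rows stay available in the database, and the transformation is a single SQL statement that the database engine can optimize.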

To make all this work, the usual approach is to run a workflow system that manages all the intermediate steps, together with a large data lake database and distributed storage systems. This results in a lot of complexity and a need for system/cluster administration and maintenance.

Now, with today’s computers, most data sizes used in ETL no longer need all this complexity. Even notebooks or single VMs can handle the load when used with external object storage, so all you really need is the right software stack to manage your ETL - without all the overhead:

  • Python has grown to be the number one programming language on the planet and is especially well suited for integration work due to its many readily available connectors to plenty of backend systems. It often comes preinstalled on Linux machines and is easy to install on most other systems.

  • DuckDB has emerged as one of the most capable embedded OLAP database systems and supports data lakes with the DuckLake extension, right out of the box. Installation is just a uv add duckdb away.

Both can be run on the same machine and are very resource friendly.

The talk will give an overview of the typical steps involved in ETL processes, give a short intro to DuckDB and showcase how DuckDB can be put to good use when implementing ETL processes. If time permits, I can also cover a few advanced topics addressing optimization strategies.

Tags: Databases, Data Engineering, Performance and scalability techniques
Participant

Marc-André Lemburg

Marc-Andre is the CEO and founder of eGenix.com, a Python-focused boutique project and consulting company based in Germany, specializing in the data, finance and database space. He has a degree in mathematics from the University of Düsseldorf.

His work with and for Python started in 1994. He is a Python Core Developer, designed and implemented the Unicode support in Python, and is the editor of the Python DB-API and the author of several open source libraries and tools (e.g. the mx Extensions mxDateTime and mxODBC).

Marc-Andre is a EuroPython Society (EPS) Fellow, a Python Software Foundation (PSF) founding Fellow and a co-founder of a local Python meeting in Düsseldorf (PyDDF). He served on the boards of the PSF and EPS for many years and loves to contribute to the growth of Python wherever he can.

More information is available at https://malemburg.com/