Ducks to the rescue - ETL using Python and DuckDB

The “extract, transform, load” (ETL) pattern has often required managing complex systems in the cloud. The talk will demonstrate how all this can be greatly simplified by applying modern tools for the task: Python and DuckDB, both readily available to run on most systems - even your notebook.

ETL stands for “extract, transform, load” and describes the process of moving data from one system to another.

Traditionally, ETL was done in exactly that order: first you extract the data you want to process, then you transform it, and then you load it into the target system. More modern approaches based on data lakes swap the T and the L (often written ELT), since transformation is usually done more efficiently inside a database system, especially for large volumes of data.

To make all this work, the usual approach combines a workflow system that manages all the intermediate steps, a large data lake database and distributed storage systems. This results in a lot of complexity and a need for system or cluster administration and maintenance.

Now, with today’s computers, most data sizes used in ETL no longer need all this complexity. Even notebooks or single VMs can handle the load when used with external object storage, so all you really need is the right software stack to manage your ETL, without all the overhead:

Python has grown to be the number one programming language on the planet and is especially well suited for integration work due to its many readily available connectors to plenty of backend systems. It often comes preinstalled on Linux machines and is easy to install on most other systems.
DuckDB has emerged as one of the most capable embedded OLAP database systems and supports data lakes right out of the box with the DuckLake extension. Installation is just a uv add duckdb away.

Both can be run on the same machine and are very resource friendly.
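
To make the "load first, transform in the database" idea concrete, here is a minimal sketch of both tools working in a single Python process. The table names, columns and sample rows are invented for illustration and are not taken from the talk. The fallback to the standard library's sqlite3 module is only there so the sketch runs on machines without DuckDB installed.

```python
# Minimal ELT sketch: load raw data first, then transform it with SQL.
# All table names, columns and sample rows are hypothetical.
try:
    import duckdb

    con = duckdb.connect()  # in-memory DuckDB database inside this process
except ImportError:
    # Fallback (an assumption for portability, not part of the talk):
    # sqlite3 accepts the same simple SQL used below.
    import sqlite3

    con = sqlite3.connect(":memory:")

# Extract: in a real pipeline these rows would come from a source system.
raw_rows = [
    (1, "10.50", " de "),
    (2, "4.50", "DE"),
    (3, "20.00", "us"),
]

# Load: store the data untouched in a raw staging table.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean and aggregate inside the database using SQL.
con.execute(
    """
    CREATE TABLE orders_by_country AS
    SELECT upper(trim(country)) AS country,
           sum(CAST(amount AS DOUBLE)) AS total
    FROM raw_orders
    GROUP BY upper(trim(country))
    """
)

result = con.execute(
    "SELECT country, total FROM orders_by_country ORDER BY country"
).fetchall()
print(result)
```

The raw table stays available after the transformation, so a cleaning rule can be changed and rerun without extracting the data again.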

The talk will give an overview of the typical steps involved in ETL processes, briefly introduce DuckDB and show how it can be put to good use when implementing those processes. If time permits, I will also cover a few advanced topics on optimization strategies.

Saturday, May 30

11:05 - 11:35

Marc-André Lemburg