Delta and Iceberg solved data lake chaos but introduced a new kind: layers of JSON metadata everywhere. DuckLake lets you build a lakehouse with just two tools: a SQL database and a blob storage bucket. The result? True multi-table ACID transactions, time travel, and a metadata layer in pure SQL.
We have spent the last five years complicating the Data Lakehouse. We built towers of JSON manifest files, wrangled heavy JVM-based catalogs, and debugged “eventual consistency” issues just to query a Parquet file. It’s time to stop the madness. Enter DuckLake, a new open table format that challenges the dominance of Iceberg and Delta Lake by asking a radical question: what if we managed table metadata in a simple database instead of a complex file system?
In this session, we will explore the architecture of DuckLake, which couples the low-cost storage of S3 with the transactional guarantees of a standard SQL database. We will demonstrate how to build a fully ACID-compliant Lakehouse using only Python and DuckDB—no Spark cluster required. You will see live code for creating schemas, performing time travel, and managing concurrent writers.
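To make the demo concrete, here is a minimal sketch of the kind of code the session walks through. It assumes the DuckLake extension for DuckDB used from Python, a local DuckDB file as the catalog database, and a local directory standing in for the blob storage bucket; the option and function names follow the DuckLake extension docs, but treat exact spellings and snapshot numbering as assumptions that may vary by version.

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB session
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Attach a DuckLake catalog. The metadata database here is a local
# DuckDB file; DATA_PATH is where the Parquet data files are written.
# 'lake_files/' is a placeholder -- an s3:// bucket works the same way
# once the httpfs extension and credentials are configured.
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

# Ordinary SQL DDL/DML; each statement commits as an ACID transaction.
con.execute("CREATE TABLE lake.events (id INTEGER, payload VARCHAR)")
con.execute("INSERT INTO lake.events VALUES (1, 'hello'), (2, 'world')")
con.execute("DELETE FROM lake.events WHERE id = 2")

# Time travel: list the snapshots, then read the table as of an
# earlier one (the exact version id depends on the snapshot history).
print(con.execute("FROM lake.snapshots()").fetchall())
print(con.execute("SELECT * FROM lake.events AT (VERSION => 2)").fetchall())
```

For genuinely concurrent writers, the same `ducklake:` attach string can point at a server-grade catalog such as Postgres instead of a local DuckDB file; the data files stay on object storage either way.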
However, no architecture is a silver bullet. We will candidly discuss where DuckLake struggles, specifically addressing:
Scale limits: The inflection point where a distributed catalog or engine becomes necessary over a lightweight SQL approach.
Ecosystem maturity: The trade-offs of using a newer format versus the broad compatibility of established giants like Delta Lake and Iceberg.
Key Takeaways:
Understand the “Metadata-in-Database” architecture and why it outperforms file-based manifests for small-to-medium workloads (see the metadata-query sketch after this list).
Learn to deploy a serverless, multi-user Data Lakehouse using the Python ecosystem.
Gain a pragmatic framework for selecting between lightweight Lakehouses and enterprise-grade clusters.
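To illustrate the first takeaway, a small sketch of what “metadata in a database” means in practice: the catalog created in the earlier sketch (`metadata.ducklake`) is an ordinary DuckDB database, and snapshots and file listings are plain rows you can query with SQL. The table and column names below follow the published DuckLake spec; treat the exact schema as an assumption that may differ between versions.

```python
import duckdb

# The catalog file from the earlier sketch is just a DuckDB database;
# open it directly and inspect the metadata with plain SQL.
meta = duckdb.connect("metadata.ducklake")

# No JSON manifests on object storage: every snapshot is a row.
print(meta.execute("SHOW TABLES").fetchall())
print(meta.execute(
    "SELECT snapshot_id, snapshot_time FROM ducklake_snapshot"
).fetchall())

# Data files are tracked relationally too (column names per the
# DuckLake spec; the exact schema may vary across versions).
print(meta.execute(
    "SELECT table_id, path FROM ducklake_data_file"
).fetchall())
```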