Designing for High Availability at scale: a real-world PG DBaaS scenario

Everybody talks about high availability, everybody wants high availability. And, indeed, options exist for high availability in PostgreSQL.

But what does “high” really mean? And how do you design for HA at scale, when you need to manage tens of thousands of services? This talk presents a real-world scenario.

Aiven manages tens of thousands of services across multiple clouds.

Offering a fully managed service does not equate to just “running some software in the cloud”. It comes with a lot of strings attached, including managing customer expectations, allowing many internal operators to work on systems they didn’t design, and handling cloud provider quirks, sometimes all at once.

Hence, a High Availability solution cannot be just that, high availability; it needs to be observable and easy to operate, provide fast and simple ways out of common and less common failure paths, and offer clear performance and reliability guarantees to customers.

This talk will provide insights into the challenges of operating such services at scale, and how we solved them with our HA implementation, which leverages Patroni and pgBackRest. Some of the topics that will be discussed:

Frequent-at-scale but non-obvious failure modes;
CAP theorem and how it translates to what happens in the real world;
Designing for security;
Enabling constraints: how to make your service more useful for customers by taking away features rather than adding them.

Thursday, May 28

15:20 - 16:05

Alan Franzoni