Multi-region failover for a global API
Active-active across three regions with automatic failover measured in seconds, not hours.
- <30s recovery time
- 3 active regions
- 99.995% availability
## The challenge
A single-region API meant a single point of failure — and an SLA the team couldn’t honestly promise.
## What we did
- Adopted CockroachDB for a geo-distributed, strongly consistent data layer.
- Fronted services with Envoy and health-aware routing for instant traffic shifting.
- Ran quarterly game-days that physically pulled a region to prove the numbers.
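To make the health-aware routing idea concrete, here is a minimal Python sketch of the decision logic, not the Envoy configuration used on the project. The region names, the `Region` and `route` helpers, the probe interval, and the unhealthy threshold are all hypothetical values chosen for illustration. It shows why detection time is roughly the probe interval multiplied by the number of consecutive failures required to eject a region.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Tracks consecutive failed health probes for one region (illustrative)."""
    name: str
    unhealthy_threshold: int = 3   # hypothetical: failures before ejection
    consecutive_failures: int = 0

    def record_probe(self, ok: bool) -> None:
        # A single success resets the counter; failures accumulate.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.unhealthy_threshold


def route(regions: list[Region], preferred: str) -> str:
    """Return the preferred region if healthy, else the first healthy one."""
    by_name = {r.name: r for r in regions}
    if by_name[preferred].healthy:
        return preferred
    for region in regions:
        if region.healthy:
            return region.name
    raise RuntimeError("no healthy region available")


def worst_case_detection_seconds(probe_interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate upper bound on time to eject a failed region.

    Ignores probe timeouts and network jitter.
    """
    return probe_interval_s * unhealthy_threshold


# Hypothetical three-region setup.
regions = [Region("us-east"), Region("eu-west"), Region("ap-south")]
```

With a 5-second probe interval and a threshold of 3, for example, a failed region would be ejected within about 15 seconds, which leaves headroom inside a 30-second recovery target.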
## Outcome
Failover now completes in under 30 seconds, unattended, and the platform holds a verified 99.995% availability.
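For a sense of scale, the availability figure can be turned into a downtime budget. The sketch below is plain arithmetic rather than anything measured on the project; the helper name and the 365-day year are assumptions. At 99.995%, the budget comes to roughly 26.3 minutes per year, or about 2.2 minutes in a 30-day month.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of allowed downtime in a period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


yearly = downtime_budget_minutes(99.995)        # budget over a 365-day year
monthly = downtime_budget_minutes(99.995, 30)   # budget over a 30-day month

# How many 30-second failovers would fit in the yearly budget,
# assuming each failover counts fully as downtime.
failovers_per_year = int(yearly * 60 // 30)
```

Read this way, a sub-30-second failover is what makes the target realistic: a single hour-long manual recovery would consume more than two years of budget.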