My thoughts after this week’s AWS outage:

It’s always DNS.
Building resilience against an entire AWS region going down isn’t worth it for most companies.

Early in my career as a DBA, I was tasked with improving availability for a critical database. I added a hot failover and started pricing out full clustering when my account rep quietly pulled me aside and told me:

“You probably don’t want clustering. It’s expensive, complex, and will likely cause more downtime than it prevents.”

It was a hot take (and they were turning down a sale), but they were right.

Going multi-cloud or multi-region isn’t just expensive, it’s complicated. How does that cost compare to a rare regional outage? How much downtime will you incur implementing it? How many more engineers will you need to support it?
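
If you want to put rough numbers on those questions, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the field names, the simple "frequency x duration x hourly cost" model, and especially the example figures, which are made up and not taken from any real outage or bill. Plug in your own.

```python
from dataclasses import dataclass


@dataclass
class Assumptions:
    """Hypothetical inputs; replace every value with your own estimates."""
    outages_per_year: float        # expected regional outages per year
    hours_per_outage: float        # expected duration of each outage
    cost_per_hour: float           # revenue/productivity lost per hour down
    extra_infra_per_year: float    # added cloud spend for a second region
    extra_engineers: float         # additional headcount to support it
    cost_per_engineer: float       # fully loaded annual cost per engineer
    rollout_downtime_hours: float  # downtime incurred while implementing it


def single_region_cost(a: Assumptions) -> float:
    """Expected yearly cost of simply accepting regional outages."""
    return a.outages_per_year * a.hours_per_outage * a.cost_per_hour


def multi_region_first_year_cost(a: Assumptions) -> float:
    """First-year cost of going multi-region: running costs plus the
    downtime caused by the rollout itself. Optimistically assumes the
    second region eliminates regional-outage losses entirely."""
    running = a.extra_infra_per_year + a.extra_engineers * a.cost_per_engineer
    rollout = a.rollout_downtime_hours * a.cost_per_hour
    return running + rollout


# Made-up example: one 8-hour regional outage every two years.
example = Assumptions(
    outages_per_year=0.5,
    hours_per_outage=8,
    cost_per_hour=20_000,
    extra_infra_per_year=150_000,
    extra_engineers=1,
    cost_per_engineer=180_000,
    rollout_downtime_hours=4,
)

accept = single_region_cost(example)              # 80000.0
mitigate = multi_region_first_year_cost(example)  # 410000
print(f"Accept outages: ${accept:,.0f}/yr")
print(f"Go multi-region: ${mitigate:,.0f} in year one")
```

With these invented numbers, accepting the outage is about five times cheaper in the first year. Your numbers may say the opposite, which is exactly why the math is worth doing before the outage rather than after.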

Sometimes the best business decision is to accept that rare outages will happen, and to have that conversation with stakeholders before your region goes down.