Tweet by Corey Quinn
"Well, were things more reliable before @awscloud?" No! Good lord no! The difference is that I could have a bad day and take down a hospital. AWS has a bad day and takes down all the hospitals.
It's the simultaneous outage of everything that's the problem.
The worst part is that I don't even have the slightest clue how to fix it. You can plan and plan and plan around this. You can build out multi-region or multi-cloud until the cows come home.
And then one of your third parties did none of this and you're just as down.
A multi-day full outage of us-east-1 will have an observable effect on the world economy.
That is not an exaggeration. I don't know how to fix any of this. I just know that we should be talking about it.
And also in slack.lastweekinaws.com we confirmed the root cause of today's outage: the Managed NAT Gateways in us-east-1 overflowed and jammed up with money.
Yeah, this doesn't work. I assure you, no federal regulation or proposed penalty is going to make @awscloud say "oh, outages are bad, we should be more careful." They already say that! Constantly!
Today's event wasn't from a lack of care or diligence.
To be explicit, I don't think AWS has done anything wrong here. This is the natural end result of their success at massive scale.