Given the total amount of money I've lost due to a single AZ being down, it has been totally worth it to NOT go multi-AZ or multi-region so far.
Multi-AZ isn't that hard, but it generally brings extra costs (one NAT gateway per AZ, etc.).
But multi-region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi-region setups, making things really complicated even if you IaCed your whole stack.
Those costs are the actual reason you are encouraged to go multi-AZ!
(I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)
Seems like there would be a conflict of interest between increasing the robustness of a single AZ (so it never goes down, or has its own redundancy) and the increased revenue from multi-AZ deployments.
What's the point of cloud if we have to manage the robustness of their own infrastructure? I can understand it for natural disasters and earthquakes, but the idea should be that a single AZ never goes down barring extraordinary circumstances. AWS should be auto-balancing, handling the downtime of a single AZ without the customer ever noticing.
It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they're in a conflict of interest to increase robustness of their data centers - "We told you to have multi-AZ deployment, not our fault".
Another way to put this: as an AWS customer, make sure you factor a 3x multiplier on all costs, plus the management overhead of a multi-AZ deployment, into your total costs.
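For a rough feel of how that multiplier plays out, here is a minimal back-of-the-envelope sketch in Python. The function name and every dollar figure in it are hypothetical placeholders, not actual AWS prices; plug in your own numbers.

```python
def multi_az_monthly_cost(compute_per_az, nat_gateway_per_az, az_count,
                          cross_az_transfer=0.0):
    """Estimate a monthly bill when the same stack is replicated in every AZ.

    All inputs are in the same currency unit per month. The model is
    deliberately naive: it assumes each AZ carries a full copy of the
    compute plus its own NAT gateway, and adds a flat cross-AZ transfer fee.
    """
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return az_count * (compute_per_az + nat_gateway_per_az) + cross_az_transfer


# Hypothetical figures, for illustration only.
single_az = multi_az_monthly_cost(100.0, 30.0, az_count=1)
three_az = multi_az_monthly_cost(100.0, 30.0, az_count=3)
print(f"single AZ: {single_az:.2f}, three AZs: {three_az:.2f}")
```

It ignores the management overhead entirely, which is arguably the bigger part of the bill.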
> What's the point of cloud if we have to manage robustness of their own infrastructure?
Worth deliberating on. I’m curious as to what the lifetime cost of ownership for an on-prem data center is relative to lifetime cost of operating in the cloud.
They would simply charge for the privilege. An EC2 'always on' or whatever option that enabled your instance to live migrate between availability zones would be a nice and expensive option.
I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.
Not sure if it's still the case, but when I was there us-east-1 was a SPOF for some services worldwide. I think if DynamoDB went down in the region it was a big, big issue.
The only single point of failure I know of for us-east-1 today is the control plane for Route53 - it's distributed and DNS queries will continue to work when us-east-1 is down (including health check based failover), but you can't make any DNS changes when us-east-1 is down.
Might be true for running stuff in different regions/AZs, but if the provisioning region is down (e.g. when deploying Lambda@Edge) one does not really have an alternative.
Good advice, though AWS still has some services that don't work completely independently: CloudFront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc.). And I wish they didn't have global-looking endpoints (like https://sts.amazonaws.com) that aren't really global or resilient.
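For the STS case specifically, one way around the global-looking endpoint is to point clients at a regional one instead. A minimal sketch in Python, assuming the standard `aws` partition (China regions use a different domain suffix, and GovCloud should be checked separately); the helper name is made up for illustration:

```python
import re

# Loose shape check for region names such as "us-east-2" or "ap-southeast-2".
_REGION_PATTERN = re.compile(r"[a-z]{2}(-[a-z]+)+-\d+")


def regional_sts_endpoint(region):
    """Build the regional STS endpoint URL for a standard-partition region.

    Hypothetical helper: it only formats the URL and does not check that
    the region exists or that STS is enabled there.
    """
    if not _REGION_PATTERN.fullmatch(region):
        raise ValueError(f"does not look like an AWS region name: {region!r}")
    return f"https://sts.{region}.amazonaws.com"


# With boto3 installed, the URL could be passed explicitly, e.g.
#   boto3.client("sts", region_name="us-east-2",
#                endpoint_url=regional_sts_endpoint("us-east-2"))
print(regional_sts_endpoint("us-east-2"))
```

The AWS SDKs and CLI also read the `AWS_STS_REGIONAL_ENDPOINTS=regional` setting, which may be the less intrusive option if you don't want to touch client construction.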
This. We run multi-AZ in more than one region, and I occasionally dream of Bezos wearing only a top hat and waistcoat, laughing maniacally while diving into a large vat of gold coins.
Not always possible - Australia (currently) only has one region, and if you're in a regulated industry (banking or government stuff) they require data to be in Australia.