Hardware Failure on EC2

znpy · on April 17, 2021

This isn't really news; if anything, it's a chance to share a thought about "the cloud" in general.

Hardware is fairly reliable nowadays (whether you're hosting your own or using the cloud provider's) but failure still happens.

The bonus that you get by using a cloud provider is that you get API machinery and tooling to automatically handle most failures.

The downside is that, depending on your level of paranoia (e.g. architecting for multi-region), redundancy gets very expensive very quickly.

But in the end... meh.

Despite what "evangelists" or "detractors" will say, there's no free lunch.

ihumanable · on April 17, 2021

Having built out big infra on AWS and now on GCP, GCP's live migrations really change the way I think about the cloud.

GCP has hardware failures, and if there's a bad enough one it impacts your instance, but you can have instances that live forever: they just get moved from physical machine to physical machine, and it mostly works.

I was very skeptical about this, having spent 5-6 years on AWS and gotten used to the idea that AWS might just blow your instance away (normally with notice, though sometimes not, if the hardware hosting it had some catastrophic failure). GCP just says, "Don't worry about it, things will get moved." The servers in question have long-lived connections, they aren't just simple stateless HTTP API servers, and it works 99.9% of the time (that last 0.1% of the time is a real pain to debug).

We still treat our machines as cattle instead of pets, but not having to constantly deal with cattle roaming off the reservation is nice.

brianwawok · on April 18, 2021

This is neat and something I really like about GCP. Surprised AWS hasn’t been able to replicate this yet.

Yes your code should handle an instance dying. But if 98% of the time you can live migrate and not... hey why not.

bradknowles · on April 22, 2021

VMware can handle live migration. I know, I worked there for a while and that was one of the key things we sold as making our stuff better than “the cloud”.

AWS has had a partnership with VMware for a while now, so I would imagine/hope that they could still do live migration, even when running in the AWS cloud.

That does make me wonder why AWS doesn’t implement the same functionality natively, though...

phamilton · on April 17, 2021

We have switched to using spot instances for our production traffic. If you diversify across instance types and AZs, there's always plenty of capacity.

The result is, as one engineer on our team says, you basically run chaos monkey but amazon pays you to do so.
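For illustration, diversifying across instance types and AZs amounts to requesting one Spot capacity pool per (instance type, AZ) pair. The sketch below only builds that list in plain Python. The instance types and zone names are hypothetical, and the dict keys are assumed to follow the shape of the `Overrides` entries in an EC2 Fleet request, so check them against the current API reference before using them.

```python
from itertools import product


def spot_overrides(instance_types, availability_zones):
    """Build one override per (instance type, AZ) pair, i.e. one per capacity pool."""
    return [
        {"InstanceType": itype, "AvailabilityZone": az}
        for itype, az in product(instance_types, availability_zones)
    ]


# Hypothetical values, chosen only for the example.
overrides = spot_overrides(
    ["m5.large", "m5a.large", "c5.large"],
    ["us-east-1a", "us-east-1b"],
)
print(len(overrides), "capacity pools requested")
```

The more pools in the list, the less likely it is that all of them are reclaimed at the same time.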

bdcravens · on April 17, 2021

I've run a ton of background processing through spot (generally a couple of hundred instances at a time) for many years, and I think I've only lost all capacity once (and that was before fleet)

cellover · on April 17, 2021

Totally off topic but this made my day: "Elastic lad balancing"

mhh__ · on April 17, 2021

I wish IBM had a few competitors because I think the industry would be interesting if the mainframe approach to reliability was also available, but I think you'd be an idiot to even dip your finger in their kool-aid as of now.

Salgat · on April 18, 2021

Is it even possible for a VM to fail over to another server in real time in the case of a hardware failure? I'm talking about the case where the failure is unpredictable and the VM can't be snapshotted before transfer.

lillecarl · on April 18, 2021

It's possible but VERY expensive and severely limited.[1] What is possible and usually acceptable, though, is shared storage: another host will spin up the machine for you and you'll have encountered an "unexpected reboot". More often than not, monitoring will tell you something is wonky with the server in advance, so you can live migrate to another host and lose "nothing", fix or retire the machine, and problem solved.

1: https://www.vmware.com/products/vsphere/fault-tolerance.html

bdcravens · on April 17, 2021

needs (2012)

doanerock · on April 17, 2021

When there is a hardware failure, do you have to rebuild, or does the original VM come back in minutes/hours/days?

bdcravens · on April 17, 2021

I believe the instance goes away. If you've persisted data to EBS, you can attach the volume to a different instance. Anything outside of EBS is lost. However, if you created an AMI (Amazon Machine Image), you can launch a new instance from that. (But it's a snapshot, not a live image, so it'll only be as up to date as it was when saved.)
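A minimal sketch of the "attach the volume to a different instance" step, assuming `ec2` is a boto3 EC2 client (`boto3.client("ec2")`), the volume is already detached, and the replacement instance is in the same AZ as the volume. The function name, IDs, and device name are hypothetical.

```python
def reattach_volume(ec2, volume_id, new_instance_id, device="/dev/sdf"):
    """Attach an existing EBS volume to a replacement instance.

    `ec2` is expected to be a boto3 EC2 client. The volume has to be in the
    'available' state and in the same AZ as the target instance.
    """
    return ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=new_instance_id,
        Device=device,
    )


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   reattach_volume(boto3.client("ec2"), "vol-0123456789abcdef0", "i-0123456789abcdef0")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub instead of a real AWS account.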

ncsurfus · on April 17, 2021

The instance isn’t terminated (gone). It’s either shut down or rebooted, depending on the type of hardware failure.

bdcravens · on April 17, 2021

It doesn't get transferred to a different physical instance, right?

christrotman · on April 18, 2021

I have never seen it migrated to a different instance type, but it would be running on a different physical server.

NikolaeVarius · on April 17, 2021

Why would anyone even care

bdcravens · on April 17, 2021

I think many would make assumptions about the state of the instance and their application, which are likely to be wrong.

ncsurfus · on April 20, 2021

Some virtualization platforms support live migrations (Hyper-V and ESXi), but this isn’t something AWS supports (afaik). Any time your instance is stopped/started or rebooted, you may end up on a different physical host.
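Since AWS normally gives notice before it stops or retires an instance, one way to avoid being surprised is to look for scheduled events. The sketch below is a pure function over a response shaped like the output of boto3's `describe_instance_status`. The sample data is hand-written with hypothetical IDs, and the real API returns `NotBefore` as a datetime rather than a string.

```python
def pending_events(status_response):
    """Return (instance_id, event_code, not_before) for every scheduled event."""
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            found.append(
                (status["InstanceId"], event["Code"], event.get("NotBefore"))
            )
    return found


# Hand-written sample in the assumed response shape (hypothetical IDs).
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "Events": [
                {"Code": "instance-retirement", "NotBefore": "2021-05-01T00:00:00Z"}
            ],
        },
        {"InstanceId": "i-0fedcba9876543210", "Events": []},
    ]
}
events = pending_events(sample)
print(events)
```

In real use the response would come from `boto3.client("ec2").describe_instance_status(...)`, and an `instance-retirement` entry is the cue to move the workload before the underlying hardware is taken away.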