Think long and hard about this one. Multiple times in my three-decade career I've seen automated mitigations make the problem worse. So really consider whether self-healing is something you need.
I built my company's in-house mobile crash reporting solution in 2014. Part of the backend runs Redis on a single server, a deliberate single point of failure. The failover process is only semi-automated: a human has to initiate it after confirming that the alerts about it being down are valid. There's also no real financial cost to an outage; at worst my company's mobile app developers are inconvenienced for a bit.
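The human-in-the-loop gate described above can be sketched roughly like this. This is a hypothetical illustration, not the actual tooling: the function names and the confirmation mechanism are assumptions, and in practice the promotion step would be something like sending REPLICAOF NO ONE to the Redis standby.

```python
# Hypothetical sketch of a semi-automated failover: the tooling does the
# mechanical work, but a human must confirm the alert is real before
# anything destructive happens. Nothing here is the author's real code.

def failover(alert_is_valid, promote_replica):
    """Promote the standby only after a human confirms the alert.

    alert_is_valid: callable returning True once an operator has verified
        the master is actually down (not a flapping alert or a network
        blip between the monitor and the master).
    promote_replica: callable that performs the actual promotion, e.g.
        issuing REPLICAOF NO ONE against the Redis standby.
    """
    if not alert_is_valid():
        # Refuse to act on an unconfirmed alert: doing nothing is the
        # safe default when the cost of downtime is low.
        return "no-op: alert not confirmed, master left alone"
    promote_replica()
    return "failover complete: standby promoted"
```

The design choice worth noticing is that the unconfirmed path does nothing at all, which is exactly the trade being argued for: when an outage is cheap, a false-positive mitigation is the bigger risk.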
In the decade the system has been operational I can count on two fingers the number of times I've had to fail over.
Despite this system having no SLA it's had better uptime than much more critical internal systems.
Conversely:
https://github.blog/2023-05-16-addressing-githubs-recent-ava...
https://github.blog/2018-10-30-oct21-post-incident-analysis/
https://www.datacenterknowledge.com/archives/2012/12/27/gith...
To be fair, GitHub operates at a much larger scale. My point is only that redundancy and automated mitigations add complexity, are almost by definition rarely exercised, and tend to fire under unforeseen circumstances.
So really, consider your SLA and the cost of an outage and balance that against the complexity you'll add by guarding against an outage.
I think my first introduction to this was circa 1998 when I had a pair of NetApps clustered into an HA configuration and one of them failed and caused the other to corrupt all its disks. Fun times. A similar thing happened around the same time with a pair of Cisco PIX firewalls. I've been leery of HA and automated failover/mitigations ever since.