I think graceful restarts are fine for servers that serve web pages, or anything else where the client will not retry a request.
For all other cases[1], I think graceful restarts are a bad practice. When you use that pattern, you come to rely on your server closing down cleanly. It makes you consider abrupt terminations (pulling a power plug, kill -9, network partitions) an edge case.
Because you separate 'shut down cleanly' from 'shut down abruptly', and because in most deployments clean shutdowns are much more frequent than abrupt ones, you end up rarely exercising the abrupt termination path. This means that your system's reaction to abrupt terminations is likely buggy or neglected. Even if it works fine because you took great care handling that case, your codebase is now more complex: you have two code paths for termination, a clean path and a dirty path.
If instead you shut down only one way (abruptly), those two cases become the same. You have reduced the problem space: your code shuts down in exactly one way, and you can be confident that your system is resilient to abrupt failures, since you exercise that shutdown path every single time you stop your server. Some call this crash-only software.
A caveat with this is that clients must be able to handle abrupt terminations. This usually means that clients must be able to retry their operations (with exponential backoff), and that each operation they perform has a timeout.
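A minimal sketch of that client-side contract (the function name, attempt counts, and delays here are illustrative, not from the post):

```python
import random
import time

def retry(op, max_attempts=5, base_delay=0.1):
    """Run op(); on failure, retry with exponential backoff plus jitter.

    `op` should enforce its own per-attempt timeout (e.g. a socket
    timeout), so that a hung server can't stall the client forever.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter
```

Idempotency matters here too: if the server may have applied the operation before dying, the retried operation either needs a deduplication key or must be safe to repeat.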
How To Know If You're Infected By The Graceful Shutdown Fever:
Assume that your servers are `kill -9`'d every 2s at random: are you confident that your system will keep working properly?
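That thought experiment can be run as a tiny chaos harness; this is a sketch, where `cmd`, the kill count, and the interval are placeholders:

```python
import signal
import subprocess
import time

def chaos_run(cmd, kills=3, interval=2.0):
    """Repeatedly start `cmd`, let it run briefly, then SIGKILL it.

    A toy stand-in for "kill -9 your servers every 2s at random": if
    your client-side checks still pass after this, your system handles
    abrupt termination. `cmd` is a placeholder for your server command.
    """
    restarts = 0
    for _ in range(kills):
        proc = subprocess.Popen(cmd)       # supervisor: (re)start the server
        time.sleep(interval)               # let it serve for a while
        proc.send_signal(signal.SIGKILL)   # abrupt stop: no cleanup runs
        proc.wait()
        restarts += 1
    return restarts
```

After the harness finishes, run your client suite and check that no data was lost or corrupted.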
[1]: another valid case I found is distributed systems that require a certain fraction of their members to be up at any time. If the nodes coordinate their restarts, they can ensure that a minimum number of nodes stay healthy.
[2]: I've spent many hours arguing this with very smart colleagues who disagree, so this is a bit of an opinionated view (still I think I'm right =])
I think in the case of web servers (which are usually stateless), guaranteeing they will survive a `kill -9` just means having some sort of supervisor service.
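For instance, a minimal sketch of such a supervisor using a systemd unit (the unit description and binary path are placeholders):

```ini
[Unit]
Description=Example crash-only web server

[Service]
ExecStart=/usr/local/bin/my-server
# Restart on any exit, including SIGKILL
Restart=always
RestartSec=1
```

With `Restart=always`, the process is restarted no matter how it died, so the abrupt-termination path is exercised on every deploy.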
However, there are times you want a graceful restart because an abrupt restart hurts your clients' UX - such as during a file upload, or when a client is running a long synchronous operation (like a credit card charge). In both cases the system may be resilient to a -9 plus retries, but it's annoying if "the world stops" because of a change to prod.
A supervisor service boils down to the same issue if there's a network partition, or if the machine just blows up, a disk fails, the kernel panics, the power supply dies, etc. The thing is, you can't guarantee that any process will survive for the duration of a request, so you might as well work as if you were living in Hell and every request had a 0.5 probability of failing (by any means).
Right, which are cases where a graceful restart will never help you. If a user is uploading a 4K video file and the server dies because god said so, well they can just try again.
If the server dies because the dev team decides to push to prod 4x in an hour and the user has to retry every time, a graceful restart lets you do that without pissing off anyone who was using that live connection.