Grace: Graceful restart for Go servers

AYBABTME · on June 11, 2015

I think graceful restarts are fine for servers that offer web pages or other things where a client will not retry a request.

For all other cases[1], I think graceful restarts are a bad practice. When you use that pattern, you come to rely on your server closing down cleanly. It makes you consider abrupt terminations (pulling a powerplug, kill -9, network partitions) an edge case.

Because you separate 'shutdown cleanly' from 'shutdown abruptly' and that in most deployments clean shutdowns are much more frequent than abrupt ones, you end up not exercising the abrupt termination case very often. This means that your system's reaction to abrupt terminations is likely buggy or neglected. If it works fine because you took great care of handling that case, the effect is that your codebase is now more complex. You have two code paths to handle termination: a clean path and a dirty path.

If instead, you shutdown only one way (abruptly), those two cases become the same. You reduced the problem space and now your code shuts down only 1 way. You can be confident that your system is resilient to abrupt failures, since you exercise that shutdown path every single time you stop your server. Some call this crash-only software.

A caveat with this is that clients must be able to handle abrupt terminations. This usually means that clients must have the ability to retry (with exp-backoff) their operations, and that each operation they perform has a timeout.

    How To Know If You're Infected By The Graceful Shutdown Fever: 
    
    Assume that your servers are `kill -9` every 2s at 
    random, are you confident that your system will keep 
    working properly?

[1]: another valid case I found was in distributed systems that require a certain part of their members to be up at any time. If the nodes coordinate their restarts, they can ensure that a minimum number of nodes are kept healthy. [2]: I've spent many hours arguing this with very smart colleagues who disagree, so this is a bit of an opinionated view (still I think I'm right =])

nemothekid · on June 11, 2015

I think in the case of web servers (that are most times stateless), guaranteeing it will survive a "kill -9" just means having some sort of supervisor service.

However there are times you want a graceful restart because an abrupt restart affects your client's UX - such as during a file upload, or where a client may be running a long running synchronous operation (like a credit card charge). In both cases the system may be resilient against a -9 and retries, but annoying if "the world stops" because of a change to prod.

AYBABTME · on June 11, 2015

A supervisor service boils down to the same issue if there's a network partition or if the machine just blows up, a disk fails, the kernel panics, the power supply dies, etc. The thing is you can't guarantee that any process will survive all a request, so you might as well work as if you were living in Hell and every request has a 0.5 probability of failing (by any mean).

nemothekid · on June 12, 2015

Right, which are cases where a graceful restart will never help you. If a user is uploading a 4K video file and the server dies because god said so, well they can just try again.

If the server dies because the dev team decides to push to prod 4x in an hour and the user has to retry every time, a graceful restart lets you do that without pissing off anyone who was using that live connection.

AYBABTME · on June 12, 2015

Agreed, that's what I meant in:

    I think graceful restarts are fine for servers that offer 
    web pages or other things where a client will not retry a
    request.

+ when retrying a request is not practical.

zimbatm · on June 10, 2015

A couple of my own also written in Go. The main difference is that control is external which makes them language-agnostic:

* https://github.com/zimbatm/socketmaster - very simple

* https://github.com/pusher/crank - adds a control socket, dynamic config and more fine-grained controlled restarts

Both are designed to sit on top of your process manager of choice. Unlike self-forking process solutions they don't change PID so it works great with systemd and upstart.

tobz · on June 10, 2015

Just commenting to say I used this library for a project at work to great effect. I tried rcrowley/goagain, which had issues and felt like I was shoehorning everything in. I tried another that I can't remember.

There's a few that are drop-in replacements for http.ListenAndServe, which is great, until you want to do something custom, then it gets a lot more kludgy. Grace is not kludgy.

And, most importantly, it definitely works. Was able to hammer the bejeezus out of my daemon, restarting it willy nilly, without issue or fanfare.

vezzy-fnord · on June 10, 2015

Additionally it is implemented using the same API as systemd providing socket activation compatibility to also provide lazy activation of the server.

Couldn't immediately tell from reading the source - but is this to say that it uses the sd-* APIs directly, that it implements behavior compatible with sd_listen_fds, or some other form of "socket activation" like superserver or fd-holding?

daakus · on June 10, 2015

Largely it just means that it uses the `LISTEN_FDS` environment variable and passes the sockets down starting at fd=3 (more here: https://github.com/facebookgo/grace/blob/master/gracenet/net...). Admittedly our systemd use case is no more and I haven't tested with it recently so I want to clarify I should probably re-test and make sure it still works.

shockzzz · on June 10, 2015

This line is indicative of what I hate about Golang code:

a := newApp(servers)

https://github.com/facebookgo/grace/blob/master/gracehttp/ht...

EDIT: read my comment below where I elaborate

https://news.ycombinator.com/item?id=9695880

ejcx · on June 10, 2015

Why? Is this really a golang thing?

If you are going to be creating these complex structures why not using a factory function for it? If there were many creations of this type it could severely increase the bulk of the code....

bennyp101 · on June 10, 2015

But the 's' didn't bother you?! for _, s := range a.sds {

rylee · on June 10, 2015

That honestly looks like keyboard smashing. I love it.

teraflop · on June 10, 2015

Your comment could use a lot more elaboration.

Are you saying you dislike single-character variable names, and that you see them frequently in Go?

nvader · on June 10, 2015

I'm baffled by this comment. What does this line do that you hate so much?

shockzzz · on June 10, 2015

It's not what this line does, but rather what it doesn't do.

In the context of reading someone else's source code, the variable name "a" is inadequate, and I've seen it in many Golang codebases. It's like intentionally obtuse code.

Maybe I haven't read enough Go code in the wild, but it really feels like the C days. Yes, it's GC'd, and that makes a difference. Yet I can't help but wonder how much easier it is to read code from the Python or Ruby communities. I'd even go as far to say that it's easier to read code from the Javascript community.

AYBABTME · on June 11, 2015

`a` is not the bestest name in that func, but it's internal (not visible to clients) and it's fairly obvious what it does in the context. Reading Go code everyday, I'm quite used to short variable names like that and I don't find it confusing. Although I did read code in the stdlib that's all bunch of 1-letter vars and was having a hard time following, and in that example you point out, I'd have prefered `app` to `a`.

cyri · on June 10, 2015

Maybe this presentation may help you:

http://talks.golang.org/2014/names.slide#1

shockzzz · on June 10, 2015

Doesn't make short variable names right. How do long variable names obscure code!! That's insane. There are many, many ways to obscure code, and there many ways to make it clear. Long-names is one of them.

ejcx · on June 10, 2015

Overly long variable names is just as much of a hinderence as variable names that don't give enough info.

Your "complaint" of go actually have nothing to do with Go itself...

In my opinion, a is an great variable name. Especially when it is used after calling a function like newApp(...). I may have picked a different name for newApp but it was still clear to me what it did.

laumars · on June 10, 2015

I think calling it "a great variable name" is being a touch excessive. You are right that that it's clear from the code what it is, but it's also definitely not a great variable name.

Personally I think single character variable names are over used in Go as well, but at least they're generally restricted to the smallest of functions (eg struct methods), or instances where a longer name wouldn't be more descriptive (eg inside for loops (for i := 0), or b[], byte arrays.).

However in this specific example, I think "a := NewApp(server)" is both readable and overly terse. So I do find myself agreeing with the OP in their example.

ejcx · on June 11, 2015

Yeah I understand. It goes both ways. It's an opinion. "Great" probably wasn't a great word to use, but I personally don't have any issue with the line of code since it was easy to understand.

I've looked at a lot of other people's Go code and I have had no issue with 1 character variable names, but maybe that is just me. The methods/functions is generally small enough that it is obvious to me what is happening.

skj · on June 11, 2015

> You are right that that it's clear from the code what it is, but it's also definitely not a great variable name.

I have two metrics for variable names. First, is it clear what the variable is? You say yes. Second, is it not obnoxious to type? Clearly not.

So, at least by my metrics, it's a great variable name.

jzelinskie · on June 10, 2015

I've been using https://github.com/tylerb/graceful for a few years now. At the time Go (1.3 IIRC) added to ability to actually access the internal connection state for connections created via net/http, it was the only library to actually do the right thing.

tobz · on June 10, 2015

FWIW, 'grace' provides graceful restarting (and graceful shutdown) while 'graceful' looks to simply provide graceful shutdown.

epoxyhockey · on June 10, 2015

Also worth a look: https://github.com/fvbock/endless

It allows for graceful restarts after replacing your binary. I'm not sure if that is a feature of grace?

stevewepay · on June 11, 2015

What is the technique to shutdown and restart a server without losing connections? Is it just making sure the connection isn't closed or reset before shutting down the server?

svieira · on June 11, 2015

A couple of different ways that I have heard of:

* SO_REUSEPORT - not reusing the socket, but allowing multiple sockets to bind to a single port. [1]

* Fork / Exec - have a parent process signal the child process to terminate after it has spun up a new version and the new version is accepting connections [2]

  [1]: https://lwn.net/Articles/542629/
  [2]: https://news.ycombinator.com/item?id=9694339

GauntletWizard · on June 11, 2015

You don't just need to preserve the socket; You also need to keep both the new and old server in memory for sufficient time that the existing connections die.

Perhaps there's a kernel-level API that could be added to allow sockets to be snatched or handed over to a new process. That is, honestly, probably the more apropos solution. The fact that sockets act as a kind of lock is an implementation detail.

AYBABTME · on June 11, 2015

Then the new born server would have to know what half-done work has been done/exchanged with the client and recover from there. This seems like an impossible challenge to me, or at least much more complex than sharing the socket while the old server finishes serving its ongoing connections while accepting no new ones.

GauntletWizard · on June 11, 2015

I think you've misunderstood me; You don't need/want to share any connections that the old process has, only the listen socket. You 'steal' the socket, and then any 'accept' calls on the previous owners FD behave as if there were no calls, while new connections show up on the thief's socket FD.

AYBABTME · on June 11, 2015

I think that's how it's currently done, but the procedure is issued by the dying server. But yes, maybe that'd be cleaner if it was a proper API call initiated by the new born process.

mholt · on June 11, 2015

I see some syscall signal stuff in there - how well does this run on Windows?

twotwotwo · on June 11, 2015

I'd be happy to be corrected, but my guess is it wouldn't compile: running `GOOS=windows godoc syscall` I don't see SIGUSR2 or Kill defined. Maybe an enterprising person could port it, if the semantics are similar enough between Windows and Unices for those APIs that exist for both (os.StartProcess, CloseOnExec, etc.).