Author of the post (and the engineer who did the work) here. There are ways to d...

drewg123 · on Aug 1, 2018

Netflix kernel engineer here.. We use FreeBSD's async sendfile() and not aio, so it would be a bit harder for us to fix open latency, since we're not using aio.

I had not thought about open latency being an issue, that's fascinating. Looking at one of our busy 100G servers w/NVME storage, I see an openat syscall latency of no more than 8ms sampling over a few minute period, with most everything being below 65us. However, the workload is probably different than yours (more longer lived connections, fewer but larger files, fewer opens in general). Eg, we probably don't have nearly the "long tail" issue you do..

khc · on Aug 1, 2018

right I suspect you have way fewer files than we do and everything is in the dentry cache. Pretty sure that most of your files are bigger than 60KB too :-) (which is our p90)

the8472 · on Aug 1, 2018

At my job we have to open many small files from NFS. The latency of open() absolutely murders sequential performance (>80 seconds just to open a scene descriptions). Prewawrming the fairly shortlived NFS access caches in parallel evaporates most of the performance penalty.

drewg123 · on Aug 1, 2018

Wow, it is a different world :)

khc · on Aug 1, 2018

our machines also run many different services (the CDN product is just one of them) and isolating I/O from different products is difficult. I'd also love to have NVMe.

solarengineer · on Aug 2, 2018

Any chance you could use dtrace to isolate process specific I/O?

khc · on Aug 2, 2018

"isolate" here I meant prevent one process's IO from affecting another one

pg314 · on Aug 2, 2018

Have you looked into using something like SQLite instead of the filesystem? [1]

[1] https://www.sqlite.org/fasterthanfs.html

Kalium · on Aug 2, 2018

SQLite makes a ton of sense for systems that don't need to worry about concurrent writes. It's possible that a CDN's cache system might need to concern itself with concurrent writes.

khc · on Aug 2, 2018

Other than the concurrent write that another comment mentioned, looks like this test is done with a data set that fits entirely in RAM. I wish we have enough RAM for the entire internet :-(

pram · on Aug 2, 2018

TIL what a 'dentry' is!

crest · on Aug 2, 2018

A few more aio_* syscalls would really simplify things a lot. I suspect the most important missing aio_* syscalls are aio_open()/close() and aio_stat(). The semantics for async open/close would be tricky.

ioquatix · on Aug 2, 2018

8ms actually seems quite slow at that level.

drewg123 · on Aug 2, 2018

Yes, that's the absolute worst one out of a large sample size (far less than a fraction of a percent). I suspect that openat() was particularly unlucky, and interrupted multiple times..

shub · on Aug 2, 2018

IOCP doesn't do it[1]. Well, if it does then it's not documented. You can post custom completion packets so at first glance it looks easy to make open/close be async...I think there is probably a good reason why NT won't do that for you.

That's pretty awesome though, that you have to worry about latency of open().

[1] https://docs.microsoft.com/en-us/windows/desktop/fileio/i-o-...

rossy · on Aug 2, 2018

Nope, CreateFile and NtCreateFile are synchronous, only reading and writing are asynchronous.

derefr · on Aug 2, 2018

Why open(2) and close(2) all the time? If I hit this problem—and hacking on Nginx itself were an option—then I'd make the following Nginx changes:

1. at startup, before threads are spawned, find all static files dirs referenced in the config, and walk them, finding all the files in them, and open handles to all of those files, putting them into a hash-map keyed by path that will then be visible to all spawned threads;

2. in the code for reading a static file, replace the call to open(2) with a look up against the shared file-descriptor from the pool, and then a call to reopen(2) to get a separately seekable userland handle to the same kernel FD (i.e. to make the shared FD into a thread-specific FD, without having to hit the disk or even the VFS logic.)

3. (optionally) add fs-notify logic to discover new files added to the static dirs, and—thread-safely!—open them, and add them to the shared pool.

This assumes there aren't that many static files (say, less than a million.) If there were magnitudes more than that, in-kernel latency of modifying a huge kernel-side FD table might become a problem. At that point, I'd maybe consider simply partitioning the static file set across several Nginx processes on the same machine (similar to partitioned tables living in the same DBMS instance); and then, if even further scaling is needed, distributing those shards on a hash-ring and having a dumb+fast HTTP load-balancer [e.g. HAProxy] hash the requested path and route to those ring-nodes. (But at that point it you're somewhat reinventing what a clustered filesystem like GlusterFS does, so it might make more sense to just make the "TCP load-balancing" part be a regular Nginx LB layer, and then just mount a clustered filesystem to each machine in read-only-indefinite-cache mode. Then you've got a cheap, stateless Nginx layer, and a separate SAN layer for hosting the clustered filesystem, where your SSDs now live.)

khc · on Aug 2, 2018

I think you are underestimating cloudflare's scale here. Obviously we do shard across many machines but each one still has many more files than what's reasonable to keep open all the time.

lossolo · on Aug 5, 2018

This will not scale at CF and is not compatible with their current architecture.

kentonv · on Aug 2, 2018

The use case here isn't static files, it's an HTTP cache.

jasode · on Aug 2, 2018

>Does NT allow you to open a file in an async fashion?

No.

As a side note, there was an interesting old hn subthread about async disk i/o of philosophies Windows NT vs Linux:

https://news.ycombinator.com/item?id=11865760