Hacker News

"Context switching is expensive. My rule of thumb is that it'll cost you about 30µs of CPU overhead. This seems to be a good worst-case approximation. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads."

http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma...

On 32-bit systems, you can also easily run out of address space for your thread stacks.



I don't trust those measurements. They never fully separated what is due to cache effects from what is due to context switching. You'd pay heavy cache penalties if you switched between client contexts after select() returns, too. So just because handing control back and forth on a futex shows at most 30µs of wasted CPU doesn't mean you wouldn't waste as much with select() as well.

Also, an I/O-bound thread isn't just spinning on a futex aimlessly; blocking on I/O goes through a different mechanism, so they should really have benchmarked a more characteristic workload.

Speaking of characteristic workloads, they should probably also have measured on a tickless kernel, since I saw they complained about time quanta and HZ=100. Recent kernels are tickless, so they'll behave differently (might even be worse).

> On 32-bit systems, you can also easily run out of address space for your thread stacks.

Well, don't run large servers with that many threads on 32-bit systems ;-) Many database vendors don't even package for, or support, 32-bit versions of Linux.

Sorry, I haven't bought into the whole "async is always better" trend. A (former?) senior Google engineer, Paul Tyma, agrees with me:

http://www.mailinator.com/tymaPaulMultithreaded.pdf

The async/select pattern is usually good where there is very little business logic: a router, a proxy, a simple web server, and so on. In a large application, having a giant dispatch call at the center of it, with callbacks branching out, is not a healthy pattern.


In those cases I like coroutines/user-space threading. It gives you the reduced cost of having a single thread (or a few) without the heavy toll of callbacks.


Just to expand on the timing and cost argument:

When you have 10,000 tasks and about 8 cores (give or take a few), the number of context switches is very large. Switching in the kernel happens mostly at the system-call boundary of blocking I/O and requires the scheduler to decide which thread to wake up next and then change the running task.

This can be seen in the function context_switch in https://github.com/torvalds/linux/blob/master/kernel/sched/c... (and that's without the arch-dependent components); it can hardly be compared in complexity and effort to swapping 4 to 8 registers in user space.

The above still doesn't include any changes to the TLB and memory-protection tables, as I assume the OS optimizes those away when it switches between two threads of the same program. An optimization I'm not sure happens normally.



