The cost of bouncing a cache line depends less on the number of cores and more on the number of sockets. Since the future of desktops is clearly 1-socket machines, BFS is optimized for that case, and testing it on a 2-socket machine isn't representative. IOW, today's expensive 8-core 2S workstation is not a good proxy for tomorrow's cheap 8-core 1S desktop.
Interesting point -- I've never written a scheduler. Is it really true that writing a scheduler for the multi-socket, multi-core case is that much more complicated than focusing on single-socket, multi-core? Seems possible to me. However, doesn't the existence of hyperthreading undercut the idea that the single-socket case is simple? Even for single-socket, multi-core, you need to account for the fact that the virtual processors are not identical, since two hyperthreads on the same core share that core's execution resources.
I don't think it's a matter of difficulty so much as design tradeoffs. For a single socket you can probably get away with a single runqueue; there will be contention on it, but it won't be that expensive. For multi-socket you want multiple runqueues; this scales better, but you then have to load-balance threads between the queues.
And there are affinity issues to consider as well.
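To make the tradeoff above concrete, here is a toy sketch in Python. It is not how BFS or the mainline scheduler is actually written; the class names (`SingleRunqueue`, `PerSocketRunqueues`) and the "move from busiest to idlest" balancing rule are made up for illustration. The point is only that the single queue needs no balancing step, while per-socket queues need one, and every migration it performs is a place where affinity (a task landing on a cache-cold socket) could cost you.

```python
from collections import deque


class SingleRunqueue:
    """Hypothetical model: one global queue shared by every CPU.

    In a real kernel this queue would sit behind a lock, so all CPUs
    contend on it; that contention is cheap within one socket.
    """

    def __init__(self):
        self.queue = deque()

    def enqueue(self, task):
        self.queue.append(task)

    def pick_next(self, cpu):
        # Any CPU takes the oldest waiting task; no balancing is needed.
        return self.queue.popleft() if self.queue else None


class PerSocketRunqueues:
    """Hypothetical model: one queue per socket plus a balancing pass."""

    def __init__(self, num_sockets):
        self.queues = [deque() for _ in range(num_sockets)]

    def enqueue(self, task, socket):
        self.queues[socket].append(task)

    def pick_next(self, socket):
        # A CPU only looks at its own socket's queue.
        q = self.queues[socket]
        return q.popleft() if q else None

    def balance(self):
        """Move tasks from the busiest queue to the idlest one until
        queue lengths differ by at most one. Returns the number of
        migrations, each of which stands in for a cross-socket move."""
        moved = 0
        while True:
            busiest = max(self.queues, key=len)
            idlest = min(self.queues, key=len)
            if len(busiest) - len(idlest) <= 1:
                return moved
            idlest.append(busiest.pop())
            moved += 1
```

For example, queueing six tasks on socket 0 of a two-socket `PerSocketRunqueues` and calling `balance()` migrates three of them to socket 1. A real balancer would weigh each such move against the lost cache warmth, which is where the affinity issues come in.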
Writing a scheduler is not for the faint of heart.
What bothers me about this whole affair is that the egos of both parties seem to preclude them from working together to get to a 'best of breed' situation.