- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.
- Every page fault in any mapping must also halt all of the threads.
Worse, since the page tables are getting munged, some or all of the TLB cache is getting flushed every time, on every processor.
I'm not sure of the details, but this hypothesis should be directly testable. IIRC, there are some hardware performance counters for time spent waiting on TLB lookups.
Addendum: One other possibility is that the mere act of extending the working set size (of the address space) is blowing the TLB cache.
- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.
- Every page fault in any mapping must also halt all of the threads.
These are certainly not the case on Linux, and I'd imagine not on other OSes, as it would be terrible for performance.
Each mapping (i.e., each mmap(2) call) is synchronized with other paths that read the process's memory maps, such as other mmap(2), munmap(2), etc. syscalls, and with page faults being handled for other threads (i.e., mmap(2) takes the mmap_sem semaphore for writing). Running threads are not halted. The page tables are not touched at all during mmap unless MAP_POPULATE is passed. (Linux delays actual population of the page tables until each page is accessed.)
The page fault handler takes mmap_sem for reading the mappings (synchronizing with mmap(2), etc., but allowing multiple page fault handlers to proceed concurrently), and takes page_table_lock only for the very short period when it actually updates the page tables.
Again, running threads are not halted. The active page tables are updated while other cores may be accessing them. This must be done carefully to avoid spurious faults, but it is certainly feasible.
In fact, at least on x86, handling page faults does not require a TLB flush. The TLB does not cache non-present page table entries, and taking a page fault invalidates the entry that caused the fault, if one existed.
There are plenty of places here that may cause contention, but nothing nearly so bad as halting execution.
munmap will be rather noisy. It involves tons of invalidations and a TLB flush. I wouldn't be surprised if a good bit of performance could be regained by avoiding munmapping the file until the process exits.
So, I did some testing with 'perf'. This is on an older Intel processor, 2-cores with hyperthreading. These were all done on the same set of files, using the binary release of ripgrep v0.1.16 on Debian Jessie:
At -j1, --mmap: 95k dTLB load-misses, 2800 page faults
At -j2, --mmap: 170k dTLB load-misses, 2840 page faults
At -j3, --mmap: 230k dTLB load-misses, 2800 page faults
As the number of threads goes up, the total amount of TLB pressure goes up in both cases. These results are consistent with a number of TLB cache flushes proportional to N_threads * M_mappings + C for the --mmap case, and N_threads * M_buffer_perthread + D for the --no-mmap case. I think that does support the model that each thread's mmap adds pressure to all of the threads' TLBs.
I did some experimentation last night as well. I suspected a lot of the cost came from unmapping the files, and from the invalidations and TLB shootdowns required to do so.
So eliding munmap made a big difference, but it was still not enough to beat out reading the files. perf shows that the mmap syscall itself is just too expensive (this is --mmap, with munmap):