- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.
- Every page fault in any mapping must also halt all of the threads.
Worse, since the page tables are getting munged, some or all of the TLB cache is getting flushed every time, on every processor.
I'm not sure of the details, but this hypothesis should be directly testable. IIRC, there are some hardware performance counters for time spent waiting on TLB lookups.
Addendum: One other possibility is that the mere act of extending the working set size (of the address space) is blowing the TLB cache.
- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.
- Every page fault in any mapping must also halt all of the threads.
These are certainly not the case on Linux, and I'd imagine not on other OSes, as it would be terrible for performance.
Each mapping (i.e., each mmap(2) call) is synchronized with other paths that read the process's memory maps, such as other mmap(2), munmap(2), etc. syscalls, and with page faults being handled for other threads (i.e., mmap(2) takes the mmap_sem semaphore for writing). Running threads are not halted. The page tables are not touched at all during mmap unless MAP_POPULATE is passed. (Linux delays actual population of the page tables until each page is accessed.)
The page fault handler takes mmap_sem for reading the mappings (synchronizing with mmap(2), etc., but allowing multiple page fault handlers to proceed concurrently), and takes page_table_lock only for the very short period when it actually updates the page tables.
Again, running threads are not halted. The active page tables are updated while other cores may be accessing them. This must be done carefully to avoid spurious faults, but it is certainly feasible.
In fact, at least on x86, handling page faults does not require a TLB flush. The TLB does not cache non-present page table entries, and taking a page fault invalidates the entry that caused the fault, if one existed.
There are plenty of places here that may cause contention, but nothing nearly so bad as halting execution.
munmap will be rather noisy. It involves tons of invalidations and a TLB flush. I wouldn't be surprised if a good bit of performance could be regained by avoiding munmapping the file until the process exits.
So, I did some testing with 'perf'. This is on an older Intel processor, 2-cores with hyperthreading. These were all done on the same set of files, using the binary release of ripgrep v0.1.16 on Debian Jessie:
At -j1, --mmap: 95k dTLB load-misses, 2800 page faults
At -j2, --mmap: 170k dTLB load-misses, 2840 page faults
At -j3, --mmap: 230k dTLB load-misses, 2800 page faults
As the number of threads goes up, the total amount of TLB pressure goes up in both cases. These results are consistent with a number of TLB cache flushes proportional to N_threads * M_mappings + C for the --mmap case, and N_threads * M_buffer_perthread + D for the --no-mmap case. I think that does support the model that each thread's mmap adds pressure to all of the threads' TLBs.
I did some experimentation last night as well. I suspected a lot of the cost came from unmapping the files, and from the invalidations and TLB shootdowns required to do so.
So eliding munmap made a big difference, but it was still not enough to beat out reading the files. perf shows that the mmap syscall itself is just too expensive (this is --mmap, with munmap):