It’s a bit more fundamental than just "using it badly." The real tension lies in whether a language's safety invariants force a memory layout that is inherently at odds with the CPU cache hierarchy.
In low-latency systems, the true "tax" is often the loss of determinism. If I have to sacrifice a cache-friendly structure or introduce indirection just to satisfy a borrow checker's static analysis, the performance game is already lost, regardless of how "well" I use the language.
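To make the layout point concrete, here's a minimal sketch (hypothetical `Tick` type, not from the actual bridge) contrasting a contiguous hot-path structure with the pointer indirection you're sometimes pushed into. The flat version walks sequential cache lines; the boxed version chases a heap pointer per element, so the access pattern is no longer deterministic:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical hot-path record, stored by value so a linear scan
// touches sequential cache lines and the prefetcher can keep up.
struct Tick {
    int64_t ts_ns;
    double  bid;
    double  ask;
};

// Cache-friendly: one allocation, sequential access.
double spread_sum_flat(const std::vector<Tick>& ticks) {
    double sum = 0.0;
    for (const Tick& t : ticks) sum += t.ask - t.bid;
    return sum;
}

// The indirection to avoid: each element is its own heap allocation,
// so every iteration dereferences a pointer to an unpredictable
// address and likely eats a cache miss.
double spread_sum_boxed(const std::vector<std::unique_ptr<Tick>>& ticks) {
    double sum = 0.0;
    for (const auto& t : ticks) sum += t->ask - t->bid;
    return sum;
}
```

Both functions compute the same result; the difference only shows up in cache behavior under load, which is exactly the part static safety analysis doesn't see.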
To give a concrete example: I previously built a high-frequency bridge for MT4 on a strict Modern C++ stack. After the initial warm-up, the working set settled from 13.6MB down to a stable 11.0MB and stayed there over a 7-day continuous stress test.
That 2.6MB drop was simply the OS reclaiming one-time initialization overhead: manual memory management via custom pool allocators kept heap fragmentation from "pinning" that memory as resident. You don't get that level of long-term residency stability just by "using a language well"; you get it from a toolchain that lets you treat the hardware as the ultimate source of truth.
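The pool idea is simple enough to sketch. This is not the bridge's actual allocator, just a minimal fixed-size pool illustrating the principle: reserve all slots in one up-front allocation, then serve alloc/free from a free list, so steady-state churn never touches the general-purpose heap and can't fragment it:

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-capacity slot pool (illustrative, single-threaded).
// One allocation at startup; alloc/release are just free-list
// push/pop, so the general heap sees no churn after warm-up.
template <typename T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity)
        : storage_(capacity) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(&storage_[i]);
    }

    // Returns nullptr when exhausted; the caller chooses the policy
    // (in a low-latency path, exhaustion is a sizing bug, not an
    // excuse to fall back to the heap).
    T* alloc() {
        if (free_.empty()) return nullptr;
        T* slot = free_.back();
        free_.pop_back();
        return slot;
    }

    void release(T* slot) { free_.push_back(slot); }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<T> storage_;  // one up-front block, fixed for process lifetime
    std::vector<T*> free_;    // LIFO free list: freshly freed slots stay cache-warm
};
```

The LIFO free list is a deliberate choice: the most recently released slot is the most likely to still be in cache when it's handed out again.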