
> Skylake-X L3 latency is ~20ns. So if you build an SPSC queue between them

The latency is mostly about initial cache misses. There is no reason to take the time to write out a buffer of samples to memory, only to have another CPU access them with a cache miss. One of the many things you are missing here is prefetching. Instructions will be heavily prefetched, as will samples when accessed in any sort of linear fashion.

Also, you can't explicitly use caches or send data between them; that is going to be up to the CPU, and it will use the whole cache hierarchy.

> You take an absolutist view

Everything dealing with performance needs to be measured, but I have a good idea of how things work so I know what to prioritize and try first. Architecture is really the key to these things and in my replies I've illustrated why.

> Maybe I know about high performance code, maybe I don't

It sounds like you have read enough, but haven't necessarily gone through lots of optimizations and reconciled what you know with the results of profiling. Understanding modern CPUs is good for understanding why results happen, but less so for estimating exactly what the results will be when going in blind.

> maybe you do, maybe you don't

I've got a decent handle on it at this point.



If you were as good as you claim, you would have directly answered my argument instead of hitting a strawman for five paragraphs.

Your experience led to overconfidence and you identified a ridiculous bottleneck for the problem domain. This is complicated and FPU heavy code running on few pieces of tiny data. And yes, riddled with LUTs. The latency cost you're worried about is in the noise.

Instead of doing some back of the envelope calculations and realizing your mistake, you double down, handwave and smugly attack me.

Your conclusions are bullshit, as is your evaluation of my experience. For anyone else that happens to be reading, I suggest taking a look through the source of a few plugins and judging for yourself.

https://vcvrack.com/plugins.html


There is no need to be upset, there is no real finality here, everything has to be measured.

That being said the LUTs would follow the same pattern as execution - all threads would use them and if they are a part of the executable they don't change. This combined with prefetching and out of order instructions means that their latency is likely to be hidden by the cache.

New data coming through, however, would be transformed, creating more new data. While the instructions and LUTs aren't changing, the new data created by each transformation can be kept local, so it doesn't incur the write-back penalties and cache misses that come from allocating new memory, writing to it and eventually getting it to another CPU.

If the same CPU is working on the same memory buffer, there is no need to allocate buffers for every filter or to manage the lifetimes and ownership of various buffers.
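
A rough sketch of what I mean by keeping one buffer on one CPU, in C++. The Stage type, run_chain and the two stages are hypothetical names made up for illustration, not taken from any plugin:

```cpp
#include <cstddef>

// Hypothetical stage type: transforms a block of samples in place.
using Stage = void (*)(float*, std::size_t);

// Run every stage over the same buffer, one after another, on the
// calling thread. No buffer is allocated or handed to another core.
void run_chain(const Stage* stages, std::size_t stage_count,
               float* block, std::size_t n) {
    for (std::size_t s = 0; s < stage_count; ++s) {
        stages[s](block, n);
    }
}

// Two illustrative stages (assumptions for the example only).
void gain_half(float* block, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) block[i] *= 0.5f;
}

void hard_clip(float* block, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (block[i] > 1.0f) block[i] = 1.0f;
        if (block[i] < -1.0f) block[i] = -1.0f;
    }
}
```

Whether this actually beats handing blocks between cores still has to be measured.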


If you took time to read the code linked, you'd notice two things:

1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...

2) The work done per sample is substantial with nested loops, trig functions and hard to vectorize patterns
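
To make point 1 concrete, here is a minimal sketch of a feedback delay in C++. The class name and parameters are hypothetical and not lifted from any of the linked plugins. Each output depends on an earlier output, so samples can't be processed independently:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical feedback delay: y[n] = x[n] + feedback * y[n - delay].
// delay_samples must be at least 1.
class FeedbackDelay {
public:
    FeedbackDelay(std::size_t delay_samples, float feedback)
        : line_(delay_samples, 0.0f), feedback_(feedback) {}

    // Process one sample. The result depends on state written by
    // earlier calls, so calls must happen in order.
    float process(float x) {
        const float y = x + feedback_ * line_[pos_];
        line_[pos_] = y;                   // remember the output for later
        pos_ = (pos_ + 1) % line_.size();  // advance the circular delay line
        return y;
    }

private:
    std::vector<float> line_;  // the last `delay_samples` outputs
    float feedback_;
    std::size_t pos_ = 0;
};
```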

So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.
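
The back of the envelope version, as a C++ sketch. The ~20ns figure is the one quoted upthread. The 48 kHz sample rate, the 64-sample block and the four misses per hand-off are assumptions picked for illustration, and the function names are made up:

```cpp
// Time available per sample at a given sample rate, in nanoseconds.
constexpr double sample_budget_ns(double sample_rate_hz) {
    return 1e9 / sample_rate_hz;
}

// Hand-off latency amortized per sample, assuming a block costs
// `misses` cache misses of `miss_ns` each and nothing is prefetched.
constexpr double handoff_ns_per_sample(double miss_ns, int misses, int block_size) {
    return miss_ns * misses / block_size;
}

// Fraction of the per-sample time budget spent on the hand-off.
constexpr double handoff_fraction(double miss_ns, int misses, int block_size,
                                  double sample_rate_hz) {
    return handoff_ns_per_sample(miss_ns, misses, block_size) /
           sample_budget_ns(sample_rate_hz);
}
```

With those assumed numbers, handoff_fraction(20.0, 4, 64, 48000.0) comes out well under 0.01% of the per-sample budget.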

Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
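
For reference, a minimal sketch of the kind of SPSC ring buffer I mean, in C++. SpscRing is a hypothetical name and this is one common textbook layout, not code from any particular library. The storage is a fixed-size array inside the object, so push and pop never touch the allocator:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer sketch.
// One slot is left unused to tell "full" from "empty", so it holds N - 1 items.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "need at least two slots");

public:
    // Producer thread only. Returns false if the ring is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;
        slots_[head] = value;
        head_.store(next, std::memory_order_release);  // publish the write
        return true;
    }

    // Consumer thread only. Returns false if the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = slots_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, N> slots_{};          // preallocated storage
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```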

I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.


> L3 latency you're so worried about

I'm not sure where this is coming from. If one CPU is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs through shared cache or memory, but either way it isn't necessary.

Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.

The difference it makes is up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency is dependent on how many different transformations exist, when the reality is that the number of cores that can be utilized will come down to the number of groups of data that can be dealt with without dependencies.

> SPSC ring buffers

That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.


Are you claiming that the act of writing bytes to a ring buffer is memory allocation? In that case I misunderstood what you were saying and it was indeed a straw man.

In any case, we're clearly not going to find common ground here.



