If you were as good as you claim, you would have directly answered my argument i...

BubRoss · on June 13, 2019

There is no need to be upset. There is no real finality here; everything has to be measured.

That being said, the LUTs would follow the same pattern as execution: all threads would use them, and if they are part of the executable they don't change. This, combined with prefetching and out-of-order execution, means that their latency is likely to be hidden by the cache.

New data coming through, however, would be transformed, creating more new data. While the instructions and LUTs aren't changing, the new data created by each transformation can be kept local so that it doesn't incur the write-back penalties and cache misses that come from allocating new memory, writing to it, and eventually getting it to another CPU.

If the same CPU is working on the same memory buffer, there is no need to allocate a buffer for every filter or to manage the lifetimes and ownership of various buffers.
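To make the buffer-reuse point concrete, here is a minimal sketch, assuming two made-up filters (`gain` and `hard_clip`) and a hypothetical `run_chain` helper; none of these names come from the code under discussion. Every filter mutates the same preallocated buffer, so no per-filter allocation or ownership handoff is needed.

```python
from array import array


def gain(buf, factor):
    """Scale every sample in place (hypothetical filter)."""
    for i in range(len(buf)):
        buf[i] *= factor


def hard_clip(buf, limit):
    """Clamp every sample to [-limit, limit] in place (hypothetical filter)."""
    for i in range(len(buf)):
        buf[i] = max(-limit, min(limit, buf[i]))


def run_chain(buf, chain):
    """Apply each (filter, argument) pair, in order, to the same buffer."""
    for fn, arg in chain:
        fn(buf, arg)
    return buf  # the same object that was passed in, not a copy


# One buffer is allocated up front and reused by every stage.
block = array("d", [0.1, -0.4, 0.9, -1.2])
run_chain(block, [(gain, 2.0), (hard_clip, 1.0)])
```

Whether this actually beats handing blocks to other cores is, as said above, something that has to be measured.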

reitzensteinm · on June 14, 2019

If you took time to read the code linked, you'd notice two things:

1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...

2) The work done per sample is substantial, with nested loops, trig functions and hard-to-vectorize patterns.
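As a rough illustration of point 1, here is a sketch of a feedback delay, assuming a hypothetical `FeedbackDelay` class that is not taken from the linked code. Each output sample depends on an output produced `delay_samples` earlier, and that state persists across blocks, so samples cannot be processed independently or in arbitrary order.

```python
class FeedbackDelay:
    """y[n] = x[n] + feedback * y[n - delay_samples] (illustrative only)."""

    def __init__(self, delay_samples, feedback):
        # Circular delay line holding past outputs; it persists across blocks.
        self.line = [0.0] * delay_samples
        self.pos = 0
        self.feedback = feedback

    def process(self, block):
        out = []
        for x in block:
            # The slot at self.pos holds the output from delay_samples ago.
            y = x + self.feedback * self.line[self.pos]
            self.line[self.pos] = y
            self.pos = (self.pos + 1) % len(self.line)
            out.append(y)
        return out


# Splitting an impulse across two blocks still produces echoes in the second
# block, because the delay line carries state from the first.
fx = FeedbackDelay(delay_samples=2, feedback=0.5)
first = fx.process([1.0, 0.0, 0.0])
second = fx.process([0.0, 0.0])
```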

So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.

Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
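For readers who haven't met one, here is a minimal sketch of the index logic of an SPSC ring buffer, using a hypothetical `SpscRing` class. The storage is allocated once in the constructor; pushing and popping only write a slot and advance an index. This Python version shows only that logic. A native implementation would make the two indices atomic with acquire/release ordering, which is not modeled here.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer (index logic only)."""

    def __init__(self, capacity):
        # Allocated once. One slot stays empty to tell "full" from "empty".
        self._slots = [0.0] * (capacity + 1)
        self._head = 0  # next write index; only the producer modifies it
        self._tail = 0  # next read index; only the consumer modifies it

    def try_push(self, value):
        nxt = (self._head + 1) % len(self._slots)
        if nxt == self._tail:
            return False  # full: the caller decides whether to retry or drop
        self._slots[self._head] = value
        self._head = nxt  # publish only after the slot has been written
        return True

    def try_pop(self):
        if self._tail == self._head:
            return None  # empty
        value = self._slots[self._tail]
        self._tail = (self._tail + 1) % len(self._slots)
        return value
```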

I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.

BubRoss · on June 14, 2019

> L3 latency you're so worried about

I'm not sure where this is coming from. If one CPU is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs through shared cache or memory, but either way it isn't necessary.

Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.

The difference it makes is up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency depends on how many different transformations exist, when in reality the number of cores that can be utilized comes down to the number of groups of data that can be processed without dependencies.
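A sketch of what partitioning by data rather than by filter could look like, assuming the input already splits into independent groups (for example, separate channels or streams) and using made-up names (`process_group`, `process_all`). Each worker runs the whole chain on one group, so the worker count follows the number of independent groups, not the number of filters. Python threads are used only to show the partitioning; they are not a claim about real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def process_group(samples):
    """Run the whole (hypothetical) chain on one independent group, in place."""
    for i in range(len(samples)):
        s = samples[i] * 2.0                 # stage 1: gain
        samples[i] = max(-1.0, min(1.0, s))  # stage 2: hard clip
    return samples


def process_all(groups, max_workers):
    """Give each independent group to a worker; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_group, groups))


# Two independent groups can use up to two workers, however many
# stages the chain has.
results = process_all([[0.25, 1.0], [-0.75]], max_workers=2)
```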

> SPSC ring buffers

That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.

reitzensteinm · on June 14, 2019

Are you claiming that the act of writing bytes to a ring buffer is memory allocation? In that case I misunderstood what you were saying and it was indeed a straw man.

In any case, we're clearly not going to find common ground here.