I'd like to see support for optimized multiplication on symmetric matrices. I don't think BLAS can take advantage of that; scipy/numpy certainly doesn't.
There is the BLAS level-2 routine (s|d)symv for matrix-vector multiplies and the level-3 routine (s|d)symm for matrix-matrix multiplies. Last time I benchmarked symv, it was around 25% slower than the general implementation (gemv)...
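For reference, scipy does expose these routines through its low-level BLAS wrappers in `scipy.linalg.blas`, even though the high-level `@`/`dot` path never uses them. A quick sketch showing that `dsymv` reads only one triangle of the array (the `lower=1` argument and the use of junk data in the upper triangle are my own illustration, not from the thread):

```python
import numpy as np
from scipy.linalg.blas import dgemv, dsymv

# Sketch: with lower=1, dsymv reads only the lower triangle, so the
# (arbitrary) contents of the upper triangle never affect the result.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))          # upper triangle is effectively junk
sym = np.tril(a) + np.tril(a, -1).T      # the symmetric matrix symv "sees"
x = rng.standard_normal(4)

y_general = dgemv(1.0, sym, x)           # general product on the full matrix
y_symmetric = dsymv(1.0, a, x, lower=1)  # symmetric product, lower storage
assert np.allclose(y_general, y_symmetric)
```

This is also why a dedicated symmetric type could halve storage: only one triangle ever needs to exist in memory.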
Disclaimer: I have very, very, very little experience using BLAS. The reasons I post this are:
- the earlier comment gave an unqualified speed difference, which can't reasonably be the full story; there is likely an implicit "for my use case" qualifier.
- I was curious, too, but couldn't Google benchmarks.
Having said that, my guess would be that it is slower for small matrices (where algorithm overhead dominates), but faster for larger ones (where runtime is probably dominated by the amount of data accessed divided by memory bandwidth, and symv touches half the data). There's a similarity here with searching a sorted array: a linear search beats a binary search up to a surprisingly large N.
I wouldn't dare guess where the cut-off point lies, but it is likely above the size where a matrix row fills a cache line (below that, reading even a few entries of a row pulls in the entire row anyway). For a 64-byte cache line and 4-byte single-precision floats, that is 16 floats per line, i.e. a 16x16 matrix.
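Since I couldn't find published benchmarks either, here is a minimal timing sketch one could run to locate the cross-over on a given machine. The size list and repetition counts are arbitrary choices of mine, and the results depend entirely on the BLAS build, CPU, and cache hierarchy:

```python
import timeit
import numpy as np
from scipy.linalg.blas import dgemv, dsymv

# Rough benchmark sketch: compare general (gemv) vs symmetric (symv)
# matrix-vector products across a few sizes. Treat any cut-off this
# suggests as machine- and BLAS-build-specific.
rng = np.random.default_rng(0)
timings = {}
for n in (16, 256, 1024):
    a = rng.standard_normal((n, n))
    a = np.asfortranarray(a + a.T)   # symmetric, Fortran order (avoids a copy in the wrapper)
    x = rng.standard_normal(n)
    t_gemv = min(timeit.repeat(lambda: dgemv(1.0, a, x), number=100, repeat=3))
    t_symv = min(timeit.repeat(lambda: dsymv(1.0, a, x), number=100, repeat=3))
    timings[n] = (t_gemv, t_symv)
    print(f"n={n:5d}  gemv={t_gemv:.2e}s  symv={t_symv:.2e}s  ratio={t_symv/t_gemv:.2f}")
```

A ratio above 1.0 would reproduce the slowdown reported above; a ratio below 1.0 at large n would support the memory-bandwidth argument.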