I'd like to see support for optimized multiplication on symmetric matrices. I don't think BLAS can take advantage of that; scipy/numpy certainly doesn't.
There is the BLAS level-2 routine (s|d)symv for matrix-vector multiplies and the level-3 routine (s|d)symm for matrix-matrix multiplies. Last time I benchmarked symv, it was around 25% slower than the general implementation (gemv)...
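For reference, scipy does expose these routines through its low-level BLAS wrappers in `scipy.linalg.blas`, even though the high-level `@`/`dot` path never uses them. A quick sketch showing that `dsymv` reads only one triangle of the array (the `lower=1` argument and the use of junk data in the upper triangle are my own illustration, not from the thread):

```python
import numpy as np
from scipy.linalg.blas import dgemv, dsymv

# Sketch: with lower=1, dsymv reads only the lower triangle, so the
# (arbitrary) contents of the upper triangle never affect the result.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))          # upper triangle is effectively junk
sym = np.tril(a) + np.tril(a, -1).T      # the symmetric matrix symv "sees"
x = rng.standard_normal(4)

y_general = dgemv(1.0, sym, x)           # general product on the full matrix
y_symmetric = dsymv(1.0, a, x, lower=1)  # symmetric product, lower storage
assert np.allclose(y_general, y_symmetric)
```

This is also why a dedicated symmetric type could halve storage: only one triangle ever needs to exist in memory.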
Disclaimer: I have very, very, very little experience using BLAS. The reasons I post this are:
- the earlier comment gave an unqualified speed difference, which can't reasonably be the full story; there is likely an implicit "for my use case" qualifier.
- I was curious, too, but couldn't Google benchmarks.
Having said that, my guess would be that it is slower for small matrices (where algorithm overhead dominates), but faster for larger ones (where runtime is probably dominated by the amount of data accessed divided by memory bandwidth, and symv touches half the data). There's a similarity here with searching a sorted array: a linear search beats a binary search up to a surprisingly large N.
I wouldn't dare guess where the cut-off point lies, but it is likely above the size where a matrix row fills a cache line (below that, reading even a few entries of a row pulls in the entire row anyway). For a 64-byte cache line and 4-byte single-precision floats, that is 16 floats per line, i.e. a 16x16 matrix.
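Since I couldn't find published benchmarks either, here is a minimal timing sketch one could run to locate the cross-over on a given machine. The size list and repetition counts are arbitrary choices of mine, and the results depend entirely on the BLAS build, CPU, and cache hierarchy:

```python
import timeit
import numpy as np
from scipy.linalg.blas import dgemv, dsymv

# Rough benchmark sketch: compare general (gemv) vs symmetric (symv)
# matrix-vector products across a few sizes. Treat any cut-off this
# suggests as machine- and BLAS-build-specific.
rng = np.random.default_rng(0)
timings = {}
for n in (16, 256, 1024):
    a = rng.standard_normal((n, n))
    a = np.asfortranarray(a + a.T)   # symmetric, Fortran order (avoids a copy in the wrapper)
    x = rng.standard_normal(n)
    t_gemv = min(timeit.repeat(lambda: dgemv(1.0, a, x), number=100, repeat=3))
    t_symv = min(timeit.repeat(lambda: dsymv(1.0, a, x), number=100, repeat=3))
    timings[n] = (t_gemv, t_symv)
    print(f"n={n:5d}  gemv={t_gemv:.2e}s  symv={t_symv:.2e}s  ratio={t_symv/t_gemv:.2f}")
```

A ratio above 1.0 would reproduce the slowdown reported above; a ratio below 1.0 at large n would support the memory-bandwidth argument.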