I'd like to see the worst-case behaviour too - e.g. hilariously wrong results. Seeing the failures (and their rate) makes the results more believable and allows a less biased evaluation of the system's capabilities.

It's like those "up to XX% better" claims - with "up to", not "at least", being the key phrase here.