Even if it is a joke, having a consistent methodology is useful. I did it for ab...

simonw · 2025-06-08T12:48:16 1749386896

It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!

pama · 2025-06-08T13:01:26 1749387686

How did the pelicans of point releases of V3 and of R1 (R1-0528) do compared to the original versions of the models?

famouswaffles · 2025-06-08T19:02:08 1749409328

LLMs also have a 'g factor' https://www.sciencedirect.com/science/article/pii/S016028962...

MichaelZuo · 2025-06-08T13:26:06 1749389166

I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.

johnrob · 2025-06-08T15:41:01 1749397261

Well, the most likely single random sample would be a “representative” one :)

tuananh · 2025-06-08T13:48:26 1749390506

until they start targeting this benchmark

simonw · 2025-06-08T14:17:37 1749392257

Right, that was the closing joke for the talk.

jonstewart · 2025-06-08T18:13:49 1749406429

It is funny to think that a hundred years in the future there may be some vestigial area of the models’ networks that’s still tuned to drawing pelicans on bicycles.

more-nitor · 2025-06-08T13:45:21 1749390321

I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...

people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.

Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...

...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?

obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...

Breza · 2025-06-13T15:11:35 1749827495

Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.