
They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?



This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.


And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
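Back-of-the-envelope sketch of that point, assuming each score is a binomial proportion over roughly 500 independent tasks (the size of SWE-bench Verified; other suites vary, and real task outcomes may be correlated):

```python
import math

def diff_ci_half_width(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """95% CI half-width for the difference of two benchmark accuracies,
    treating each as an independent binomial proportion over n tasks."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return z * se

# The thread's example: 91.9% vs 92.4% on a 500-task benchmark.
hw = diff_ci_half_width(0.919, 0.924, 500)
print(f"observed gap: 0.5 pp, 95% CI half-width: {hw * 100:.1f} pp")
# -> observed gap: 0.5 pp, 95% CI half-width: 3.3 pp
```

A 0.5-point gap sits well inside a roughly ±3.3-point confidence interval, so under these assumptions the two scores are statistically indistinguishable.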


agreed.


The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).


uh oh, where did SWE bench go :D


maybe they will release with gpt-5.2-codex


The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

But they publish all the same numbers, so you can make the full comparison yourself, if you want to.


They are taking a page out of Apple's book.

Apple only compares to themselves. They don't even acknowledge the existence of others.


OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.


https://openai.com/index/hello-gpt-4o/

I see evaluations compared with Claude, Gemini, and Llama there on the GPT 4o post.


“You are absolutely right, and I apologize for the confusion.”


Tax advisor here! There is so much potential for this tool. I work on this specific problem (for a multinational startup), and we always resort to manual research (or, nowadays, LLMing our way toward interesting constructions). With this tool I can basically create a simple "starting point" for finding intelligent routes!

This deserves much more attention than it has received so far! I would consider reposting it over the weekend if I were you.


Thx for the nice words! I appreciate it. I have noticed the same: it's hard to get exposure on HN. I used to frequent HN under a different username in my early teens, and back then (in the 2010s) it was much easier to get noticed. I guess the problem is that the community has grown slightly too large. It's also a little too easy to create an account. If HN were to adopt an invite-only approach for posting, individuals would get much more exposure, but then again, getting content out there would be difficult. I'll repost this in the coming days, on an "ideal" day.

If you like the tool, feel free to sign up on the TreatyHopper+ form (it's in the top right corner of the website), so I can gauge interest from within the industry. It also gives me some extra motivation :)


Great read! Ngl, I generally like Dr. Cook's bite-sized approach to content: it's a quick read and you learn something new.

