
They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?



This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.


And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
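Back-of-the-envelope sketch of that point, assuming each score is a binomial proportion over roughly 500 independent tasks (the size of SWE-bench Verified; other suites vary, and real task outcomes may be correlated):

```python
import math

def diff_ci_half_width(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """95% CI half-width for the difference of two benchmark accuracies,
    treating each as an independent binomial proportion over n tasks."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return z * se

# The thread's example: 91.9% vs 92.4% on a 500-task benchmark.
hw = diff_ci_half_width(0.919, 0.924, 500)
print(f"observed gap: 0.5 pp, 95% CI half-width: {hw * 100:.1f} pp")
# -> observed gap: 0.5 pp, 95% CI half-width: 3.3 pp
```

A 0.5-point gap sits well inside a roughly ±3.3-point confidence interval, so under these assumptions the two scores are statistically indistinguishable.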


agreed.


The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).


uh oh, where did SWE bench go :D


maybe they will release with gpt-5.2-codex


The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

But they publish all the same numbers, so you can make the full comparison yourself, if you want to.


They are taking a page out of Apple's book.

Apple only compares to themselves. They don't even acknowledge the existence of others.


OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.


https://openai.com/index/hello-gpt-4o/

I see evaluations compared with Claude, Gemini, and Llama there on the GPT 4o post.


“You are absolutely right, and I apologize for the confusion.”


Tax advisor here! There is so much potential for this tool. I work on this specific problem (for a multinational startup), and we always resort to manual research (or, nowadays, LLMing our way toward interesting constructions). With this tool I can basically create a simple "starting point" for finding intelligent routes!

This deserves much more attention than it has received so far! I would consider reposting it over the weekend if I were you.


Thx for the nice words! I appreciate it. I have noticed the same: it's hard to get exposure on HN. I used to frequent HN under a different username in my early teens, and back then (in the 2010s) it was much easier to get noticed. I guess the problem is that the community has grown slightly too large. It's also a little too easy to create an account. If HN were to adopt an invite-only approach for posting, individuals would get much more exposure, but then again, getting content out there would be difficult. I'll repost this in the coming days, on an "ideal" day.

If you like the tool, feel free to sign up on the TreatyHopper+ form (it's in the top right corner of the website), so I can gauge interest from within the industry. It also gives me some extra motivation :)


Great read! Ngl, I generally like Dr. Cook's bite-sized approach to content: it's a quick read and you learn something new.

