
The inherent problem with evaluating the coding performance of models remains: most day-to-day coding tasks are open-ended and only partially spec'd, so there is huge uncertainty about what the "right" solution looks like.

It's very hard to rank models' solutions on such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected).

Even Opus 4.5 coding a C compiler from scratch, jaw-dropping as it is, doesn't tell the whole story. Most of my tasks are not that well spec'd.



Yes, it seems the openly reported benchmark results, such as SWE-bench, SWE-bench Verified, and Terminal-bench, aren't really that indicative of success in more general use cases.

According to Gemini, SWE-bench is actually a very narrow test: fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that). And Terminal-bench (basically agentic computer tool use) is focused more on general-purpose terminal tasks than on the workflows of a typical coding agent such as Claude Code, Codex CLI or Gemini CLI.



