
The inherent problem with evaluating the coding performance of models remains: most day-to-day coding tasks are open-ended and only partially spec'd, so there is huge uncertainty about what the "right" solution looks like.

It's very hard to rank models' solutions on such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected).

Even Opus 4.5 coding a C compiler from scratch, jaw-dropping as it is, doesn't tell the whole story. Most of my tasks are not that well spec'd.



Yes, it seems the openly reported benchmark results, such as SWE-bench, SWE-bench Verified, and Terminal-bench, aren't really that indicative of success in more general use cases.

According to Gemini, SWE-bench is actually a very narrow test: fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that). And Terminal-bench (basically agentic computer tool use) is focused more on general-purpose terminal tasks than on the workflows of a typical coding agent such as Claude Code, Codex CLI or Gemini CLI.



