yep, ran a controlled experiment on 28 tasks comparing old opus 4.6 vs new opus 4.6 vs 4.7, and found that 4.7 is comparable in cost to old 4.6, and ~20% more expensive than new 4.6 (because new 4.6 thinks less)
A fun conspiracy theory I have is that Mythos isn’t actually dangerous in any serious sense. They just can’t reliably serve a 10T model. So they have to make up a reason to limit customers.
coming more in line with codex - claude previously would often ignore explicit instructions that codex would follow. interested to see how this feels in practice
I think this line around "context tuning" is super interesting - I see a future where, for every model release, devs go and update their CLAUDE.md / skills to adapt to new model behavior.
working on something similar to evaluate model performance over time using tasks based on your own code. obviously this is still susceptible to the same hacking mechanics documented here, but at a local level, it's easier to detect/fix, and should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks
also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that up. I feel like half of the nerfing complaints are people getting past the honeymoon phase...
a bit heavier weight, but seems worthwhile if working in an org where many people consume the skill:
- find N tasks from your repo that are a good representation of the work you want the agent to do with the skill
- run agent with old skill/new skill against those tasks
- measure test pass rate / other quality metrics that you care about with skill
- token usage, speed, alignment, ...
- tests aren't a great measure alone - I've found them to be almost bimodal (most models either pass or fail nearly everything) and not a good differentiator
- use this to make decisions about what to do with the skill - keep skill A, promote skill B, or keep tweaking
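The loop above can be sketched as a tiny harness. Everything here is a hypothetical placeholder: `run_agent` is a deterministic stub standing in for a real agent invocation (e.g. a subprocess call to your CLI agent with the skill file loaded), and the task/skill names are made up:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunResult:
    passed: bool   # did the task's quality gate (tests, lint, ...) pass?
    tokens: int    # total tokens the run consumed

def run_agent(skill: str, task: str) -> RunResult:
    # Stub so the sketch runs end to end; replace with a real
    # agent invocation against a clean checkout of the task.
    passed = skill == "skill_b" or task.endswith("easy")
    return RunResult(passed=passed, tokens=1200 if skill == "skill_b" else 1000)

def evaluate(skills: list[str], tasks: list[str]) -> dict[str, dict]:
    """Run every skill variant against every task and aggregate
    the metrics you care about (pass rate, token cost, ...)."""
    report = {}
    for skill in skills:
        runs = [run_agent(skill, t) for t in tasks]
        report[skill] = {
            "pass_rate": mean(r.passed for r in runs),
            "mean_tokens": mean(r.tokens for r in runs),
        }
    return report

tasks = ["fix-flaky-test-easy", "refactor-auth", "migrate-schema"]
report = evaluate(["skill_a", "skill_b"], tasks)
```

The `report` dict is what you'd actually look at when deciding to keep skill A, promote skill B, or keep tweaking.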
I've also had success with an "autoresearch" variant of this, where I have my agent run these tests in a loop and optimize for the scores I'm grading on.
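A minimal sketch of that autoresearch loop, as greedy hill-climbing. Both helpers are hypothetical stubs: in practice `score` would run the full task suite with the candidate skill, and `propose_revision` would ask the agent itself to rewrite the skill text:

```python
import random

def score(skill_text: str) -> float:
    # Stand-in for running the task suite and returning a scalar
    # grade (e.g. weighted pass rate minus token cost). Stubbed to
    # reward skill text near a target length so the loop is runnable.
    return 1.0 / (1 + abs(len(skill_text) - 40))

def propose_revision(skill_text: str, rng: random.Random) -> str:
    # Stand-in for asking the agent to rewrite the skill; stubbed
    # as a random trim/extend.
    if rng.random() < 0.5 and len(skill_text) > 10:
        return skill_text[:-5]
    return skill_text + " Keep diffs minimal."

def autoresearch(skill_text: str, iterations: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = skill_text, score(skill_text)
    for _ in range(iterations):
        candidate = propose_revision(best, rng)
        s = score(candidate)
        if s > best_score:   # greedy: only keep improvements
            best, best_score = candidate, s
    return best, best_score

skill, grade = autoresearch("Always write tests first. Prefer small, reviewable commits.")
```

Greedy acceptance means the loop can never regress below the starting skill's score, which is the property you want when this runs unattended.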
cost control is a policy problem - we certainly don't need opus 4.6 for a simple test refactor, but many people (including myself) default to it anyway. we need a way to measure cost / performance for agents on individual repos, with individual types of tasks, to get a better sense of which tasks can be trusted to cheaper agents, and which tasks must be routed to the SOTA
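One way such a policy could look once you have per-repo measurements: route by task type based on the cheap model's measured pass rate, defaulting unknown task types to the SOTA model. The stats table, model names, and threshold below are all hypothetical:

```python
# Hypothetical per-repo stats: measured pass rate of the cheap model
# by task type, collected from past benchmark runs on this repo.
CHEAP_MODEL_PASS_RATE = {
    "test-refactor": 0.95,
    "doc-update": 0.98,
    "schema-migration": 0.60,
}

def route(task_type: str, threshold: float = 0.9) -> str:
    """Send a task to the cheap model only when measured evidence
    says it handles this task type reliably; anything unknown or
    below threshold goes to the SOTA model."""
    rate = CHEAP_MODEL_PASS_RATE.get(task_type, 0.0)
    return "cheap-model" if rate >= threshold else "sota-model"
```

The point is that the routing table is an output of the per-repo benchmark, not a static config someone guessed at.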
managing agents.md is important, especially at scale. however I wonder how much of a measurable difference something like this makes? in theory it's cool, but can you show me that it actually performs better compared to a large agents.md, nested agents.md, or skills?
more general point being that we need to be methodical about the way we manage agent context. if lat.md shows a 10% broad improvement in agent perf in my repo, then I would certainly push for adoption. until then, vibes aren't enough
I'm working on a blog post and on benchmarks. Here [1] Armin suggested I take something like quickjs, build a lat base for it, and compare side by side how, say, claude code works with lat vs. without.
I'm very early into this and need to build a proper harness, but I can sometimes see lat allowing for up to 2x faster coding sessions. But the main benefit to me isn't speed, it's the fact that I can now review diffs faster and stay more engaged with the agent.
Very cool, interested to read more once you post! FWIW I've been building eval infra that does something adjacent/related — replaying real repo work against different agent configs, and measuring the agent's quality dimensions (pass/fail, but also human intent alignment, code review, etc.). If you want to compare notes on the harness design, or if having an independent eval of lat vs. no-lat on quickjs would be useful, happy to chat :)
I'm also thinking about how we can put guardrails on Claude - but more around context changes. For example, if you go and change AGENTS.md, that affects every dev in the repo. How do we make sure the change is actually beneficial? and thinking further, how do we check that it works on every tool/model used by devs in the repo? does the change stay stable over time?
Given the scope that AGENTS has, I would use PRs to test those changes and discuss them like any other large-impact area of the codebase (like configs).
If you wanted to be more “corporate” about it, then assuming that devs are using some enterprise wrapper around Claude or whatever, I would bake an instruction into the system prompt that ensures that AGENTS is only read from the main branch to force this convention.
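A wrapper enforcing that convention could read the file straight from git rather than the working tree. This is a sketch, assuming the default branch is named `main` and the wrapper has shell access to the repo; `agents_from_branch` is a made-up helper name:

```python
import subprocess

def agents_from_branch(repo_dir: str, branch: str = "main") -> str:
    """Read AGENTS.md as committed on `branch`, ignoring any local
    working-tree edits, so an unreviewed change can't steer the
    agent until it has gone through a PR and been merged."""
    result = subprocess.run(
        ["git", "show", f"{branch}:AGENTS.md"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`git show <branch>:<path>` reads the blob from the commit, so local edits to AGENTS.md simply don't exist from the wrapper's point of view.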
This is harder to guarantee since these tools are non-deterministic.
PRs for AGENTS.md are necessary, but not sufficient, exactly because of non-determinism. You can LGTM the AGENTS.md change, but it's so hard to know what downstream behavioral effects it has. I feel like the only way to really know is by building a benchmark on your repo, and actually A/B testing the AGENTS.md change. I'm building something in the space - happy to share if it's something that sounds interesting to you
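Because individual runs are non-deterministic, a single run per variant tells you almost nothing; one sketch of what "actually A/B testing" the change could look like is repeating each variant K times per task and only adopting on a clear pass-rate margin. `run_once` is a stub for a real agent run, and the margin and K are hypothetical knobs:

```python
from statistics import mean

def run_once(agents_md_variant: str, task: str, seed: int) -> bool:
    # Stand-in for one non-deterministic agent run against a task.
    # Stubbed so the sketch executes; variant "B" is made more reliable.
    return (seed + len(task)) % 10 < (9 if agents_md_variant == "B" else 7)

def pass_rate(variant: str, tasks: list[str], k: int = 5) -> float:
    # k repetitions per task to average out run-to-run noise.
    return mean(run_once(variant, t, s) for t in tasks for s in range(k))

def compare(tasks: list[str], margin: float = 0.05) -> str:
    a, b = pass_rate("A", tasks), pass_rate("B", tasks)
    if b - a > margin:
        return "adopt B"
    if a - b > margin:
        return "keep A"
    return "inconclusive: gather more runs"
```

The margin is doing the statistical heavy lifting here; with few tasks you'd want a proper significance test rather than a fixed threshold, but the shape of the decision is the same.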
I'm becoming convinced that test pass rate is not a great indicator of model quality - instead we have to look at agent behavior beyond the test gate, such as how aligned it is with human intent, and whether it follows the repo's coding standards.
also +1 on placing heavy emphasis on the plan. if you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent planning vs. implementing & reviewing
However, this doesn't mean we should completely give up on benchmarking. In fact, as models get more intelligent, and we give them more autonomy, I believe that tracking agent alignment to your coding standards becomes even more important.
What I've been exploring is making a benchmark that is unique per-repo - answering the question of how the coding agent performs in my repo on my tasks with my context. No longer do we have to trust general benchmarks.
Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance, and allowing them to use that information to tweak and optimize the agent further
Really interesting study. One thing I keep coming back to is that tests have no way of catching this sort of tech debt. The agent can introduce something that will make you rip your hair out in 6 months, but tests are green...
My theory is that at least some of this is solvable with prompting / orchestration - the question is how to measure and improve that metric. i.e. how do we know which of Claude/Codex/Cursor/Whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?
https://www.stet.sh/blog/opus-4-7-zod