Hacker News | uberman's comments

Fascinating read. I know nothing about any of this: neither the parties involved nor Copperhead, though I had heard of Graphene. To that end, I wish the response included a preamble for those like me who were not familiar with what was going on. I guess I could probably read the Wired article. Still, good read, and I loved the Q and A at the end.

Every tracker is likely chomping at the bit to purchase your chats. I don't know but also would not be surprised to find out that every major player already sells this data. Cloud computing at some level is just an information broker extraction tool.

I would not pool my data with others nor sell it if I knew how to prevent such things from happening though I do believe they are happening.


You are right, it's likely happening. The pitch I'm chewing on is flipping that: you own the pipe, you decide if anything leaves, and if it does you get paid instead of the platform. Curious what would make you trust a setup like that, or is it a hard no regardless? Maybe you also get to see what sort of PII you have and the platform sanitizes it...

On actual code, I see what you see: a 30% increase in tokens, which is in line with what they claim as well. I personally don't tend to feed technical documentation or random prose into LLMs.

Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"

Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.


Yeah, that was an interesting discovery in a development meeting. Many people were chasing after the next best model, though for me Sonnet 4.6 resolves most tasks in 1-2 rounds. I mainly need some focus on context, instructions, and keeping tasks well-bounded. Keeping the task narrow also simplifies review and staying in control, since I usually get back smaller diffs I can understand quickly and manage or modify later.

I'll look at the new models, but increasing token consumption by a factor of 7 on Copilot, and then running into all of these budget management issues people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe in some planning and architectural work where I used Opus 4.6 before.


I wonder if there are different use cases. You sound like you’re using an LLM in a similar way to me. I think about the problem and solution, describe what I need implemented, provide references in the context (“the endpoint should be structured like this one…”) and then evaluate the output.

It sounds like other folks are more throwing an LLM at the problem to see what it comes up with, more akin to how I delegate a problem to one of my human engineers/architects. I understand, conceptually, why they might be doing that, but I know that I stopped trying that because it didn't produce quality. I wonder if the newer models are better at handling that ambiguity.


I don't understand how people measure how much more or less work they need to do. It's not that gpt-4o was incapable of emitting enormous amounts of code quickly; it's that the tokens were relative garbage.

How do you have an opinion on 4.6/4.7 here? It's less clear-cut, but I could totally see that 4.7 or beyond leads to project completion 20% faster by removing dead ends and footguns and reducing backtracking.

How to tell / measure effectively? No clue.


My personal opinion here, based on observation, not empirical testing: 4.5 could generate code, but I often ran out of context and the results were regularly incomplete. The result was that I had to spend as much time proofing and debugging as I did making direct progress.

4.6 has what in practice seems to be an almost unlimited context window and rarely produces incomplete or flat-out wrong results. That is a big step forward, though I do burn through quota much faster.

I have not formed an opinion yet on what 4.7 does for me, other than to say I have observed my quota being consumed faster. To be fair, I have not put 4.7 to a challenging task yet.

It honestly surprises me that someone who regularly uses Claude would not have an opinion about 4.6, or even Opus vs Sonnet, at this point. The lift, at least for me, was obvious.


Haven't people been complaining lately about 4.6 getting worse?

People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...


I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...

But... are you really going to completely rely on benchmarks, which have time and time again been shown to be gamed, as the complete story?

My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.


While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor of two performance difference.

Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.


Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected and plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e. probably through the API and not in interactive mode.)
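To make the variance point concrete, here is a quick back-of-the-envelope sketch in Python (my own illustration, not MarginLab's methodology): the normal-approximation 95% interval for a pass rate near 50% measured over only 50 tasks spans roughly plus or minus 14 percentage points, which is consistent with the 35%-65% band noted upthread.

```python
from math import sqrt

def wald_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# 50 tasks at a ~50% pass rate, which is the worst case for variance
hw = wald_halfwidth(0.5, 50)
print(f"95% CI half-width: +/- {hw:.1%}")  # prints: 95% CI half-width: +/- 13.9%
```

With day-to-day noise that large, a week-long losing streak tells you very little about whether the model itself changed.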

> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that are 'easy passes' or 'near-always failures', and then paired testing could offer greater precision on improvements or degradations.

Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.


Here's a sample-size calculator that may help illustrate the issue: https://sample-size.net/sample-size-proportions/ Put in the benchmark score of one model as p₀ and of the other model as p₁ (as a fraction between 0 and 1) and observe what kind of sample size you need to reliably observe a significant difference. The largest change between GPT 5.2 and 5.4 highlighted in https://openai.com/index/introducing-gpt-5-4/ is OSWorld-Verified going from 47.3% to 75.0%. That's quite the difference, right? So plug in 0.473 and 0.75 and note that the required sample size per model is 55. For the software engineering tasks in SWE-Bench Pro, the change from 55.6% to 57.7% is a whopping 2.1 percentage points, which you can detect with a mere 8836 samples.

I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.
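For anyone who wants to reproduce those numbers without the web calculator, here is a sketch of the standard two-proportion sample-size formula with the Fleiss continuity correction, which appears to be what sample-size.net uses (that attribution is my assumption; the formula itself is textbook):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p0: float, p1: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples per model needed to detect p0 vs p1 with a two-sided test,
    using the two-proportion formula plus Fleiss continuity correction."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    pbar = (p0 + p1) / 2
    d = abs(p1 - p0)
    # Uncorrected sample size per group
    n = (z_a * sqrt(2 * pbar * (1 - pbar))
         + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2 / d ** 2
    # Fleiss continuity correction
    return ceil(n / 4 * (1 + sqrt(1 + 4 / (n * d))) ** 2)

print(n_per_group(0.473, 0.75))   # OSWorld-Verified jump: prints 55
print(n_per_group(0.556, 0.577))  # SWE-Bench Pro jump: prints 8836
```

The quadratic blow-up in the denominator (d²) is why a 2-point gap needs thousands of runs while a 28-point gap needs a few dozen.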


Matrix also found that Claude was A/B testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16


That performance monitor is super easy to game if you cache responses to all the SWE bench questions.

You dramatically overestimate how much time engineers at hypergrowth startups have on their hands

There's a direct business incentive to game/cheat benchmarks, it wouldn't even be difficult to do, and besides, they have workforce-replacing AI to do it for them.

Caching some data is time consuming? They can just ask Claude to do it.

Your link shows there have been huge drops.

How is it fine?


No, we increased our plans.

How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for much longer, if at all.

I was trying to figure out earlier today how to get 4.6 to run in Claude Code, and as part of the output it included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat: I don't know where it came up with this information, but as others have said, 4.5 is still available today and it is now 5, almost 6 months old.

I'm still using 4.5 because it gets the niche work I'm using it for where 4.6 would just fight me.

Opus 4.5 is still available

Wow, they hosted it for 6 months. Truly LTS territory :)

Did you mean $45m?

At 15 mph it would take less than two seconds for a car at the back of a bus to reach the front of the bus. Are you suggesting that the driver of the bus was able to open and then close their door in less than two seconds? Alternatively, are you suggesting that your daughter was driving slower than 15 mph yet was unable to stop?

I certainly believe there is room for discretion when officers write tickets, but not for passing a school bus.


While I acknowledge there is now a legality question around the use of "red light cameras", I have no sympathy for people who are not stopping for school buses. I can't stomach that the article frames this as a "burden" on those driving past the bus:

"there’s evidence the program is heavily burdening residents who either can’t or don’t pay the fines."


I'm not condoning vandalism but I can empathize with the feeling that this alien thing is in my personal space, is a motorized vehicle on the sidewalk, is just as likely to cause a fall that no one will be held accountable for, and is taking a job from someone. I can see how that would be rage inducing. Perhaps surrounding it with traffic cones would be a better plan than actually damaging it.

I lived in Philadelphia (Center City), and my other reaction, based on simply attempting to keep a flowerpot on my doorstep, is: why have people not just stolen it yet?


These devices are a form of social pollution, whereby the desires and demands of others are mechanically proxied into common spaces.

When you negotiate with others on the causeway, you are involved in human one-on-one exchanges with parity, each encountering the others on the level of interpersonal status, which is about the ways humans observe respect for each other.

But there can be no respect given nor received with a robot. It's an engine in competition for your space; it presents as both a mechanical advantage and as handicapped, is neither interesting nor appropriate to meet, and generally responds so stupidly and unpredictably that it's hazardous, which makes its insertion into the commons an offense.

Combine the need for vigilance and avoidance with the realization that the robot annoyance is a proxy for someone else's privilege, and that robots are instruments of private property extending deeply into common spaces, and it's not a surprise to find people who are encroached upon by robots manifesting their displeasure through sabotage.


Isn't it the expected thing that LLMs degrade over time?

So, you feel the people who are providing software for free now somehow owe you support as well, now that it is so easy to build software? How about putting the onus on yourself to fix the bugs, with a PR or two, in the open source product you use?


I am not talking about free open source software here.

Two things I meant more specifically:

- Not every indie product is open source, so “send a PR” is often not an option.

- I am mainly talking about paid products. If a developer charges money and asks users to trust them early, I think some basic follow-through comes with that.

I have a lot of respect for open source maintainers. That is a very different relationship.


You've read the post, right? Especially this part:

> It is why I paid for your app...

This is about closed-source, paid software; no PRs are possible there.


Why do you say they are not eating their own dogfood? That phrase suggests something different to me than "crappy support". I'm not condoning crappy support, but are there any "at scale" SaaS platforms that actually have support?

I also don't want to be the bad guy here, but:

"I'm paying $200/month for Claude Max on my own dime, not my company's. I'm a Technology Director at a Fortune 50 company, using Claude personally to learn and then advocate for the right tools in our enterprise environment. That context matters for what follows."

No, it does not. It makes no difference if you pay or your company pays, or if your product is making money or you are self-educating. If you feel that you are not getting a $200/month return on your investment, then you should cancel your subscription. I also struggle to understand why you are using a $200/month plan for investigation and testing when there are $25/month options.


Fair pushback on the framing. "Dogfooding" to me means: does Anthropic rely on their own product under real-world conditions enough that they feel these pain points and prioritize fixing them? It's less about support and more about product reliability signals.

On the credentials: you're right, it reads like I'm fishing for VIP treatment. That wasn't the intent; the point was about how enterprise AI adoption actually works (practitioners test, validate, then advocate up the chain). I probably led with it too hard.

And on the $200 plan: I'm running multi-agent workflows via Claude Code that come close to saturating even Max tier limits, so the $25 option isn't a realistic fit for that workload, and I was hitting my limits quite a bit. This is to build a couple of personal projects but also to give it a true test of how it would be used from an enterprise perspective vs. my side projects. No doubt there is some room for me to optimize as I learn more, though. Hopefully I won't need to spend $200/month in the future when I'm more skilled with my prompts, use of projects, etc. That is another opportunity to leverage an agentic framework to assist users with adoption (which might also help to manage their scaling challenges).

