
I don't think that's what's being hinted at. The system card seems to say that the model is both token-efficient and slow in practice. Deep research modes generally work by having many subagents and a large token spend. So this is more likely down to each token simply taking longer to produce, which would be because the model is much larger.

By Epoch AI's datacenter tracking, Anthropic has had access to the largest amount of contiguous compute since late last year. So this might simply be the end result of being the first to have the capacity to conduct a training run of this size. Or the first seemingly successful one, at any rate.


"Slow and token-efficient" could be achieved quite trivially by taking an existing large MoE model and increasing the amount of active experts per layer, thus decreasing sparsity. The broader point is that to end users, Mythos behaves just like Deep Research: having it be "more token efficient" compared to running swarms of subagents is not something that impacts them directly.

Non-blinded self-experimentation is not a useful branch of empiricism.

I had an ME/CFS patient who had tried hundreds of things and documented the effects thoroughly. She had a quite impressive list. Roughly 30% had had an effect to begin with, but the trend she observed was that it lasted for around a month at most. Placebo was her overall conclusion, but she occasionally got relief anyway, so we both agreed that there was no harm in continuing. I'm sure several "peptides" are on her list by now.

There is nothing new under the sun, and fad cures for diffuse conditions have come and gone many times before. This is especially the case for conditions involving pain or tiredness, which are extremely sensitive to both placebo and nocebo.

What would be revolutionary would be 2-3 double-blinded RCTs showing a lasting effect. Which would be great if someone did! But you have to actually bother to do it. And personally I would put money on the outcome being "no effect".


What do you think about the mis-alignment between goals here?

For medical research, the goal is to find general practices that will broadly help, and to identify risks with the intervention. Even then, with many interventions, it's understood that they will affect people differently.

For individuals, they don't care about variation in communities, or standard medical practices, they are looking for relief for their specific condition.

Of course, declaring that just because something worked for one person, it should work for others, is wrong in both camps.

I feel like a big part of the disconnect here, and a big reason why people are talking past each other, is that they actually have different goals, and aren't really aware of that difference.


Well, to be honest I think the primary disconnect is in epistemological understanding. The OP did not declare peptides to be a personal revolution; he/she seemingly generalised their own experience to be widely applicable.

Basic human thought patterns usually lead people to think that anecdotes about their personal experience are valuable for understanding the world, but this is wrong. The scientific revolution basically illustrated the flaw in this premise outside of hypothesis generation. It takes specific education to make human beings truly believe that their anecdotal experiences are mostly irrelevant beyond understanding their immediate circumstances. The proportion of humanity that truly thinks this way is relatively small.

Understanding the world through anecdotes still works okay-ish for a lot of areas, but ascertaining relatively subjective effects of experimental pharmaceuticals is not one of them. To many people it's non-obvious that this is the case. And as a general method of thinking about this kind of issue, it is just the wrong way to go about things.

And that's the disconnect, in my opinion. The OP drew a conclusion from a thought pattern that comes easily to human beings, but that is just wrong in this situation. Of course, perhaps this is reinforced by underlying motivations, but that's not what makes people talk past each other. These kinds of discussions are usually driven by so-called "deep disagreements" in epistemological understanding, in my experience.


> Non blinded self experimentation is not a useful branch of empiricism.

Amen to this. The plural of anecdote is not data.

People have been hawking snake oil for centuries, and people have been believing snake oil cured them for centuries.


The blind spot exploiting strategy you link to was found by an adversarial ML model...

This counterpoint doesn't address the issue, and I would argue that it is partially bad faith.

Yes, making it to the test center is significantly harder, but the humans could have solved it from their home PC instead and performed exactly the same. However, if they had been given the same test as the LLMs, forbidden from any input beyond JSON, they would have failed. And although buying robots to do the test is unfeasible, giving LLMs a screenshot is easy.

Without visual input for LLMs in a benchmark that humans are asked to solve visually, you are not comparing apples to apples. In fact, LLMs are given a different and significantly harder task, and in a benchmark that is so heavily weighted against the top human baseline, the benchmark starts to mean something extremely different. Essentially, if LLMs eventually match human performance on this benchmark, this will mean that they in fact exceed human performance by some unknown factor, seeing as human JSON performance is not measured.

Personally, this hugely decreased my enthusiasm for the benchmark. If your benchmark is to be a North star to AGI, labs should not be steered towards optimizing superhuman JSON parsing skills. It is much more interesting to steer them towards visual understanding, which is what will actually lead the models out into the world.


I just realized that this also means that the benchmark is in practice unverified by third parties, as not all tasks are verified to be solvable through the JSON interface. Essentially there is no guarantee that it is even possible to understand how to complete every task optimally through the JSON interface alone.

I assume you did not develop the puzzles by visualizing JSON yourselves, and so there might be non-obvious information that is lost in translation to JSON. Until humans optimally solve all the puzzles without ever having seen the visual version, there is no guarantee that this is even possible to do.

I think the only viable solution here is to release a version of the benchmark with a vision only harness. Otherwise it is impossible to interpret what LLM progress on this benchmark actually means.


Oookay. I actually tried the harness myself, and there was a visual option. It is unclear to me if that is what the models are using on the official benchmark, but it probably is. This probably means that much of my critique is invalid. However, in the process of fiddling with the harness, building a live viewer to see what was happening, and playing through the agent API myself, I might have found 3-4 bugs with the default harness/API. Dunno where to post it, so of all places I am documenting the process on HN.

Bug 1: The visual mode "diff" image is always black, even if the model clicked on an interactive element and there was a change. Codex fixed it in one shot; the problem was in the main session loop in agent.py (line 458).
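For reference, a frame diff is conceptually just a pixel-wise subtraction. The sketch below is my own illustration, not the benchmark's agent.py (those internals are the authors'); an all-black diff is exactly what you would see if a frame were accidentally compared against itself instead of the post-action frame.

    # Illustrative frame diff, not the benchmark's actual code:
    # a pixel-wise subtraction of the pre- and post-action frames.
    import numpy as np
    from PIL import Image

    def diff_image(before: Image.Image, after: Image.Image) -> Image.Image:
        a = np.asarray(before.convert("RGB"), dtype=np.int16)
        b = np.asarray(after.convert("RGB"), dtype=np.int16)
        d = np.abs(a - b).astype(np.uint8)  # all zeros (black) iff the frames are identical
        return Image.fromarray(d)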

Bug 2: Claude and ChatGPT can't see the 128x128 pixel images clearly, and cannot accurately place clicks on them either. Scaling the images up to 1024x1024 pixels gave the best results; Claude dropped off hard at 2048 for some reason (a rough sketch of the upscaling step follows the table). Here are the full test results when models were asked to hit specific (manually labeled) elements on the "vc 33" level 1 (upper blue square, lower blue square, upper yellow rectangle, lower yellow rectangle):

  Model                  |   128 |   256 |   512 |  1024 |  2048
  claude-opus-4-6        |  1/10 |  1/10 |  9/10 | 10/10 |  0/10
  gemini-3-1-pro-preview | 10/10 | 10/10 | 10/10 | 10/10 | 10/10
  gpt-5.4-medium         |  4/10 |  8/10 |  9/10 | 10/10 |  8/10
Bug 3: "vc 33" level 4 is impossible to complete via the API. At least it was when I made a web viewer to navigate the games from the API side. The "canal lock" required two clicks instead of one to transfer the "boat" when the water levels were equilibrated, and after that any action whatsoever would spontaneously pop the boat back to the first column, so you could never progress.

"Bug" 4: This is more of a complaint on the models behalf. A major issue is that the models never get to know where they clicked. This is truly a bit unfair since humans get a live update of the position of their cursor at no extra cost (even a preview of the square their cursor highlights in the human version), but models if models fuck up on the coordinates they often think they hit their intended targets even though they whiffed the coordinates. So if that happens they note down "I hit the blue square but I guess nothing happened", and for the rest of the run they are fucked because they conclude the element is not interactive even though they got it right on the first try. The combination of an intermediary harness layer that let the models "preview" their cursor position before the "confirmed" their action and the 1024x1024 resolution caused a major improvement in their intended action "I want to click the blue square" actually resulting in that action. However, even then unintended miss-clicks often spell the end of a run (Claude 4.6 made it the furthest, which means level 2 of the "vc 33" stages, and got stuck when it missed a button and spent too much time hitting other things)

After I tried to fix all of the above issues and set up an optimal environment for the models to get a fair shake, they still mostly did very badly even when they identified the right interactive elements...except for Claude 4.6 Opus! Claude had at least one run where it made it to level 4 on "vc 33", but then got stuck because the blue squares it had to hit became too small, and it just couldn't get the cursor in the right spot even with the cursor preview functionality (the guiding pixel likely became too small for it to see clearly). Reading through its reasoning for the previous stages, though, it never fully understood the underlying logic of the game, although it was almost there.


This is incredibly naive. Hunter-gatherer communities, especially those in regions without an abundance of food, are and were extremely selective about who was accepted and who wasn't. This starts from infancy, where undesirable babies were simply killed. Estimates vary greatly, but perhaps around a third to half of "modern" hunter-gatherer tribes practice infanticide. The stated reasoning behind infanticides is often extremely vicious and comes down to "he/she is not a good fit for the tribe", or in other words "nobody likes him/her". This fact alone might be one of the major explanations for the high rate of prehistoric infant mortality.

But even if you are allowed to grow up and become an individual, things might be somewhat better once you are part of the in-group, but human empathy has an overall tendency to switch off towards those who are not. Even if you're loved because you're kin, your neighboring tribe might still kill you, or you and your kin might kill them, for entirely petty or cynical reasons. The prehistoric bone record supports this as well: injuries from human weapons appear to be the most common cause of death.

You can also examine your own emotions to get some idea of our evolutionary environment. Loneliness hurts, to the point where it has measurable negative health impacts equivalent to smoking a pack of cigarettes each day. Your brain is screaming at you not to be lonely, but why? Well, in our ancestral environment, being excluded from the social group meant death, so most individuals that did not have a profound and visceral fear of that happening got their genes consistently removed from the gene pool. For loneliness to be that big of a deal, being excluded must have been an easily available option. If everyone loved and accepted everyone unconditionally, this emotional state would simply not have evolved.

Humans quickly become extremely brutal once the environment necessitates it, up to and including cannibalizing their own kin. Infanticide and murder of both in-groups and out-groups are historically commonplace because they were also commonplace prehistorically. Even modern tribes that live in relative abundance are still brutal in many ways to this very day.

But of course, when you look at any group of individuals in a tribe, survivorship bias will dictate that it all looks nice and rosy. You might want to check the skeletons in the cave before you pick that as your conclusion.


Don't wake him up from his "Noble Savage" fever-dream.


Yes let's all revel in the sunlight of Enlightenment thinking. That's really going well.


>when you look at any group of individuals in a tribe, survivorship bias will dictate that it all looks nice and rosy

There is a lot of conjecture in your overall post, but I think this is a fair takeaway you put at the end.


Late reply, but in case you check back: I think most of what I said is drawn from sources of varying quality and salience, but at least it's sourced from somewhere. I just typed it all out quickly without checking anything over, so a lot might be wrong. But it's not entirely pulled out of my ass, at least.

Evolutionary history is of course always difficult. I think the loneliness part comes mostly from the Kurzgesagt video on loneliness, as well as some other stuff here and there. The rate of infanticide is roughly correct per a quick Google search. The rest of the tribal stuff is from a variety of books and high school social anthropology. I think I actually have the "reasoning for infanticide" part from Sex at Dawn, of all places.

I'm always scared to run a deep research service to find the counterpoints after I type this kind of stuff out, but feel free to do so for me and dress me down. At least survivorship bias is a classic that's pretty much always worth keeping in mind on any topic.


Well, to be fair, judging by the shift in the general vibes of the average HN comment over the past 3 years, better use of agents and advanced models DID solve the previous temporary setbacks. The techno-optimists were right, and the nay-sayers wrong.

Over the course of about 2 years, the general consensus has shifted from "it's a fun curiosity" to "it's just better stackoverflow" to "some people say it's good" to "well it can do some of my job, but not most of it". I think for a lot of people, it has already crossed into "it can do most of my job, but not all of it" territory.

So unless we have finally reached the mythical plateau, if you just go by the trend, in about a year most people will be in the "it can do most of my job but not all" territory, and a year or two after that most people will be facing a tool that can do anything they can do. And perhaps if you factor in optimisation strategies like the Karpathy loop, a tool that can do everything but better.

Upper management might be proven right.


If self-driving is any indication, it may take 10+ years to go from 90% to 95%.




Your definition of a glorified autocomplete is … oof. So in short, "try asking it to do something you'd hate, on bad code you'd yourself fail at, and it might fail".

And I’m pretty sure I could try Claude on a repo as you describe and it wouldn’t in fact fail. You’re letting your opinions of what LLMs were like a few months ago influence what you think of them now.

Comments like yours really annoy me because they are ridiculously confident about AI being “glorified autocomplete”, but also clearly not informed about the capabilities. I don’t get how some people can be on HN and not actually … try these things, be curious about them, try them on hard problems.

I’m a good engineer. I’ve coded for 24 years at this point. Yesterday in 45 minutes I built a feature that would have taken me three months without AI. The speed gains are obscene and because of this, we can build things we would never have even started before. Software is accelerating.


Alternatives naturally become more viable over time as more and more people find car use impossible, but it's kind of hard to tell in advance which lanes of public transport are most necessary to improve. So imo the best solution is just to do it, and then see what happens and adapt. It's too hard to plan out everything in advance, and if you try you get deadlocked politically and nothing ends up happening. So you just find the best lever you can to reduce traffic immediately, and start pressing it. But you warn everyone that you're pressing it, and when you do so you do it slowly.

The reality is that a lot of traffic is simply unnecessary, and dissipates once you add some friction. The most extreme example of that is the rise of remote work during and after Covid. As it turns out, none of these people actually needed to go anywhere.

And more generally, cars induce their own demand simply by virtue of being the fastest and most comfortable option, and they shape the environment around them to depend on them. Small local shops get outcompeted by distant behemoths due to it being more convenient to drive. People move to a large house in a distant suburb rather than a small apartment because they know it's just thirty minutes away from work by car anyway. The easier it is to drive, the more entrenched driving becomes. And any way you slice it, undoing that process will cause pain, so you might as well go ahead and start, because you're never going to find a way to prevent the consequences anyway.


This is a nice case study of the downside of creating explicit policies of "no AI comments" without a technical method of enforcing them. I am sure Hacker News comment quality will suffer almost as much from an escalating culture of accusation and paranoia as it will from LLM comments themselves.


Without a technical means to enforce this, the only result of this policy will be a culture of paranoia and a lot of false positives.


I'll kindly disagree. Even I, as someone who doesn't use any "Chat" tools from the big three, can feel when something is AI generated. We're slowly being educated into detecting it. This is why the human brain is awesome.

Every model, every computer generation has a subtle signature, and we (as in humans) can understand it.

Moreover, this is a very human-enforced place. Many of us already don't like being answered by a bot here, so the community itself is a deterrent. Plus, having an official guideline will multiply that deterrent.

Not everything is lost. Have some faith in your fellow humans.


To be fair to the companies, the machine was pretty hard to make, and expensive. It's not exactly unreasonable to charge for it.


I don't begrudge the companies charging to use their machine, but writing all the code and prose and everything else that the machine ingested was hard too.


If I build a museum full of stolen art it can still be hard to make and expensive. It would be entirely unreasonable to charge for it.

