Hacker News | HarHarVeryFunny's comments

The Z80 itself was "inspired" by the 8080, notably having dual 8080 register sets. It might be regarded as a "clear" (sic) room reimplementation/enhancement of the 8080, given that it was the same 8080 designers who left Intel to found Zilog and create the Z80.

Perhaps that's why the title said "clear" room?

At first I thought it was a brain slip in the HN title; then I saw TFA also said "clear", so I thought it was perhaps a sarcastic jab at the original "clean" room story it is commenting on, but maybe in the end it's just an error?

In any case, an interesting experiment.


It would also be interesting to see how well the best open-weights models, such as Kimi K2.5, can do on a task like this with the same prompting to first gather specs, etc.

In fact, this would make for an interesting benchmark - writing entire non-trivial apps based on the same prompt. Each model might be expected to write and use its own test cases, but then all could be judged against a common set of test cases provided as part of the benchmark suite.
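A minimal sketch of what that judging step could look like. Everything here is hypothetical scaffolding (the function names, the callable stand-ins for generated apps, the suite format are all made up for illustration); the one idea taken from the comment is that models may develop against their own tests, but the final ranking uses one shared suite:

```python
# Hypothetical benchmark harness: each model's generated app is represented
# here as a callable (a real harness would execute the generated program).
# All submissions are scored against the SAME shared (input, expected) suite.

def score_app(app, common_suite):
    """Return the fraction of shared test cases the app passes."""
    passed = 0
    for test_input, expected in common_suite:
        try:
            passed += app(test_input) == expected
        except Exception:
            pass  # a crashing app simply fails that test case
    return passed / len(common_suite)

def rank_models(apps, common_suite):
    """Rank {model_name: app} by common-suite score, best first."""
    scores = {name: score_app(app, common_suite) for name, app in apps.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: two stand-in "apps" for a calculator task.
suite = [("1+1", "2"), ("2*3", "6")]
apps = {
    "model_a": lambda s: str(eval(s)),  # a correct submission
    "model_b": lambda s: "0",           # a broken submission
}
ranking = rank_models(apps, suite)
```

The key design point is that a model's own tests play no role in scoring, which removes the incentive to write tests that its own app happens to pass.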


I think that kind of inertia mostly lasts as long as there is no financial incentive to move. A ChatGPT user who is not paying anything to OpenAI is of little benefit to them, and has little incentive to switch. However if OpenAI start trying to make money off those users by adding advertising, or removing the free tier, then things may change. Google can afford to subsidize chat from their other revenue streams, but OpenAI can't.

>However if OpenAI start trying to make money off those users by adding advertising, or removing the free tier, then things may change.

Tech forums tend to be in a bit of a bubble. People said the same thing about Netflix, and its ad-supported tier quickly became its most popular subscription. People don't care about advertising unless it's really obnoxious.

The idea that people will unsubscribe en masse once OpenAI starts rolling out ads is a pipe dream. And the kind of user who won't pay and won't suffer some ads is the kind of user nobody wants.


Customers come back to Netflix since they have the best content out of all the streaming providers. This is their moat.

ChatGPT, on the other hand, is effectively identical to its competitors for the most common use cases.


Customers stay at Netflix because it's cheap, it's what they're used to, and it has enough in the catalogue to keep people satisfied most of the time. They're not constantly evaluating who has the better catalogue. And most of that catalogue is content Netflix has no real ownership of anyway - at least until the WB buyout is finalized.

And Netflix is hardly the only example. Like clockwork, people here say the same thing about any service that introduces ads, with the same result: no one cares.

This is just one of those things that is popular to say in these kinds of forums but has little bearing on real life. Most people are sticky with products they're satisfied with. They don't switch unless a competitor is:

- much cheaper

- much better

Neither of these is the case in the LLM consumer space. Nobody cares or notices that Gemini topped the benchmarks for a couple of months before being dethroned, and as far as new features and improvements are concerned, OpenAI is the clear leader. All everyone did, and still does, is follow their lead, even down to the pricing model. Basically every feature/model improvement you can think of in the LLM consumer space is something OpenAI brought first, and they get almost all the buzz from it.


There are all sorts of scenarios one could imagine - maybe your neighbor works at the car wash and will drive your car there and meet you after you've walked there, etc. But part of having human-level intelligence, which is what LLMs are striving for, is being able to handle questions (more importantly real-world ones, not just "gotcha" puzzles) in human-like fashion, and having a good enough "theory of mind" to read between the lines when someone asks a question - to understand that they've most likely included all the relevant information, beyond what you would automatically assume, as part of the question.

The only good answers to the car wash questions are either a) "well, duh, drive, since you're gonna need your car there to wash it" (or just "drive", recognizing this as a logic/gotcha puzzle, with no explanation required), or b) "is there something you are not telling me here that makes walking, leaving your car at home, a viable option when the goal is to have your car at the car wash to wash it?".


Sure, if an open ended response was allowed, but if it was a multiple choice question then you'd have to use your common sense and pick one.

However, the important issue here really isn't about the ability of humans or LLMs to recognize logic puzzles. If you were asking an LLM for real world advice, trying to be as straightforward as possible, you may still get a response just as bad as "walk", but not be able to recognize that it was bad, and the reason for the failure would be exactly the same as here - failure to plan and reason through consequences.

It's toy problems like this that should make you step back once in a while and remind yourself of how LLMs are built and how they are therefore going to fail.


I highly doubt that more than a tiny fraction of the human failures are due to having misunderstood the question. Much more likely, the human failures are for the same reason the LLMs are failing: failure to reason, instead spitting out a surface-level pattern-match type of answer.

This doesn't exonerate the LLMs, though. The 30% of humans who fail on this have presumably found their niche in life and are not doing jobs where much reasoning is required. Unlike LLMs, they are not expected to design complex software or make other business-critical decisions.


Maybe relevant to this is that today Dario Amodei is meeting with Pete Hegseth in what Hegseth is describing as a "shit or get off the pot" meeting, with one of the issues being that Hegseth is unhappy with Amodei's unwillingness to have Anthropic models used to make autonomous (no human in the loop) life or death decisions.

Maybe Hegseth should be reading this thread, and/or doing a little reading up on paperclip production maximization.


Fundamentally, the failure here is one of reasoning/planning: not reasoning about the implicit requirements (in this case extremely obvious - in order to wash my car at the car wash, my car needs to be at the car wash) to directly arrive at the right answer, and/or not analyzing the consequences of a candidate answer before offering it.

While this is a toy problem, chosen to trick LLMs given their pattern matching nature, it is still indicative of their real world failure modes. Try asking an LLM for advice in tackling a tough problem (e.g. bespoke software design), and you'll often get answers whose consequences have not been thought through.

In a way the failures on this problem, even notwithstanding the nature of LLMs, are a bit surprising, given that this type of problem statement practically screams (at least to a human) that it is a logic test; yet most of the LLMs still can't help themselves and just trigger off the "50m drive vs walk" aspect. It is reminiscent of the "farmer crossing the river by boat in fewest trips" type of problem that used to be popular for testing LLMs, where a common failure was to generate a response matching the pattern of ones seen during training (first cross with A and B, then return with X, etc.), but with the semantics lacking, because of a failure to analyze the consequences of what it was suggesting (and/or a failure to plan better in the first place).
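For contrast, here is a sketch of what "analyzing the consequences" looks like mechanically, using the classic farmer/wolf/goat/cabbage variant of the river puzzle (the exact puzzle wording in the original tests may have differed). A plain breadth-first search never emits a move without first checking that the resulting state is safe - precisely the step the pattern-matched answers skip:

```python
from collections import deque

ITEMS = ("wolf", "goat", "cabbage")

def safe(state):
    """A state is unsafe if the goat is left without the farmer
    alongside the wolf (goat gets eaten) or the cabbage (cabbage gets eaten)."""
    farmer, wolf, goat, cabbage = state
    return not (goat != farmer and (wolf == goat or cabbage == goat))

def solve():
    """BFS over states (farmer, wolf, goat, cabbage), 0 = near bank, 1 = far.
    Returns the shortest list of crossings, each labeled by the passenger."""
    start, goal = (0, 0, 0, 0), (1, 1, 1, 1)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        # The farmer crosses alone, or with one item on his own bank.
        for i in (None, 1, 2, 3):
            if i is not None and state[i] != farmer:
                continue
            nxt = list(state)
            nxt[0] ^= 1
            if i is not None:
                nxt[i] ^= 1
            nxt = tuple(nxt)
            # Consequence check BEFORE committing to the move:
            if safe(nxt) and nxt not in seen:
                seen.add(nxt)
                label = "alone" if i is None else ITEMS[i - 1]
                queue.append((nxt, path + [label]))
    return None
```

The point is not that LLMs should run BFS internally, but that any valid plan must pass the equivalent of the `safe()` check at every step; responses that pattern-match the surface form of a known solution skip exactly that verification.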


I wasn't aware that Antigravity personal provides free access to Opus/Sonnet! Maybe this is just for a limited time, but it's certainly to be taken advantage of! Thanks!
