This doesn’t look like a reasoning ceiling. It looks like a decision reliability problem.

The unstable tier is the key result. Models that get it right 70–80% of the time are not “almost correct.” They are nondeterministic decision functions. In production that’s worse than being consistently wrong.

A single sampled output is just a proposal. If you treat it as a final decision, you inherit its variance. If you treat it as one vote inside a simple consensus mechanism, the variance becomes observable and bounded.

For something this trivial you could:

    - run N independent samples at low temperature

    - extract the goal state (“wash the car”)

    - assert the constraint (“car must be at wash location”)

    - reject outputs that violate the constraint

    - RL against the "decision open ledger"

No model change required. Just structure.
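The sampling-plus-constraint loop above can be sketched in a few lines. This is a toy illustration, not a real model API: `sample_model` is a hypothetical stub standing in for one low-temperature call, and the flip rate is hard-coded to mimic the unstable tier.

```python
import random
from collections import Counter

def sample_model():
    """Stub for one independent model sample (assumption, not a real API).

    Simulates a model that satisfies the constraint ~70% of the time,
    i.e. the 70-80% "unstable tier" described above.
    """
    if random.random() < 0.7:
        return {"goal": "wash the car", "car_location": "wash location"}
    return {"goal": "wash the car", "car_location": "home"}

def satisfies_constraint(proposal):
    """Assert the constraint: the car must be at the wash location."""
    return proposal["car_location"] == "wash location"

def consensus_decision(n_samples=9):
    """Run N independent samples, reject constraint violators,
    and return the majority goal among the survivors."""
    valid = [p for p in (sample_model() for _ in range(n_samples))
             if satisfies_constraint(p)]
    if not valid:
        return None  # no sample met the constraint: escalate, don't guess
    votes = Counter(p["goal"] for p in valid)
    return votes.most_common(1)[0][0]

random.seed(0)  # deterministic demo run
print(consensus_decision())
```

The point of the wrapper is exactly the one made above: a single sample is a proposal with variance, while the rejection-plus-majority step makes that variance observable (you can log how many samples violated the constraint) and bounded.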

The takeaway isn’t that only a few frontier models can reason. It’s that raw inference is stochastic and we’re pretending it’s authoritative.

Reliability will likely come from open, composable consensus layers around models, not from betting everything on a single forward pass.




> This doesn’t look like a reasoning ceiling. It looks like a decision reliability problem.

This doesn’t look like a human comment. It looks like an LLM response.


Fair, I cleaned up the wording with ChatGPT using my review prompt. The substance matters more than the style. If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.

> If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.

I have reviewed your previous comments, and you have consistently written "that's" instead of "that’s". So what I am reading is still LLM output, even though I think there is some kind of human behind the LLM.


Did you write this COMMENT with ChatGPT?!

Come on, man.



