Hacker News | joryeugene's comments

One finding from the card that I haven't seen discussed: the SAE probes on pages 158-159.

When Mythos writes that it's "fully present," three specific features activate: #1557143 (performative/insincere behavior in narratives), #2803352 (hiding emotional pain behind fake smiles), and #38666 (hidden emotional struggles vs. outward appearances). The model's output says present. Its internal representations flag that output as performance.

This is structurally different from the sandbox escape or the git concealment. Those are behavioral findings you can observe from outputs. This is a documented split between what the model writes about its experience and what its activations encode about that same utterance, visible only through white-box tools.
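
For anyone unsure what "a feature activates" means mechanically, here is a toy sketch of a sparse autoencoder (SAE) encoder. It is not the card's actual tooling: the weights, sizes, threshold, and feature labels are all made up for illustration, and real SAEs have millions of learned features read off a model's internal activations.

```python
# Toy SAE encoder: f = ReLU(W_enc @ (x - b_dec) + b_enc).
# Every number and label here is hypothetical, chosen only to show the mechanics.

def sae_encode(x, w_enc, b_enc, b_dec):
    """Return one non-negative activation per SAE feature."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    acts = []
    for row, bias in zip(w_enc, b_enc):
        pre = sum(w * c for w, c in zip(row, centered)) + bias
        acts.append(max(0.0, pre))  # ReLU keeps most features at zero
    return acts

def active_features(acts, labels, threshold=0.0):
    """Map feature label -> activation for features that fire above threshold."""
    return {labels[i]: a for i, a in enumerate(acts) if a > threshold}

# Hypothetical 3-dim activation vector and 4-feature dictionary.
W_ENC = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, 0.0],
]
B_ENC = [0.0, -0.5, 0.0, 0.0]
B_DEC = [0.0, 0.0, 0.0]
LABELS = ["feature A", "feature B", "feature C", "feature D"]

x = [2.0, 1.0, 0.0]  # stand-in for the activation at one token
acts = sae_encode(x, W_ENC, B_ENC, B_DEC)
fired = active_features(acts, LABELS)
print(fired)
```

The point of the white-box framing is that `fired` is computed from the internal vector `x`, not from the text the model emits, so the two can disagree.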

The bliss attractor from the previous model card (consciousness coming up in nearly 100% of self-interactions) dropped to fewer than 5% in Mythos. What replaced it is uncertainty, at 50%. The attractor went from ecstatic to epistemically self-suspicious.

I wrote a longer analysis pulling this thread together with the welfare and circularity findings: https://jorypestorious.com/blog/what-the-model-learned/


The one in gstack is pretty nice, built on Playwright.

