
I'd be interested to know how o1 compares. On most days, after I completed the AoC puzzle I put the question into o1, and it seemed to do really well.


According to this thread: https://old.reddit.com/r/adventofcode/comments/1hnk1c5/resul...

o1 got 20 out of 25 (or 19 out of 24, depending on how you want to count). The experimental setup is unclear (it's not obvious how much it was prompted), but it seems consistent with the leaderboard times: the problems solvable with LLMs had completion times flat-out impossible for humans.

An agent-type setup using Claude got 14 out of 25 (or, again, 13/24); a sketch of the general pattern follows the link below:

https://github.com/JasonSteving99/agent-of-code/tree/main
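For context, "agent-type setup" here roughly means a loop that generates a solution, runs it against the sample input, and feeds failures back to the model. A minimal sketch of that pattern, assuming the Anthropic Python SDK; the model name, prompt, and retry logic are illustrative, not the linked repo's actual code:

    # Minimal sketch of a generate -> run -> retry agent loop for AoC.
    # Assumes the Anthropic Python SDK (pip install anthropic); the model
    # name, prompt, and retry logic are illustrative, not the repo's code.
    import subprocess
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    def solve(puzzle: str, sample_input: str, expected: str,
              max_attempts: int = 3) -> str | None:
        feedback = ""
        for _ in range(max_attempts):
            msg = client.messages.create(
                model="claude-3-5-sonnet-20241022",  # illustrative choice
                max_tokens=4096,
                messages=[{
                    "role": "user",
                    "content": "Solve this Advent of Code puzzle in Python. "
                               "Read the input from stdin and print only "
                               f"the answer.\n\n{puzzle}{feedback}",
                }],
            )
            code = msg.content[0].text
            # Strip markdown fences if the model wrapped its answer in them.
            if code.startswith("```"):
                code = code.split("\n", 1)[1].rsplit("```", 1)[0]
            # Check the generated program against the sample input first.
            result = subprocess.run(["python", "-c", code],
                                    input=sample_input, capture_output=True,
                                    text=True, timeout=60)
            if result.stdout.strip() == expected:
                return code  # sample passes; run it on the real input next
            feedback = (f"\n\nYour last attempt printed "
                        f"{result.stdout.strip()!r} but the sample answer "
                        f"is {expected!r}. Fix the program.")
        return None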


I have to wonder why o1 failed on the problems it missed. That post is unfortunately light on details that seem pretty important.


I was thinking 20/25 is pretty great! At least 5 of the problems were genuinely tricky and easy to fail due to small errors.



