I'd be interested to know how o1 compares. On many days, after I completed the AoC puzzles, I put the question into o1 and it seemed to do really well.
o1 got 20 out of 25 (or 19 out of 24, depending on how you want to count). The experimental setup is unclear (it's not obvious how much it was prompted), but it seems consistent with leaderboard times, where the problems solvable with LLMs had completion times flat-out impossible for humans.
An agent-type setup using Claude got 14 out of 25 (or, again, 13 out of 24).