Some subscriptions offer "unlimited tokens" for certain models, e.g. GitHub Copilot can be unlimited for GPT-4o and GPT-4.1 (and, actually, GPT-5 mini!). So I spent some time with those models to see what level of scaffolding and breaking things down (hand-holding) was required to get them to complete a task.
Why would I do that? Well, I wanted to understand more deeply how differences in my prompting might impact the outcomes of the model. I also wanted to get generally better at writing prompts and at controlling context, and to see how models can go off the rails. Just by understanding these patterns better, I feel more confident in general about when and how to use LLMs in my daily work.
I think, in general, understanding not only that earlier models are weaker, but also _how_ they are weaker, is useful in its own right. It gives you an extra tool to use.
I will say, the biggest "weaknesses" I've found are in training data. If you're keeping your libraries up-to-date, and you're using newer methods or functionality from those libraries, models will consistently fail to recognize those new things. For example, Zod v4 came out recently and the older models absolutely fail to understand that it uses some different syntax and methods under the hood. Jest now supports `using` syntax for its spyOn method, and models just can't figure it out. Even with system prompts and direct instructions, the existing training data is just too overpowering.
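To make both cases concrete, here's a minimal sketch, assuming Zod v4's top-level format schemas and the Jest `using` support mentioned above (the `greet` helper is hypothetical):

```ts
import { z } from "zod";
import { expect, jest, test } from "@jest/globals";

// Zod v4: format validators like email are top-level schemas. Older
// models keep emitting the v3-era chained form `z.string().email()`.
const User = z.object({
  email: z.email(), // v4 style
  name: z.string(),
});

// Hypothetical helper under test.
function greet(name: string): void {
  console.log(`hello, ${name}`);
}

test("greet logs a greeting", () => {
  // `using` restores the spy automatically when the scope exits, so no
  // afterEach/mockRestore boilerplate is needed. Models trained on
  // older Jest reliably rewrite this into the manual-restore pattern.
  using spy = jest.spyOn(console, "log").mockImplementation(() => {});
  greet("world");
  expect(spy).toHaveBeenCalledWith("hello, world");
});
```

Both snippets are exactly the kind of thing a model trained before these releases will "correct" back to the old API, no matter what you tell it.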
Find another developer and pair/work together on a project. It doesn't need to be serious, but you should organize it like it is: a breakdown of the tasks needed to accomplish the goal first, and then many small pull requests into the source that can be peer reviewed.
I like reading these types of breakdowns. They really give you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. A lot of context gets used when your agent needs to write across a larger breadth of code areas (e.g. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
That reference you give is pretty dated now; it's based on a talk from August, which is the Beforetimes of the newer models that have given us such a step change in productivity.
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
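In rough pseudocode, that loop looks something like this (the stage names and the `runAgent` function are illustrative stand-ins, not the actual Superpowers API):

```ts
// Sketch of an orchestrator loop; runAgent() is a hypothetical stand-in
// for however the framework dispatches work to a role-specific agent.
declare function runAgent(role: string, input: unknown): Promise<any>;

async function orchestrate(goal: string): Promise<void> {
  const brainstorm = await runAgent("brainstormer", goal);
  const design = await runAgent("architect", brainstorm);
  let plan = await runAgent("planner", design);

  while (!plan.complete) {
    // Developers implement the planned tasks...
    const patches = await Promise.all(
      plan.tasks.map((task: unknown) => runAgent("developer", task))
    );
    // ...reviewers check each patch...
    const reviews = await Promise.all(
      patches.map((patch) => runAgent("reviewer", patch))
    );
    // ...and control loops back to the planner, which checks progress
    // and correctness and re-plans any remaining work.
    plan = await runAgent("planner", { design, patches, reviews });
  }
}
```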
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
I don't think that splitting into subagents that use the same model will really help. I need to clarify this in the post, but the split is 1) so I can use Sonnet to code and save on some tokens and 2) so I can get other models to review, to get a different perspective.
It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.
I'm considering using subagents, as a way to manage context and delegate "simple" tasks to cheaper models (if you want to see tokens burn, watch Opus try fixing a misplaced ')' in a Lisp file!).
I see what you mean w.r.t. different hats; but is it useful to have different tools available? For example, a "planner" having Web access and read-only file access, versus a "developer" having write access to files but no Web access?
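Something like this is what I have in mind; the tool names and the `Agent` shape are hypothetical, just to illustrate the per-role split:

```ts
// Hypothetical per-role tool access: the planner can research but not
// write; the developer can write but has no Web access.
interface Agent {
  role: string;
  model: string;
  tools: string[];
}

const planner: Agent = {
  role: "planner",
  model: "expensive-reasoning-model", // hypothetical model name
  tools: ["web_search", "read_file"],
};

const developer: Agent = {
  role: "developer",
  model: "cheap-coding-model", // hypothetical model name
  tools: ["read_file", "write_file", "run_tests"],
};
```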
Aren't downvotes on this forum restricted to 500+ karma? And how would those compare to flagging? I'd hate for people under 500 karma to think they need to flag a post in order to have it get any attention from moderation. And, given your idea that LLMs help folks write, wouldn't that make the community worse for them?
I should clarify — I disagree with disallowing any comments that used LLMs in the writing. I think comments should be judged on their quality, not on how they were written.
I might agree (don't know) with the idea of limiting new accounts more heavily.
> I disagree with disallowing any comments that used LLMs in the writing.
I think the point here is that the community doesn't want to read AI slop, not that using an LLM to clean up your writing contains some inherent evil that prevents quality.
I don't want to accuse you of strawmanning the argument, but honestly, where did you ever see someone advocating the latter?
Yeah, unfortunately there are bots here that are much better at hiding that and even make language mistakes on purpose.
It's still a small minority of comments, but it's definitely becoming a problem, and just the chance (even if it's a small one) of talking to a bot rather than a human causes inhibition. Finding out that one has been talking to a bot feels like finding out you've been scammed. You invest time and human emotions into something for another human to read, even if it's just a quick HN comment, only to find out that it was all for nothing. It sucks the humanity out of it, and thereby out of oneself. You get tricked into spending your valuable, limited human social energy on soulless machines with an infinite capacity for generating worthless slop, instead of on other humans.
- Learning French and Japanese
- Drinking tea and exploring tea culture
- Playing Geoguessr (or in my case, Geotastic) to see different places in the world and just generally have a fun time figuring out languages, and unique traits of different countries and cultures
- Reading and watching science fiction, lots and lots of it!
- Coding. I really do love writing code.
- Trying to improve my communication skills, and ability to break down and describe tasks. It relates to AI, but also relates to working and interacting with other humans. :)
- Working
- Hanging out with my family
- Eating yummy food that I enjoy
- Doing crosswords
- Playing video games
- Writing
- Rock climbing, other physical activities like walking or doing a quick work out
- Cleaning and fixing the home
These don't solve all of the problems, but they make me feel better about my own life and that positivity helps me interact with others and think more positively about the world and all the cool things in it.
In our org that would not fly. They would be required to break it down. Did you or anyone tell them they need to make it readable for the rest of the team?
I understand where you're coming from, and think there is something missing in your final paragraph that I'm curious to understand. If LLMs do end up improving productivity, what would make them go away? I think automated code generators are here until something more performant supersedes them. So, what in your mind might be possibilities of that thing?
Well, I guess I no longer believe that, long term, all this code generation will make us more productive. At least not how the fan-favorite claude-code currently does it.
I've found some powerful use cases for LLMs, like "explore", but everyone seems misty-eyed that these coding agents can one-shot entire features. I suspect it'll be fine until it's not, and people will get burned by what is essentially trusting these black boxes to barf out entire implementations, leaving trails of code soup.
Worse is that junior engineers can say they're "more productive" but it's now at the expense of understanding what it is they just contributed.
So, sure, more productive, but in the same way that 2010s move fast and break things philosophy was, "more productive." This will all come back to bite us eventually.
It's definitely sort of that. You can run your own server as well, though this comes with its own limitations (and inherently takes away from the desire for more players). Most of the developers have varying goals for the project. When I was working on it, my focus was primarily on making the game as close to the original as possible, using "replays" recorded by dedicated players before the original shut down. It was fun to write code for something that felt like it would give some folks a nostalgia hit.
I think optimally, you'd do something more akin to a "group ironman" with some friends. This guarantees you've got others around.