I believe the ChatGPT code has a bug, in that it accepts three spaces or tabs be...

simonw · 2026-02-24T11:44:44 1771933484

100%. That's why if you want good code you need to pay attention to what it's writing and testing and throw feedback like that at it.

eesmith · 2026-02-24T12:20:44 1771935644

My points though are

1) the development isn't actually using red/green TDD, and

2) the result doesn't show "really good results", including not following a very well-defined specification

so doesn't work as a concrete example of your description of what the second chapter is supposed to be about.

Perhaps you could show the process of refining it more, so it actually is spec compliant and tests all the implemented features?

What's the outcome difference between this approach vs. something which isn't TDD, likes test-after with full branch coverage or mutation testing? Those at least are more automatable than manual inspection, so a better fit to agentic coding, yes?

(Of course regular branch coverage doesn't test all the regexp branches, which makes regexp use tricky to test.)

simonw · 2026-02-24T13:35:17 1771940117

Yeah I'm going to ditch those examples and find better ones. I was hoping to illustrate the idea as simply as possible but they're not up to scratch.

eesmith · 2026-02-24T17:05:58 1771952758

I think the problem-to-solve is a good one. The Google Markdown spec is very clear, with plenty of examples, and I think the problem is well-defined.

I've seen entirely too many examples of how to use TDD which give under-specified toy problems, where the solution is annoyingly incomplete for something more realistic.

And I've seen TDD projects which didn't follow the spec, but instead implemented the developers' misconceptions about the spec.

That's exactly what we see here with Markdown, where there's a spec, along with a lot of non-conformant examples in the training set by people who didn't read the spec but instead based it on their experiences in using Markdown.

The code generated by ChatGPT is almost correct. Seeing the process of how to get from that to a valid and well-tested solution would make for a good demonstration of the full process.

I'll again add that showing how to integrate something like branch coverage or hypothesis testing for automatic test suite generation would be really useful.

eesmith · 2026-02-25T10:21:47 1772014907

Will you be updating the text at https://simonwillison.net/guides/agentic-engineering-pattern...?

As it currently says:

> A significant risk with coding agents is that they might write code that doesn't work, or build code that is unnecessary and never gets used, or both.

> Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions.

while the ChatGPT generated code contains bugs, contains unnecessary code which never gets used, and the ChatGPT generated test suite is not robust.

(As an example of unnecessary code which never gets used, _FENCE_RE contains "(?P<info>.*)$" but neither the group name nor the group are used, and the pattern is unneeded -- and all of the tests pass without it.)

Your writings are widely read and influential. I think it's important that you let readers know the results produced in your experiment are not actually a complete example of a "fantastic fit" of Red/Green TDD for coding agents, and to highlight their limitations.

simonw · 2026-02-25T13:18:06 1772025486

I'll be replacing the examples with ones that better illustrate the technique. I dashed off those off in a hurry using the wrong tools (I used ChatGPT and Claude directly, not the Coding agent harnesses Claude Code and Codex) and that was a mistake.

eesmith · 2026-02-25T15:44:55 1772034295

You didn't think they were the wrong tools when you wrote it. You said "this example is simple enough that both Claude and ChatGPT can implement it using their default code environments".

From what I gather, a lot of people are using these code assistance tools because they too are in a hurry, under pressure from management forcing them to go faster with AI, and with limited ability to push back.

You have significantly more experience than most of your readership. Will you be providing guidelines about which tools to avoid for which problems, based on your experience?

Will you use this or something similar as an example of the negative consequence of being in a hurry, hopefully leading to a worked-out example of one might better audit or inspect tool-generated code, and the effort involved?

That would be invaluable for people dealing with overly-optimistic management pressure.

My personal belief is that one of the reasons for TDD's success is as a way for programmers to respond to ill-advised pressure to skimp on testing found in some test-after shops.

That disappears if managers believe instructing an agentic code generator to "use Red/Green TDD" easily ensures a robust automated test suite.

My apologies if you have already done this. I have not followed your work. My interest in this thread is from my views of TDD as a development approach, and the difficulty in generating a test suite which is robust, minimal, understandable, and maintainable.

simonw · 2026-02-25T16:21:25 1772036485

I stand by what I originally wrote: the example was simple enough for ChatGPT and Claude do implement reasonably well.

They didn't implement it well enough for people not to pick them apart though, which is a distraction from the concept I'm trying to demonstrate.

This is honestly the biggest challenge in writing about this stuff, especially if you're doing it in public. Any example is an opportunity for people to find flaws which they might use to undermine the larger point I'm trying to communicate.

I have a visible changelog on each chapter now so people can follow how I evolve them over time. I'll try to find the right balance in terms of illustrative examples. My first attempt at linking directly to the first working transcripts I got clearly isn't it.

eesmith · 2026-02-25T16:42:22 1772037742

I fully agree with the assessment it was "reasonably well".

It is not, however, something equivalent to the product of a disciplined TDD practitioner. Not even close.

You write that test-first development helps protect against two risks of code agents, but what does that mean for your specific example?

How is the final product better than the test-after prompt "Build a Python function to extract headers from a markdown string, then write a complete and robust test suite."

Otherwise, how do you know it's a "fantastic fit for coding agents" or that it gets "better results out of a coding agent"?

simonw · 2026-02-25T19:23:55 1772047435

I know TDD provides better results for coding agents from 6+ months of experience working this, plus confirmation from conversations with other practitioners. TDD is the key methodology used by the popular superpowers set of Claude skills by Jesse Vincent, for example.

I'm not going to be trying to irrefutably prove everything I write about in the Agentic Engineering Patterns book - that would require a credible research team and peer-reviewed papers, and that's not a level of effort I'm willing to put into this.

eesmith · 2026-02-25T21:25:56 1772054756

By your response, I think you've flipped the bozo bit on me. I will try again.

I'm most certainly not asking for irrefutable proof. I'm asking for a concrete example of how you know, in a way that that would inform me and others in your readership:

1) how do the results from a TDD prompt compare to a good quality test-last prompt?

2) following the TDD approach, what are the steps to get from the initial solution, with errors and untested code, to one which passes human code review?

There's a long history of how Postel's Robustness principle combined with the difficulty of following a spec closely results in a fractured and incompatible ecosystem. We have enough deliberate Markdown variants without needing to introduce a new one by happenstance. This informs my belief that something claiming to parse Markdown requires extra attention to the details, beyond what a one-off toy example would need. That's precisely why I think this is a good example problem.

I'm not tracking what's going on with agentic programming. I don't know who Jesse Vincet is or how his Clause skills are relevant. Is the target audience for your book those who know what what those mean, or developers like me who don't?

What I do know very well is what robust tests look like, and what TDD is supposed to look like. I didn't see it in your example, and would very much like to see a full example of a non-trivial problem like this one worked out, and compared to a non-TDD agentic approach.

That level of analysis is missing from almost every TDD example, which tend to use a toy problem to walk through the mechanical details of the red-green step, with little attention to -- or need for -- the refactor part, which is the hardest part of TDD.

I'll also note that I seem to be the only one here who commented about the generated code quality and fitness to task. I mourn that so few care about those details.