> Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates.
Ok, I can buy this
> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.
when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?
If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model," then I fail to see how you are doing anything fundamentally different from prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts.
At this point, given the non-deterministic nature and hallucination, context engineering is pretty much magic. But here are our findings.
1 - LLMs tend to pick up and understand context that comes in the top 7-12 lines. Mostly, the first 1k tokens are best understood by LLMs (tested on Claude and several open-source models), so the most important context, like parsing rules, needs to be placed there.
2 - Keep the context short. Whatever context limit they claim is not true. They may have a long context window of 1M tokens, but on average only up to ~10k tokens have good accuracy and recall; the rest is just bunk, so ignore it. Write the prompt, then try compressing/summarizing it without losing key information, either manually or with an LLM.
3 - If you build agent-to-agent orchestration, don't build agents with long context and multiple tools. Break them down into several agents with different sets of tools, then put a planning agent in front that solely does handover.
4 - If all else fails, write the agent handover logic in code - as it always should be.
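To illustrate point 4, here's a minimal sketch of handover logic written as plain code instead of an LLM planner. All agent names and routing rules are invented for the example; the point is just that the routing itself is deterministic and never touches a model:

```python
# Hypothetical sketch: deterministic handover in plain code.
# Agent names and keyword rules below are made up for illustration.

def handover(task: str) -> str:
    """Route a task to a specialist agent based on simple rules."""
    rules = {
        "invoice": "billing_agent",
        "refund": "billing_agent",
        "parse": "parser_agent",
        "extract": "parser_agent",
    }
    for keyword, agent in rules.items():
        if keyword in task.lower():
            return agent
    return "fallback_agent"  # deterministic default, no LLM call

print(handover("Parse the attached PDF"))   # parser_agent
print(handover("Customer wants a refund"))  # billing_agent
```

In a real system the rules would be richer, but the failure mode is the same either way: when the planner is code, a bad handover is a reproducible bug rather than a roll of the dice.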
From building 5+ agent-to-agent orchestration projects in different industries using AutoGen + Claude - that is the result.
Based on my testing, the larger the model, the better it is at handling larger contexts.
I tested with an 8B model, a 14B model, and a 32B model.
I wanted it to create structured JSON, and the context was quite large - around 60k tokens.
The 8B model failed miserably despite supporting 128k context, the 14B did better, and the 32B one almost got everything correct. However, when jumping to a really large model like grok-3-mini, it got it all perfect.
The 8B, 14B, and 32B models I tried were Qwen 3. I disabled thinking on all the models I tested.
Now for my agent workflows I use small models for most of the workflow (it works quite nicely) and only use larger models when the problem is harder.
That is true too. But I found Qwen3 14B with 8-bit quant fares better than 32B with 4-bit quant, both with KV cache at 8-bit. (I enabled thinking; I will try with /nothink.)
I have uploaded entire books to the latest Gemini and had the model reliably accurately answer specific questions requiring knowledge of multiple chapters.
I think it works for info but not so well for instructions/guidance. That's why the standard advice is instructions at the start and repeated at the end.
Or, under the covers, it's just putting all the text you fed in into a RAG database and doing embedding search to retrieve relevant snippets and answer your questions when asked directly. Which is a different approach than recalling instructions.
That’s pretty typical, though not especially reliable. (Although in my experience, Gemini currently performs slightly better than ChatGPT for my case.)
In one repetitive workflow, for example, I process long email threads, large Markdown tables (which is a format from hell), stakeholder maps, and broader project context, such as roles, mailing lists, and related metadata. I feed all of that into the LLM, which determines the necessary response type (out of a given set), selects appropriate email templates, drafts replies, generates documentation, and outputs a JSON table.
It gets it right on the first try about 75% of the time, easily saving me an hour a day - often more.
Unfortunately, 10% of the time, the responses appear excellent but are fundamentally flawed in some way. Just so it doesn't get boring.
Try reformatting the data from the markdown table into a JSON or YAML list of objects. You may find that repeating the keys for every value gives you more reliable results.
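The reformatting suggested above can be sketched in a few lines. This is an illustrative parser, not a robust one (it assumes a simple well-formed table with no escaped pipes), and the column names are invented for the example:

```python
# Illustrative sketch: convert a simple Markdown table into a list of
# objects, so every value carries its key when fed to the model.
import json

def markdown_table_to_records(md: str) -> list[dict]:
    lines = [line.strip() for line in md.strip().splitlines()]
    header = [h.strip() for h in lines[0].strip("|").split("|")]
    records = []
    for row in lines[2:]:  # skip the |---|---| separator line
        values = [v.strip() for v in row.strip("|").split("|")]
        records.append(dict(zip(header, values)))
    return records

table = """
| name  | role     |
|-------|----------|
| Alice | reviewer |
| Bob   | approver |
"""
print(json.dumps(markdown_table_to_records(table), indent=2))
```

Each record repeats the keys (`"name": "Alice", "role": "reviewer"`), which keeps values anchored to their columns even when the model only attends to part of the context.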
Mind if I ask how you’re doing this? I have uploaded short stories of <40,000 words in .txt format and when I ask questions like “How many chapters are there?” or “What is the last sentence in the story?” it gets it wrong. If I paste a chapter or two at a time then ask, it works better, but that’s tedious…
It's magical thinking all the way down. Whether they call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.
>Whether they call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.
I don't quite follow. Prompts and contexts are different things. Sure, you can get things into the context with prompts, but that doesn't mean they are entirely the same.
You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.
For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.
Both the context and the prompt are just part of the same input. To the model there is no difference, the only difference is the way the user feeds that input to the model. You could in theory feed the context into the model as one huge prompt.
System prompts don't even have to be appended to the front of the conversation. For many models they are actually modeled using special custom tokens - so the token stream looks a bit like:
<system-prompt-starts>
translate to English
<system-prompt-ends>
An explanation of dogs: ...
The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.
> The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.
I can't find any study that compares putting the same initial prompt in the system role versus in the user role. It is probably just position bias, i.e. the models can better follow the initial input, regardless of whether it is system prompt or user prompt.
Yep, every AI call is essentially just asking it to predict what the next word is after:
<system>
You are a helpful assistant.
</system>
<user>
Why is the sky blue?
</user>
<assistant>
Because of Rayleigh scattering. The blue light refracts more.
</assistant>
<user>
Why is it red at sunset then?
</user>
<assistant>
And we keep repeating that until the next word is `</assistant>`, then extract the bit in between the last assistant tags, and return it. The AI has been trained to look at `<user>` differently to `<system>`, but they're not physically different.
It's all prompt, it can all be engineered. Hell, you can even get a long way by pre-filling the start of the Assistant response. Usually works better than a system message. That's prompt engineering too.
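Pre-filling the assistant turn, as mentioned above, amounts to handing the model a conversation whose last message is a partially written assistant reply that it must continue. This is a sketch of what that message list looks like; some APIs (e.g. Anthropic's Messages API) accept a trailing assistant message like this, though exact behavior varies by provider:

```python
# Sketch of response pre-filling: the final assistant turn is partially
# written by us, and the model continues from where it leaves off.
# The conversation content is invented for illustration.

messages = [
    {"role": "user", "content": "List three primary colors as JSON."},
    # Pre-filled start of the answer: by forcing the reply to begin
    # mid-JSON, the model is nudged into emitting bare JSON with no
    # "Sure, here is..." preamble.
    {"role": "assistant", "content": '{"colors": ['},
]

print(messages[-1]["content"])  # the model's completion is appended here
```

The model can't easily back out of the structure you started for it, which is why prefilling often beats a system-message instruction like "respond only with JSON".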
Yeah, ultimately it's a Make Document Longer machine, and in many cases it's a hidden mad-libs script behind the scenes, where your question becomes "Next the User said", and some regular code is looking for "Next the Computer said" and "performing" it at you.
In other words, there's a deliberate illusion going on where we are encouraged to believe that generating a document about a character is the same as that character being a real entity.
> Sometimes I wonder if LLM proponents even understand their own bullshit.
Categorically, no. Most are not software engineers, in fact most are not engineers of any sort. A whole lot of them are marketers, the same kinds of people who pumped crypto way back.
LLMs have uses. Machine learning has a ton of uses. AI art is shit, LLM writing is boring, code generation and debugging is pretty cool, information digestion is a godsend some days when I simply cannot make my brain engage with whatever I must understand.
As with most things, it's about choosing the right tool for the right task, and people like AI hype folk are carpenters with a brand new, shiny hammer, and they're gonna turn every fuckin problem they can find into a nail.
Also for the love of god do not have ChatGPT draft text messages to your spouse, genuinely what the hell is wrong with you?
“It’s all just tokens in the context window” = “it’s all just fundamental particles,” I think. True, but reductive. Seems key that dude is talking about agentic AI not just chat. I’d revisit the email example in the post.
I always used "prompting" to mean "providing context" in general, not necessarily just clever instructions like people seem to be using the term.
And yes, I view clever instructions like "great grandma's last wish" still as just providing context.
>A given prompt may work poorly, whereas it would have worked quite well earlier.
The context is not the same! Of course the "prompt" (clever last sentence you just added to the context) is not going to work "the same". The model has a different context now.
The term engineering makes little sense in this context, but really... did it make sense for e.g. "QA Engineer" and all the other jobs we tacked it onto, too? I don't think so, so it's kind of arguing after we've been misusing the term for well over 10 years.
Right: for me that's when "prompt engineering"/"context engineering" start to earn the "engineering" suffix: when people start being methodical and applying techniques like evals.
You've heard of science versus pseudo-science? Well..
Engineering: "Will the bridge hold? Yes, here's the analysis, backed by solid science."
Pseudo-engineering: "Will the bridge hold? Probably. I'm not really sure; although I have validated the output of my Rube Goldberg machine, which is supposedly an expert in bridges, and it indicates the bridge will be fine. So we'll go with that."
"prompt engineer" or "context engineer" to me sounds a lot closer to "paranormal investigator" than anything else. Even "software engineer" seems like proper engineering in comparison.
If it's actually validated, according to rigorous principles, it's not a guess, but a system of predictions with a known confidence interval, that allows you to know if you can be sure of something.
Right now, you cannot get that far. And if you happen to... Tomorrow it will be different.
Predicting tides is possible. It requires enormous amounts of data and processing to be sure of it. Right now, we've got tides, but we don't have the data from the satellites. Because the owner is constantly shifting the prompt, for good reasons of their own. So we can't be confident - or we can only be so blindly.
Funny how you use a scientific term to discredit applied statistics. I've built useful non-deterministic systems many times and they had nothing to do with AI. Also, particle physics would like to have a word with you.
Guessing how a non-deterministic system would behave.
Statistics isn't guessing. But it is guessing when the confidence interval is unknowable and constantly shifting. We're not talking relativity, we're talking about throwing pancakes at a wall to tell if there's a person behind it.
Got it...updating CV to call myself a VibeOps Engineer in a team of Context Engineers...A few of us were let go last quarter, as they could only do Prompt Engineering.
I don't buy this. With software engineering you can generally make incremental progress towards your goal. Yes, sometimes you have to scrap stuff, but usually not the entire thing because an LLM spat out pure nonsense.
The state of the art theoretical frameworks typically separates these into two distinct exploratory and discovery phases. The first phase, which is exploratory, is best conceptualized as utilizing an atmospheric dispersion device. An easily identifiable marker material, usually a variety of feces, is metaphorically introduced at high velocity. The discovery phase is then conceptualized as analyzing the dispersal patterns of the exploratory phase. These two phases are best summarized, respectively, as "Fuck Around" followed by "Find Out."
There is only so much you can do with prompts. To go from the 70% accuracy you can achieve with that to the 95% accuracy I see in Claude Code, the context is absolutely the most important, and it’s visible how much effort goes into making sure Claude retrieves exactly the right context, often at the expense of speed.
Why are we drawing a difference between "prompt" and "context" exactly?
The linked article is a bit of puffery that redefines a commonly-used term - "context" - to mean something different from what it's meant so far when we discuss "context windows," seemingly just to generate new hype.
When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.
The reason prompt engineering is going the way of the dodo is because tools are doing more of the drudgery to make a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt plus various other less-LLM-native approaches (like plain old "find").
So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.
It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.
Because the author is artificially shrinking the scope of one thing (prompt engineering) to make its replacement look better (context engineering).
Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and includes anything from few-shot prompting (including question-answer pairs), providing tool definitions and examples, retrieval-augmented generation, and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x), where x is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.
But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.
One crucial difference between prompt and the context: the prompt is just content that is provided by a user. The context also includes text that was output by the bot - in conversational interfaces the context incorporates the system prompt, then the user's first prompt, the LLMs reply, the user's next prompt and so-on.
Here, even making that distinction of prompt-as-most-recent-user-input-only, if we use "context" as it's generally been defined in "context window", then RAG and such are not part of the context itself. They are just things that certain applications might use to enrich the context.
But personally I think a focus on "prompt" that refers to a specific text box in a specific application vs using it to refer to the sum total of the model input increases confusion about what's going on behind the scenes. At least when referring to products built on the OpenAI Chat Completions APIs, which is what I've used the most.
Building a simple dummy chatbot UI is very informative here for de-mystifying things and avoiding misconceptions about the model actually "learning" or having internal "memory" during your conversation. You're just supplying a message history as the model input prompt. It's your job to keep submitting the history - and you're perfectly able to change it if you like (such as rolling up older messages to keep a shorter context window).
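A toy version of that chatbot loop makes the point concrete. The `echo_model` function below is a made-up stand-in for a real model call; the only thing to notice is that the full transcript is re-sent on every turn, because the model itself keeps no state between calls:

```python
# Minimal sketch of a stateless chat loop: the "memory" is just the
# transcript we re-send every turn. echo_model is a placeholder for a
# real API call and is entirely invented.

def echo_model(history: list[dict]) -> str:
    # A real call would send `history` to the model; here we just
    # show that the whole transcript arrives each time.
    return f"(reply to {len(history)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = echo_model(history)  # the entire transcript goes in
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Hello")
chat("What did I just say?")
print(len(history))  # 5: one system turn + 2 user + 2 assistant turns
```

Since the caller owns `history`, it's free to edit it between turns - e.g. summarizing older messages to stay under the context window - which is exactly the rolling-up trick described above.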
> Why are we drawing a difference between "prompt" and "context" exactly?
Because they’re different things? The prompt doesn’t dynamically change. The context changes all the time.
I’ll admit that you can just call it all ‘context’ or ‘prompt’ if you want, because it’s essentially a large chunk of text. But it’s convenient to be able to distinguish between the two so you can talk about the same thing.
There is a conceptual difference between a blob of text drafted by a person and a dynamically generated blob of text initiated by a human, generated through multiple LLM calls that pull in information from targeted resources. Perhaps "dynamically generated prompts" is more fitting than "context", but nevertheless, there is a difference to be teased out, whatever the jargon we decide to use.
> when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?
Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)
There is no objective truth. Everything is arbitrary.
There is no such thing as "accurate" or "precise". Instead, we get to work with "consistent" and "exhaustive". Instead of "calculated", we get "decided". Instead of "defined" we get "inferred".
Really, the whole narrative about "AI" needs to be rewritten from scratch. The current canonical narrative is so backwards that it's nearly impossible to have a productive conversation about it.
If someone asked you about the usages of a particular element in a codebase, you would probably give a more accurate answer if you were able to use a code search tool rather than reading every source file from top to bottom.
For those kinds of tasks (and there are many of them!), I don't see why you would expect something fundamentally different in the case of LLMs.
In my previous job I repeatedly told people that "git grep is a superpower". Especially in a monorepo, but works well in any big repository, really.
To this day I think the same. With the addition that knowing about "git log -S" grants you necromancy in addition to the regular superpowers. Ability to do rapid code search, and especially code history search, make you look like a wizard without the funny hat.
I provided 'grep' as a tool to an LLM (DeepSeek) and it does a better job of finding usages. This is especially true if the code is obfuscated JavaScript.
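Exposing grep to a model usually means two pieces: a tool definition the model sees, and an executor the harness runs. This sketch follows the common JSON-schema function-calling shape; the tool name and fields are invented, and the exact schema format differs between providers:

```python
# Hypothetical sketch of exposing grep as an LLM tool. The schema shape
# follows common function-calling conventions; names are invented.
import subprocess

GREP_TOOL = {
    "name": "grep_codebase",
    "description": "Search the repository for a regex and return matching lines.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Regex to search for"},
        },
        "required": ["pattern"],
    },
}

def run_grep(pattern: str, path: str = ".") -> str:
    # -r: recurse into directories, -n: prefix matches with line numbers.
    # grep exits 1 on "no matches", so we don't treat that as an error.
    result = subprocess.run(
        ["grep", "-rn", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout

print(GREP_TOOL["name"])
```

When the model emits a call to `grep_codebase`, the harness runs `run_grep` and feeds the stdout back as a tool result - so the model reads search hits instead of whole files, which is the whole trick.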
But why not provide the search tool instead of being an imperfect interface between it and the person asking? The only reason for the latter is that you have more applied knowledge in the context and can use the tool better. For any other case, the answer should be “use this tool”.
Because the LLM is faster at typing the input, and faster at reading the output, than I am... the amount of input I have to give the LLM is less than what I have to give the search tool invocations, and the amount of output I have to read from the LLM is less than the amount of output from the search tool invocations.
To be fair it's also more likely to mess up than I am, but for reading search results to get an idea of what the code base looks like the speed/accuracy tradeoff is often worth it.
And if it was just a search tool this would be barely worth it, but the effects compound as you chain more tools together. For example: reading and running searches + reading and running compiler output is worth more than double just reading and running searches.
It's definitely an art to figure out when it's better to use an LLM, and when it's just going to be an impediment, though.
(Which isn't to agree that "context engineering" is anything other than "prompt engineering" rebranded, or has any staying power)
So instead of building a better tool, we're patching the last one with another tool that is not even reliable - it just uses the old one faster.
That reminds me of the first chapter in "The Programmer Brain" by Felienne Hermans. There's an explanation there that confusion when reading code is caused by three things:
- Lack of knowledge: When you don't have the faintest idea of the notation or symbol being used, aka the WHAT.
- Lack of information: When you know the WHAT, but you can't figure out the WHY.
- Lack of processing power: When you have an idea of the WHY, but can't grasp the HOW.
We already have methods and tooling for all the above and they work fine without having to do shamanic rituals.
The reason for the expert in this case (an uninformed person who wants to solve a problem) is that the expert can use metaphors as a bridge for understanding. Just like in most companies, there's the business world (which is heterogeneous) and the software engineering world. A huge part of a software engineer's time is spent translating concepts across the two. And the most difficult part of that is asking questions, and knowing which question to ask, as natural language is so ambiguous.
> Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts
There are many sciences involving non-determinism that still have laws and patterns, e.g. biology and maybe psychology. It's not all or nothing.
Also, LLMs are deterministic, just not predictable. The non-determinism is injected by providers.
Anyway is there an essential difference between prompt engineering and context engineering? They seem like two names for the same thing.
The difference is that "prompt engineering" as a term has failed, because to a lot of people the inferred definition is "a laughably pretentious term for typing text into a chatbot" - it's become indistinguishable from end-user prompting.
My hope is that "context engineering" better captures the subtle art of building applications on top of LLMs through carefully engineering their context.
That's not true in practice. Floating point arithmetic is not commutative due to rounding errors, and the parallel operations introduce non-determinisn even at temperature 0.
It's pretty important when discussing concrete implementations though, just like when using floats as coordinates in a space/astronomy simulator and getting decreasing accuracy as your objects move away from your chosen origin.
What? You can get consistent output on local models.
I can train large nets deterministically too (cuBLAS flags). What you're saying isn't true in practice. Hell, I can also go on the Anthropic API right now and get verbatim static results.
"Hell I can also go on the anthropic API right now and get verbatim static results."
How?
Setting temperature to 0 won't guarantee the exact same output for the exact same input, because - as the previous commenter said - floating point arithmetic is non-commutative, which becomes important when you are running parallel operations on GPUs.
Shouldn't it be the fact that they're non-associative? The reduction kernels combine partial results (like the dot-products in a GEMM or the sum across attention heads) in an order that may change between runs (non-associativity), which can lead to the individual floats being rounded off differently.
It's also the way the model runs. Setting temperature to zero and picking a fixed seed would ideally result in deterministic output from the sampler, but in parallel execution of matrix arithmetic (eg using a GPU) the order of floating point operations starts to matter, so timing differences can produce different results.
Good point. Though sampling generally happens on the CPU in a linear way. What you describe might influence the raw output logits from a single LLM step, but since the differences are only tiny, a well designed sampler could still make the output deterministic (so same seed = same text output). With a very high temperature these small differences might influence the output though, since the ranking of two tokens might be swapped.
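A toy sampler shows what "well designed" means here: the randomness comes from an explicitly seeded generator, so the same logits plus the same seed always pick the same token, and at temperature 0 the sampler degenerates to plain argmax. The logit values are invented for the example:

```python
# Sketch of a seeded sampler: same logits + same seed -> same token.
# At temperature 0 it degenerates to greedy argmax.
import math
import random

def sample_token(logits: list[float], temperature: float, seed: int) -> int:
    if temperature == 0:
        # Greedy: pick the index of the largest logit, no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)  # fixed seed -> reproducible draw
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1

logits = [1.0, 3.5, 2.0]
print(sample_token(logits, 0, seed=42))  # 1 (the argmax)
print(sample_token(logits, 0.8, seed=7) == sample_token(logits, 0.8, seed=7))  # True
```

The caveat from the comment above still applies: if the upstream logits themselves wobble by a few ULPs between runs, a seeded sampler only guarantees identical output as long as those wobbles don't flip the ranking of two near-tied tokens.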
I think the usual misconception is to think that LLM outputs are random "by default". IMHO this apparent randomness is more of a feature rather than a bug, but that may be a different conversation.
This is dependent on configuration, you can get repeatable results if you need them. I know at least llama.cpp and vllm v0 are deterministic for a given version and backend, and vllm v1 is deterministic if you disable multiprocessing.
a = 0.1, b = 0.2, c = 0.3
a * (b * c) = 0.006
(a * b) * c = 0.006000000000000001
If you are running these operations in parallel you can't guarantee which of those orders the operations will complete in.
When you're running models on a GPU (or any other architecture that runs a whole bunch of matrix operations in parallel) you can't guarantee the order of the operations.
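The three-line example above is directly reproducible in any IEEE-754 double-precision language:

```python
# Regrouping the same multiplication changes the rounding, so the two
# results are different doubles - float multiplication is not associative.
a, b, c = 0.1, 0.2, 0.3

left = (a * b) * c
right = a * (b * c)

print(left == right)  # False
print(left, right)
```

So whenever parallel execution changes the grouping of a reduction from run to run, the final bits can change too, even though every individual operation is perfectly deterministic.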
The order of completion doesn't necessarily influence the overall result of a parallelized computation; it depends on how the results are aggregated. For example, to reduce floating point error when summing floating point numbers, you could add a sorting step before the sum and then accumulate from the lowest values to the highest. Then it doesn't matter which value is computed first, because you need them all anyway to sort them, and once they are sorted, the result will always be the same, given the same input values.
So you can see, completion time is a completely orthogonal issue, or can be made one.
And even libraries like tensorflow can be made to give reproducible results, when setting the corresponding seeds for the underlying libraries. Have done that myself, speaking from experience in a machine learning setting.
This is what irks me so often when reading these comments. This is just software inside an ordinary computer; it always does the same thing with the same input, which includes hidden and global state. Stating that they are "non-deterministic machines" sounds like throwing in the towel and thinking "it's magic!". I am not even sure what people actually want to express when they make these false statements.
If one wants to make something give the same answers every time, one needs to control all the variables of input. This is like any other software including other machine learning algorithms.
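The sort-before-summing trick described above is easy to demonstrate: plain left-to-right summation depends on operand order, but sorting first fixes the accumulation order regardless of how the values arrived. The values are arbitrary examples:

```python
# Order-independent reduction: sort the partial results before summing,
# so completion/arrival order no longer affects the rounding.
import random

def deterministic_sum(xs: list[float]) -> float:
    return sum(sorted(xs))  # fixed summation order regardless of input order

values = [0.1, 1e16, -1e16, 0.2, 0.3]
shuffled = values[:]
random.shuffle(shuffled)

# Naive summation is order-sensitive...
print(sum([0.1, 0.2, 0.3]) == sum([0.3, 0.2, 0.1]))  # False
# ...but the sorted reduction gives the same bits either way.
print(deterministic_sum(values) == deterministic_sum(shuffled))  # True
```

This is the sense in which determinism is an engineering choice: the kernel authors have to pay (in sorting or fixed reduction trees) for reproducibility, and most GPU kernels choose speed instead.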
This is like telling a soccer player that no change in practice or technique is fundamentally different than another, because ultimately people are non-deterministic machines.