> Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates.
Ok, I can buy this
> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.
when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?
If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model," then I fail to see how you are doing anything fundamentally different from prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts.
At this point, given the non-deterministic nature and hallucination, context engineering is pretty much magic. But here are our findings.
1 - LLMs tend to pick up and understand context that comes in the top 7-12 lines. Mostly, the first 1k tokens are best understood by LLMs (tested on Claude and several open-source models), so the most important context, like parsing rules, needs to be placed there.
2 - Keep the context short. Whatever context limit they claim is not true. They may have a long context window of 1M tokens, but on average only up to ~10k tokens have good accuracy and recall; the rest is just bunk, so ignore it. Write the prompt, then try compressing/summarizing it without losing key information, either manually or with an LLM.
3 - If you build agent-to-agent orchestration, don't build agents with long context and multiple tools. Break them down into several agents with different sets of tools, then put a planning agent in front that solely does handover.
4 - If all else fails, write the agent handover logic in code - as it always should be.
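To illustrate point 4, here's a minimal sketch of handover logic written as plain code instead of an LLM planner. All agent names and routing rules are invented for the example; the point is just that the routing itself is deterministic and never touches a model:

```python
# Hypothetical sketch: deterministic handover in plain code.
# Agent names and keyword rules below are made up for illustration.

def handover(task: str) -> str:
    """Route a task to a specialist agent based on simple rules."""
    rules = {
        "invoice": "billing_agent",
        "refund": "billing_agent",
        "parse": "parser_agent",
        "extract": "parser_agent",
    }
    for keyword, agent in rules.items():
        if keyword in task.lower():
            return agent
    return "fallback_agent"  # deterministic default, no LLM call

print(handover("Parse the attached PDF"))   # parser_agent
print(handover("Customer wants a refund"))  # billing_agent
```

In a real system the rules would be richer, but the failure mode is the same either way: when the planner is code, a bad handover is a reproducible bug rather than a roll of the dice.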
From building 5+ agent-to-agent orchestration projects in different industries using AutoGen + Claude - that is the result.
Based on my testing, the larger the model, the better it is at handling larger contexts.
I tested with an 8B model, a 14B model, and a 32B model.
I wanted it to create structured JSON, and the context was quite large - around 60k tokens.
The 8B model failed miserably despite supporting 128k context, the 14B did better, and the 32B one almost got everything correct. However, when jumping to a really large model like grok-3-mini, it got it all perfect.
The 8B, 14B, and 32B models I tried were Qwen 3. I disabled thinking on all the models I tested.
Now for my agent workflows I use small models for most of the workflow (it works quite nicely) and only use larger models when the problem is harder.
That is true too. But I found Qwen3 14B with 8-bit quant fares better than 32B with 4-bit quant, both with KV cache at 8-bit. (I enabled thinking; I will try with /nothink.)
I have uploaded entire books to the latest Gemini and had the model reliably accurately answer specific questions requiring knowledge of multiple chapters.
I think it works for info but not so well for instructions/guidance. That's why the standard advice is instructions at the start and repeated at the end.
Or, under the covers, it's just putting all the text you fed in into a RAG database and doing embedding search to retrieve relevant snippets and answer your questions when asked directly. Which is a different approach than recalling instructions.
That’s pretty typical, though not especially reliable. (Although in my experience, Gemini currently performs slightly better than ChatGPT for my case.)
In one repetitive workflow, for example, I process long email threads, large Markdown tables (which is a format from hell), stakeholder maps, and broader project context, such as roles, mailing lists, and related metadata. I feed all of that into the LLM, which determines the necessary response type (out of a given set), selects appropriate email templates, drafts replies, generates documentation, and outputs a JSON table.
It gets it right on the first try about 75% of the time, easily saving me an hour a day - often more.
Unfortunately, 10% of the time, the responses appear excellent but are fundamentally flawed in some way. Just so it doesn't get boring.
Try reformatting the data from the markdown table into a JSON or YAML list of objects. You may find that repeating the keys for every value gives you more reliable results.
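The reformatting suggested above can be sketched in a few lines. This is an illustrative parser, not a robust one (it assumes a simple well-formed table with no escaped pipes), and the column names are invented for the example:

```python
# Illustrative sketch: convert a simple Markdown table into a list of
# objects, so every value carries its key when fed to the model.
import json

def markdown_table_to_records(md: str) -> list[dict]:
    lines = [line.strip() for line in md.strip().splitlines()]
    header = [h.strip() for h in lines[0].strip("|").split("|")]
    records = []
    for row in lines[2:]:  # skip the |---|---| separator line
        values = [v.strip() for v in row.strip("|").split("|")]
        records.append(dict(zip(header, values)))
    return records

table = """
| name  | role     |
|-------|----------|
| Alice | reviewer |
| Bob   | approver |
"""
print(json.dumps(markdown_table_to_records(table), indent=2))
```

Each record repeats the keys (`"name": "Alice", "role": "reviewer"`), which keeps values anchored to their columns even when the model only attends to part of the context.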
Mind if I ask how you’re doing this? I have uploaded short stories of <40,000 words in .txt format and when I ask questions like “How many chapters are there?” or “What is the last sentence in the story?” it gets it wrong. If I paste a chapter or two at a time then ask, it works better, but that’s tedious…
It's magical thinking all the way down. Whether they call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.
>Whether they call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.
I don't quite follow. Prompts and contexts are different things. Sure, you can get things into the context with prompts, but that doesn't mean they are entirely the same.
You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.
For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.
Both the context and the prompt are just part of the same input. To the model there is no difference, the only difference is the way the user feeds that input to the model. You could in theory feed the context into the model as one huge prompt.
System prompts don't even have to be appended to the front of the conversation. For many models they are actually modeled using special custom tokens - so the token stream looks a bit like:
<system-prompt-starts>
translate to English
<system-prompt-ends>
An explanation of dogs: ...
The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.
> The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.
I can't find any study that compares putting the same initial prompt in the system role versus in the user role. It is probably just position bias, i.e. the models can better follow the initial input, regardless of whether it is system prompt or user prompt.
Yep, every AI call is essentially just asking it to predict what the next word is after:
<system>
You are a helpful assistant.
</system>
<user>
Why is the sky blue?
</user>
<assistant>
Because of Rayleigh scattering. The blue light refracts more.
</assistant>
<user>
Why is it red at sunset then?
</user>
<assistant>
And we keep repeating that until the next word is `</assistant>`, then extract the bit in between the last assistant tags, and return it. The AI has been trained to look at `<user>` differently to `<system>`, but they're not physically different.
It's all prompt, it can all be engineered. Hell, you can even get a long way by pre-filling the start of the Assistant response. Usually works better than a system message. That's prompt engineering too.
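Pre-filling the assistant turn, as mentioned above, amounts to handing the model a conversation whose last message is a partially written assistant reply that it must continue. This is a sketch of what that message list looks like; some APIs (e.g. Anthropic's Messages API) accept a trailing assistant message like this, though exact behavior varies by provider:

```python
# Sketch of response pre-filling: the final assistant turn is partially
# written by us, and the model continues from where it leaves off.
# The conversation content is invented for illustration.

messages = [
    {"role": "user", "content": "List three primary colors as JSON."},
    # Pre-filled start of the answer: by forcing the reply to begin
    # mid-JSON, the model is nudged into emitting bare JSON with no
    # "Sure, here is..." preamble.
    {"role": "assistant", "content": '{"colors": ['},
]

print(messages[-1]["content"])  # the model's completion is appended here
```

The model can't easily back out of the structure you started for it, which is why prefilling often beats a system-message instruction like "respond only with JSON".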
Yeah, ultimately it's a Make Document Longer machine, and in many cases it's a hidden mad-libs script behind the scenes, where your question becomes "Next the User said", and some regular code is looking for "Next the Computer said" and "performing" it at you.
In other words, there's a deliberate illusion going on where we are encouraged to believe that generating a document about a character is the same as that character being a real entity.
> Sometimes I wonder if LLM proponents even understand their own bullshit.
Categorically, no. Most are not software engineers, in fact most are not engineers of any sort. A whole lot of them are marketers, the same kinds of people who pumped crypto way back.
LLMs have uses. Machine learning has a ton of uses. AI art is shit, LLM writing is boring, code generation and debugging is pretty cool, information digestion is a godsend some days when I simply cannot make my brain engage with whatever I must understand.
As with most things, it's about choosing the right tool for the right task, and people like AI hype folk are carpenters with a brand new, shiny hammer, and they're gonna turn every fuckin problem they can find into a nail.
Also for the love of god do not have ChatGPT draft text messages to your spouse, genuinely what the hell is wrong with you?
“It’s all just tokens in the context window” = “it’s all just fundamental particles,” I think. True, but reductive. Seems key that dude is talking about agentic AI not just chat. I’d revisit the email example in the post.
I always used "prompting" to mean "providing context" in general, not necessarily just clever instructions like people seem to be using the term.
And yes, I view clever instructions like "great grandma's last wish" still as just providing context.
>A given prompt may work poorly, whereas it would have worked quite well earlier.
The context is not the same! Of course the "prompt" (clever last sentence you just added to the context) is not going to work "the same". The model has a different context now.
The term engineering makes little sense in this context, but really... did it make sense for e.g. "QA Engineer" and all the other jobs we tacked it onto, too? I don't think so, so it's kind of arguing after we've been misusing the term for well over 10 years.
Right: for me that's when "prompt engineering"/"context engineering" start to earn the "engineering" suffix: when people start being methodical and applying techniques like evals.
You've heard of science versus pseudo-science? Well..
Engineering: "Will the bridge hold? Yes, here's the analysis, backed by solid science."
Pseudo-engineering: "Will the bridge hold? Probably. I'm not really sure; although I have validated the output of my Rube Goldberg machine, which is supposedly an expert in bridges, and it indicates the bridge will be fine. So we'll go with that."
"prompt engineer" or "context engineer" to me sounds a lot closer to "paranormal investigator" than anything else. Even "software engineer" seems like proper engineering in comparison.
If it's actually validated, according to rigorous principles, it's not a guess, but a system of predictions with a known confidence interval, that allows you to know if you can be sure of something.
Right now, you cannot get that far. And if you happen to... Tomorrow it will be different.
Predicting tides is possible. It requires enormous amounts of data and processing to be sure of it. Right now, we've got tides, but we don't have the data from the satellites. Because the owner is constantly shifting the prompt, for good reasons of their own. So we can't be confident - or we can only be so blindly.
Funny how you use a scientific term to discredit applied statistics. I've built useful non-deterministic systems many times and they had nothing to do with AI. Also, particle physics would like to have a word with you.
Guessing how a non-deterministic system would behave.
Statistics isn't guessing. But it is guessing when the confidence interval is unknowable and constantly shifting. We're not talking relativity, we're talking about throwing pancakes at a wall to tell if there's a person behind it.
Got it...updating CV to call myself a VibeOps Engineer in a team of Context Engineers...A few of us were let go last quarter, as they could only do Prompt Engineering.
I don't buy this. With software engineering you can generally make incremental progress towards your goal. Yes, sometimes you have to scrap stuff, but usually not the entire thing because an LLM spat out pure nonsense.
The state of the art theoretical frameworks typically separates these into two distinct exploratory and discovery phases. The first phase, which is exploratory, is best conceptualized as utilizing an atmospheric dispersion device. An easily identifiable marker material, usually a variety of feces, is metaphorically introduced at high velocity. The discovery phase is then conceptualized as analyzing the dispersal patterns of the exploratory phase. These two phases are best summarized, respectively, as "Fuck Around" followed by "Find Out."
There is only so much you can do with prompts. To go from the 70% accuracy you can achieve with that to the 95% accuracy I see in Claude Code, the context is absolutely the most important, and it’s visible how much effort goes into making sure Claude retrieves exactly the right context, often at the expense of speed.
Why are we drawing a difference between "prompt" and "context" exactly?
The linked article is a bit of puffery that redefines a commonly-used term - "context" - to mean something different from what it's meant so far when we discuss "context windows," seemingly just to generate new hype.
When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.
The reason prompt engineering is going the way of the dodo is because tools are doing more of the drudgery to make a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt plus various other less-LLM-native approaches (like plain old "find").
So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.
It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.
Because the author is artificially shrinking the scope of one thing (prompt engineering) to make its replacement look better (context engineering).
Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and includes anything from few-shot prompting (including question-answer pairs), providing tool definitions and examples, retrieval-augmented generation, and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x), where x is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.
But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.
One crucial difference between prompt and the context: the prompt is just content that is provided by a user. The context also includes text that was output by the bot - in conversational interfaces the context incorporates the system prompt, then the user's first prompt, the LLMs reply, the user's next prompt and so-on.
Here, even making that distinction of prompt-as-most-recent-user-input-only, if we use "context" as it's generally been defined in "context window", then RAG and such are not part of the context itself. They are just things that certain applications might use to enrich the context.
But personally I think a focus on "prompt" that refers to a specific text box in a specific application vs using it to refer to the sum total of the model input increases confusion about what's going on behind the scenes. At least when referring to products built on the OpenAI Chat Completions APIs, which is what I've used the most.
Building a simple dummy chatbot UI is very informative here for de-mystifying things and avoiding misconceptions about the model actually "learning" or having internal "memory" during your conversation. You're just supplying a message history as the model input prompt. It's your job to keep submitting the history - and you're perfectly able to change it if you like (such as rolling up older messages to keep a shorter context window).
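A toy version of that chatbot loop makes the point concrete. The `echo_model` function below is a made-up stand-in for a real model call; the only thing to notice is that the full transcript is re-sent on every turn, because the model itself keeps no state between calls:

```python
# Minimal sketch of a stateless chat loop: the "memory" is just the
# transcript we re-send every turn. echo_model is a placeholder for a
# real API call and is entirely invented.

def echo_model(history: list[dict]) -> str:
    # A real call would send `history` to the model; here we just
    # show that the whole transcript arrives each time.
    return f"(reply to {len(history)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = echo_model(history)  # the entire transcript goes in
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Hello")
chat("What did I just say?")
print(len(history))  # 5: one system turn + 2 user + 2 assistant turns
```

Since the caller owns `history`, it's free to edit it between turns - e.g. summarizing older messages to stay under the context window - which is exactly the rolling-up trick described above.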
> Why are we drawing a difference between "prompt" and "context" exactly?
Because they’re different things? The prompt doesn’t dynamically change. The context changes all the time.
I’ll admit that you can just call it all ‘context’ or ‘prompt’ if you want, because it’s essentially a large chunk of text. But it’s convenient to be able to distinguish between the two so you can talk about the same thing.
There is a conceptual difference between a blob of text drafted by a person and a dynamically generated blob of text initiated by a human, generated through multiple LLM calls that pull in information from targeted resources. Perhaps "dynamically generated prompts" is more fitting than "context", but nevertheless, there is a difference to be teased out, whatever the jargon we decide to use.
> when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?
Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)
There is no objective truth. Everything is arbitrary.
There is no such thing as "accurate" or "precise". Instead, we get to work with "consistent" and "exhaustive". Instead of "calculated", we get "decided". Instead of "defined" we get "inferred".
Really, the whole narrative about "AI" needs to be rewritten from scratch. The current canonical narrative is so backwards that it's nearly impossible to have a productive conversation about it.
If someone asked you about the usages of a particular element in a codebase, you would probably give a more accurate answer if you were able to use a code search tool rather than reading every source file from top to bottom.
For those kinds of tasks (and there are many of them!), I don't see why you would expect something fundamentally different in the case of LLMs.
In my previous job I repeatedly told people that "git grep is a superpower". Especially in a monorepo, but works well in any big repository, really.
To this day I think the same. With the addition that knowing about "git log -S" grants you necromancy in addition to the regular superpowers. Ability to do rapid code search, and especially code history search, make you look like a wizard without the funny hat.
I provided 'grep' as a tool to an LLM (DeepSeek) and it does a better job of finding usages. This is especially true if the code is obfuscated JavaScript.
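Exposing grep to a model usually means two pieces: a tool definition the model sees, and an executor the harness runs. This sketch follows the common JSON-schema function-calling shape; the tool name and fields are invented, and the exact schema format differs between providers:

```python
# Hypothetical sketch of exposing grep as an LLM tool. The schema shape
# follows common function-calling conventions; names are invented.
import subprocess

GREP_TOOL = {
    "name": "grep_codebase",
    "description": "Search the repository for a regex and return matching lines.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Regex to search for"},
        },
        "required": ["pattern"],
    },
}

def run_grep(pattern: str, path: str = ".") -> str:
    # -r: recurse into directories, -n: prefix matches with line numbers.
    # grep exits 1 on "no matches", so we don't treat that as an error.
    result = subprocess.run(
        ["grep", "-rn", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout

print(GREP_TOOL["name"])
```

When the model emits a call to `grep_codebase`, the harness runs `run_grep` and feeds the stdout back as a tool result - so the model reads search hits instead of whole files, which is the whole trick.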
But why not provide the search tool instead of being an imperfect interface between it and the person asking? The only reason for the latter is that you have more applied knowledge in the context and can use the tool better. For any other case, the answer should be “use this tool”.
Because the LLM is faster at typing the input, and faster at reading the output, than I am... the amount of input I have to give the LLM is less than what I have to give the search tool invocations, and the amount of output I have to read from the LLM is less than the amount of output from the search tool invocations.
To be fair it's also more likely to mess up than I am, but for reading search results to get an idea of what the code base looks like the speed/accuracy tradeoff is often worth it.
And if it was just a search tool this would be barely worth it, but the effects compound as you chain more tools together. For example: reading and running searches + reading and running compiler output is worth more than double just reading and running searches.
It's definitely an art to figure out when it's better to use an LLM, and when it's just going to be an impediment, though.
(Which isn't to agree that "context engineering" is anything other than "prompt engineering" rebranded, or has any staying power)
So instead of building a better tool, we're patching the last one with another tool that is not even reliable - it just uses the old one faster.
That reminds me of the first chapter in "The Programmer Brain" by Felienne Hermans. There's an explanation there that confusion when reading code is caused by three things:
- Lack of knowledge: When you don't have the faintest idea of the notation or symbol being used, aka the WHAT.
- Lack of information: When you know the WHAT, but you can't figure out the WHY.
- Lack of processing power: When you have an idea of the WHY, but can't grasp the HOW.
We already have methods and tooling for all the above and they work fine without having to do shamanic rituals.
The reason for the expert in this case (an uninformed person who wants to solve a problem) is that the expert can use metaphors as a bridge for understanding. Just like in most companies, there's the business world (which is heterogeneous) and the software engineering world. A huge part of a software engineer's time is spent translating concepts across the two. And the most difficult part of that is asking questions, and knowing which question to ask, as natural language is so ambiguous.
> Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally distinguishable from "trying and seeing" with prompts
There are many sciences involving non-determinism that still have laws and patterns, e.g. biology and maybe psychology. It's not all or nothing.
Also, LLMs are deterministic, just not predictable. The non-determinism is injected by providers.
Anyway is there an essential difference between prompt engineering and context engineering? They seem like two names for the same thing.
The difference is that "prompt engineering" as a term has failed, because to a lot of people the inferred definition is "a laughably pretentious term for typing text into a chatbot" - it's become indistinguishable from end-user prompting.
My hope is that "context engineering" better captures the subtle art of building applications on top of LLMs through carefully engineering their context.
That's not true in practice. Floating point arithmetic is not commutative due to rounding errors, and the parallel operations introduce non-determinisn even at temperature 0.
It's pretty important when discussing concrete implementations though, just like when using floats as coordinates in a space/astronomy simulator and getting decreasing accuracy as your objects move away from your chosen origin.
What? You can get consistent output on local models.
I can train large nets deterministically too (cuBLAS flags). What you're saying isn't true in practice. Hell, I can also go on the Anthropic API right now and get verbatim static results.
"Hell I can also go on the anthropic API right now and get verbatim static results."
How?
Setting temperature to 0 won't guarantee the exact same output for the exact same input, because - as the previous commenter said - floating point arithmetic is non-commutative, which becomes important when you are running parallel operations on GPUs.
Shouldn't it be the fact that they're non-associative? The reduction kernels combine partial results (like the dot-products in a GEMM or the sum across attention heads) in an order that may change between runs (non-associativity), which can lead to the individual floats being rounded off differently.
It's also the way the model runs. Setting temperature to zero and picking a fixed seed would ideally result in deterministic output from the sampler, but in parallel execution of matrix arithmetic (eg using a GPU) the order of floating point operations starts to matter, so timing differences can produce different results.
Good point. Though sampling generally happens on the CPU in a linear way. What you describe might influence the raw output logits from a single LLM step, but since the differences are only tiny, a well designed sampler could still make the output deterministic (so same seed = same text output). With a very high temperature these small differences might influence the output though, since the ranking of two tokens might be swapped.
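A toy sampler shows what "well designed" means here: the randomness comes from an explicitly seeded generator, so the same logits plus the same seed always pick the same token, and at temperature 0 the sampler degenerates to plain argmax. The logit values are invented for the example:

```python
# Sketch of a seeded sampler: same logits + same seed -> same token.
# At temperature 0 it degenerates to greedy argmax.
import math
import random

def sample_token(logits: list[float], temperature: float, seed: int) -> int:
    if temperature == 0:
        # Greedy: pick the index of the largest logit, no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)  # fixed seed -> reproducible draw
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1

logits = [1.0, 3.5, 2.0]
print(sample_token(logits, 0, seed=42))  # 1 (the argmax)
print(sample_token(logits, 0.8, seed=7) == sample_token(logits, 0.8, seed=7))  # True
```

The caveat from the comment above still applies: if the upstream logits themselves wobble by a few ULPs between runs, a seeded sampler only guarantees identical output as long as those wobbles don't flip the ranking of two near-tied tokens.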
I think the usual misconception is to think that LLM outputs are random "by default". IMHO this apparent randomness is more of a feature rather than a bug, but that may be a different conversation.
This is dependent on configuration, you can get repeatable results if you need them. I know at least llama.cpp and vllm v0 are deterministic for a given version and backend, and vllm v1 is deterministic if you disable multiprocessing.
a = 0.1, b = 0.2, c = 0.3
a * (b * c) = 0.006
(a * b) * c = 0.006000000000000001
If you are running these operations in parallel you can't guarantee which of those orders the operations will complete in.
When you're running models on a GPU (or any other architecture that runs a whole bunch of matrix operations in parallel) you can't guarantee the order of the operations.
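The three-line example above is directly reproducible in any IEEE-754 double-precision language:

```python
# Regrouping the same multiplication changes the rounding, so the two
# results are different doubles - float multiplication is not associative.
a, b, c = 0.1, 0.2, 0.3

left = (a * b) * c
right = a * (b * c)

print(left == right)  # False
print(left, right)
```

So whenever parallel execution changes the grouping of a reduction from run to run, the final bits can change too, even though every individual operation is perfectly deterministic.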
The order of completion doesn't necessarily influence the overall result of a parallelized computation; it depends on how the results are aggregated. For example, to reduce floating point error when summing floating point numbers, you could add a sorting step before the sum and then accumulate from the lowest values to the highest. Then it doesn't matter which value is computed first, because you need them all anyway to sort them, and once they are sorted, the result will always be the same, given the same input values.
So you can see, completion time is a completely orthogonal issue, or can be made one.
And even libraries like tensorflow can be made to give reproducible results, when setting the corresponding seeds for the underlying libraries. Have done that myself, speaking from experience in a machine learning setting.
This is what irks me so often when reading these comments. This is just software inside an ordinary computer; it always does the same thing with the same input, which includes hidden and global state. Stating that they are "non-deterministic machines" sounds like throwing in the towel and thinking "it's magic!". I am not even sure what people actually want to express when they make these false statements.
If one wants to make something give the same answers every time, one needs to control all the variables of input. This is like any other software including other machine learning algorithms.
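The sort-before-summing trick described above is easy to demonstrate: plain left-to-right summation depends on operand order, but sorting first fixes the accumulation order regardless of how the values arrived. The values are arbitrary examples:

```python
# Order-independent reduction: sort the partial results before summing,
# so completion/arrival order no longer affects the rounding.
import random

def deterministic_sum(xs: list[float]) -> float:
    return sum(sorted(xs))  # fixed summation order regardless of input order

values = [0.1, 1e16, -1e16, 0.2, 0.3]
shuffled = values[:]
random.shuffle(shuffled)

# Naive summation is order-sensitive...
print(sum([0.1, 0.2, 0.3]) == sum([0.3, 0.2, 0.1]))  # False
# ...but the sorted reduction gives the same bits either way.
print(deterministic_sum(values) == deterministic_sum(shuffled))  # True
```

This is the sense in which determinism is an engineering choice: the kernel authors have to pay (in sorting or fixed reduction trees) for reproducibility, and most GPU kernels choose speed instead.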
This is like telling a soccer player that no change in practice or technique is fundamentally different than another, because ultimately people are non-deterministic machines.