Context Engineering: The Skill That Replaced Prompt Engineering
Two years ago, the job was finding magic words. "Think step by step." "You are an expert." We A/B tested phrasings like medieval alchemists, and sometimes it even worked.
That era is over. The models got good enough that phrasing matters far less than it did. What decides whether an AI feature is useful today is a different question entirely: at the moment the model generates a response, what exactly is in its context window, and is it the right material?
That question has a name now, context engineering, and it has quietly become most of my job.
Product details anonymized. Real engineering patterns.
A concrete failure
On an advertising platform I work on, users chat with an agent that manages their campaigns. Early version: decent system prompt, full set of tools, go. A user asks "why did performance drop last week?"
The agent has no idea which of the user's 40 campaigns they mean. So it either asks clarifying questions (annoying), guesses (dangerous), or calls five tools to list everything before starting the actual work (slow, and it floods its own context with irrelevant campaign data, degrading everything that comes after).
No prompt wording fixes this. The problem is that the relevant context (which campaign, what period, what does this brand consider "good performance") was never in the window.
What we actually built
Three mechanisms, none of them a prompt.
Mentions as context injection. Users can @mention a campaign in the chat, the way you would mention a colleague in Slack. The mention resolves to a structured context block (campaign config, recent metrics, status) injected before the model runs. One interaction, and the ambiguity above disappears. The UI became the context-assembly tool.
Brand context as a tool, not a paste. Every workspace has a brand kit: colors, tone, products, extracted automatically from the company's website. We do not stuff that into every conversation. The agent has a getBrandKit tool and fetches it when the task needs it: generating ad copy, yes; pulling a spend report, no. Context on demand beats context by default.
SQL instead of context-flooding. For analytical questions, the agent used to chain paginated API calls, each response dumping kilobytes of JSON into the window. The data the model needed was a 10-row aggregate; what it got was 4,000 lines of raw entities. We replaced the chain with one tool that runs SQL against the analytics warehouse and returns just the aggregate. The context stays small and dense, and answer quality jumped more than any prompt change ever achieved.
Notice the pattern: every fix was about selection and compression, not wording.
The principles I keep coming back to
Context is a budget, not a bag. Every token competes with every other token for the model's attention. Long context windows made it possible to stuff everything in; they did not make it a good idea. Irrelevant context does not just cost money, it actively degrades output, because the model attends to it.
Curate at the source. The best place to filter context is before it enters the window: a retrieval step that reranks, a tool that aggregates instead of listing, a mention system that lets the user point. The worst place is hoping the model ignores the noise.
Structure beats prose. A labeled block, <campaign id="..."> status: paused, spend_7d: ... </campaign>, outperforms the same information as a paragraph. The model parses structure more reliably, and you can validate structure programmatically.
Recency and position matter. Models attend more to the beginning and end of the window. The critical instruction or the key data goes at the edges; the reference material goes in the middle. Mundane, measurable, and it works.
Write for the next call, too. In agent loops, your tool outputs are tomorrow's context. A tool that returns clean, compact, labeled results makes every subsequent step smarter. A tool that returns raw API dumps poisons the rest of the conversation.
Where prompt engineering went
It did not disappear, it moved. The highest-leverage prompts in our system today are not the system prompt. They are the tool descriptions (which decide whether the agent picks the right tool) and the error messages (which decide whether it recovers from failures). Both are prompts read by the model at the exact moment it makes a decision. We iterate on them constantly. The system prompt, meanwhile, has barely changed in months.
What I would tell myself at the start
Stop polishing the instructions and start auditing the window. For any disappointing output, dump the full context the model actually received and read it like a code review: What is in here that should not be? What is missing? What is in prose that should be structured? What arrived at the wrong position?
Nine times out of ten the fix is there, and it is an engineering fix, not a writing one. Which is good news for us, because retrieval, aggregation, injection and compression are systems problems. That is the job we already know how to do.
Working on a similar AI project? Let's talk about it.