Context Engineering: What Comes After Prompt Engineering

Your AI support agent is telling customers they have 90 days to return items. Your actual policy is 30 days. A customer tried to return something on day 45, was told by a human agent that it was too late, and is now furious because “your AI told me I could.” The model didn’t hallucinate the 90-day figure out of nowhere. Something in the system fed it the wrong information, or failed to feed it the right information, or buried the right information where the model couldn’t attend to it. Before you can diagnose what went wrong (which we’ll cover in the next article), you need to understand how the agent’s context was assembled in the first place.

This is Layer 3 of the AI Fluency framework: Product Architecture. Where Layer 1 explained how models work and Layer 2 discussed how to measure whether they’re working, Layer 3 is about the application you build around the model: the retrieval pipeline, the guardrails, the orchestration.

Every product team has access to the same frontier models. The application layer, and specifically the decisions you make about what information reaches the model, is what determines whether your product is useful or not. That is what this article is about. I’ve split the layer across two articles. This one covers design: what information should reach the model, how it should be structured, and why those decisions belong to you. The next covers diagnosis: what to do when it breaks.

This article builds on the foundations from Article 2, so a quick reminder. A model processes text as tokens: roughly one token per four characters of English. The context window is everything the model can see at once: system prompt, conversation history, retrieved documents, and the current query. It is a hard limit, measured in tokens. Everything described in this article competes for space inside that window. And models do not attend equally to all of it: information at the beginning and end of the context is recalled more reliably than information buried in the middle. If those concepts are unfamiliar, read Article 2 first.

From Prompts to Context Engineering

Prompt engineering was about phrasing: “Use a step-by-step approach.” “Think like a financial expert.” “Format your output as JSON.” All of it happened in the final prompt the model saw.

Context engineering asks a different question: what information should enter the model’s context window at all? Not how to phrase the question, but what the model needs to know to answer your question or carry out your task. Retrieved documents. Conversation history. Tool descriptions. Strategic context about your business. Negative examples showing what not to do. All of this happens before the final prompt is written, and all of it competes for the same limited token budget.

A perfectly worded prompt with the wrong context produces poor results. A mediocre prompt with excellent context often outperforms it. The quality ceiling of your AI product is set by what reaches the model, not by how you phrase the question.

Why This Is a Product Decision

GitHub Copilot’s quality depends on which files it pulls into context. Open the right files in your editor and it writes code that fits your codebase. Close them and it produces generic snippets that could belong to any project. The model is the same either way. Someone at GitHub decided that “currently open files” is a useful proxy for “what is relevant to this developer right now.” That is a product question, not an infrastructure question.

Customer support bots illustrate it at the simplest level. A bot that knows your order history, your subscription tier, and your last three support tickets feels like it understands you. A bot running the same model without that context asks you to repeat your account number for the third time. Same model. Different context. Completely different experience.

The pattern holds everywhere you look: Perplexity’s search quality lives in its retrieval, not its model. ChatGPT’s memory feature is a set of product decisions about what facts to persist and when to surface them. In every case, the model is a commodity. Everyone has access to the same frontier models. What makes one product better than another is what reaches the model before it generates.

Context engineering is a product decision because it requires knowing what “relevant” means for your users in your domain. Engineering builds the pipeline. The PM defines what should flow through it.

It is also a prioritisation problem. The context window is a fixed token budget. Should you spend 2,000 tokens on detailed instructions or 2,000 tokens on more retrieved documents? Should you include the last 10 messages of conversation history or compress them to 3? These are trade-offs with direct quality, cost, and latency implications: the same kind you make when you prioritise features for a sprint, except the currency is tokens instead of engineering days.
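The trade-off can be made concrete with some rough arithmetic. The sketch below is illustrative only: the 8,000-token window, the 1,000 tokens reserved for the answer, and every figure in the two hypothetical configurations are assumptions, and estimate_tokens uses the four-characters-per-token rule of thumb rather than a real tokenizer.

```python
# Rough token budgeting. All numbers here are hypothetical.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English (a heuristic)."""
    return len(text) // 4

def tokens_left_for_documents(window: int, reserved_for_answer: int,
                              fixed_parts: dict[str, int]) -> int:
    """Tokens still available for retrieved documents once fixed parts are counted."""
    return window - reserved_for_answer - sum(fixed_parts.values())

# Two hypothetical configurations that differ only in instruction length.
lean = {"instructions": 500, "examples": 1_000, "history": 1_500, "tools": 300}
detailed = {**lean, "instructions": 2_500}

for name, parts in [("lean", lean), ("detailed", detailed)]:
    left = tokens_left_for_documents(8_000, 1_000, parts)
    print(name, "->", left, "tokens left, room for", left // 500, "documents of 500 tokens")
```

Under these assumptions, the extra 2,000 tokens of instructions cost four 500-token documents. Whether that is a good trade depends on what those documents would have contributed.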

Model providers publicise their context windows like a spec war: 200k tokens, 1 million tokens, bigger every quarter. It is tempting to assume that context engineering will stop mattering once the window is large enough to fit everything. It won’t. Bigger windows cost more per request, respond more slowly, and attend less reliably to information buried in the middle. Researchers call this degradation pattern context rot: model performance declines as the window fills, and the pattern of what gets lost shifts depending on how full the window is. Below 50% capacity, the model loses information in the middle. Above 50%, it starts losing the earliest tokens entirely, favouring only the most recent input. A 200k-token window filled indiscriminately performs worse than a 30k-token window filled with the right information in the right order. The constraint was never the size of the window. It is the relevance of what you put in it.
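One way to act on “the right order” is to keep the strongest material away from the middle of the context. The sketch below is a simple heuristic, not an established rule: it assumes chunks arrive already ranked best-first, and order_for_edges is a hypothetical helper name.

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Place the highest-ranked chunks at the start and end, weakest in the middle.

    `ranked` is assumed to be sorted best-first. Even positions fill the front,
    odd positions fill the back, and the back half is reversed so that the
    second-best chunk ends up last.
    """
    front: list[str] = []
    back: list[str] = []
    for position, chunk in enumerate(ranked):
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here A (the best chunk) opens the context, B (second best) closes it, and E (the weakest) sits in the middle where recall is least reliable.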

The Five Types of Context

Everything that fills a context window falls into one of five categories. Each competes for the same limited space. Each is a product decision.

  Context Window (limited tokens)
  ════════════════════════════════

  1. Instructions    ~500-1500 tokens
     Role, objective, constraints, format

  2. Examples         ~200-500 tokens per example
     Positive and negative demonstrations

  3. Knowledge        Variable
     Domain facts, task context, structured data

  4. Memory           Varies
     Short-term (chat history), long-term (stored)

  5. Tools            ~50-300 tokens per tool
     Function definitions, parameter specs

  All five compete for the same token budget.
  Every token spent on one is unavailable to others.

1. Instructions

Instructions tell the model who it is and what it should do. They include:

Role: “You are a market research analyst.” Encourages the model to adopt a persona and reasoning pattern.
Objective: Why this task matters. What success looks like. A well-written objective improves the model’s autonomy by giving it strategic context, not just mechanical steps.
Requirements: The steps, conventions, constraints. “Respond in JSON. Always cite sources. Do not speculate.”

Instructions are non-negotiable. Every system prompt needs them. But instructions are also expensive: a detailed system prompt easily consumes 500-1500 tokens. The tradeoff is simple and unavoidable: the more you specify, the less room you have for the other four types of context.
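As a small illustration of the three parts, the sketch below builds a system prompt from a role, an objective, and a list of requirements. The function name, the layout of the prompt, and the objective text are assumptions made for the example; the role and requirements are the ones quoted above.

```python
def build_system_prompt(role: str, objective: str, requirements: list[str]) -> str:
    """Join role, objective, and requirements into one plain-text system prompt."""
    lines = [f"Role: {role}", f"Objective: {objective}", "Requirements:"]
    lines += [f"- {item}" for item in requirements]
    return "\n".join(lines)

prompt = build_system_prompt(
    role="You are a market research analyst.",
    # Hypothetical objective, written for this example only.
    objective="Help the product team decide which segment to target next quarter.",
    requirements=["Respond in JSON.", "Always cite sources.", "Do not speculate."],
)
print(prompt)
print("approx. tokens:", len(prompt) // 4)  # 4-characters-per-token rule of thumb
```

Printing the approximate token count next to the prompt keeps the cost of each added requirement visible while the prompt is being written.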

2. Examples

Examples show the model what you want by demonstrating it. Positive examples (“here’s a good answer”). Negative examples (“here’s what to avoid”).

Examples are more powerful than instructions alone. Three to five well-chosen examples often outperform ten pages of written guidance. But they cost tokens: each example might be 200-500 tokens.

The product decision is not whether to include examples, but which examples matter most for your use case. An example that shows the model how to cite sources correctly prevents hallucination more reliably than an instruction that says “always cite sources.”

3. Knowledge

Knowledge is external context: domain information, business strategy, market facts, system architecture, workflow procedures, structured data from your database.

A customer support agent needs knowledge about your return policy. A recruitment assistant needs knowledge of your company values. A contract reviewer needs access to your standard terms.

Knowledge fills the space between instructions (which are generic) and the specific task (which is user-generated). It’s the most variable category, because different use cases need different knowledge. A financial advisor needs market data. A healthcare chatbot needs clinical guidelines. A product manager assistant needs your roadmap and strategy docs.

4. Memory

Memory divides into short-term (conversation history within a session) and long-term (facts and preferences stored in a database and retrieved when relevant).

Short-term memory is often automatic: for example, your orchestration layer appends the last five messages. Long-term memory is a product decision: what user preferences should you remember? Should you save conversation summaries? Do you store extracted facts that help the model make better future decisions?

Memory enables continuity. Without it, each interaction is stateless. With it, the model can reference earlier conversations, build on previous context, and adapt to individual users.
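A minimal sketch of short-term memory management, assuming a policy of keeping the most recent messages verbatim and collapsing older ones into a single summary. The Message type, build_history, and summarise are hypothetical names; summarise is a stand-in that truncates text, where a real system would more likely call a model.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # e.g. "user", "assistant", or "system"
    content: str

def summarise(messages: list[Message]) -> str:
    """Placeholder summary: the first 60 characters of each older message."""
    return " | ".join(f"{m.role}: {m.content[:60]}" for m in messages)

def build_history(messages: list[Message], keep_last: int = 5) -> list[Message]:
    """Keep the last `keep_last` messages verbatim; fold the rest into one summary."""
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    if not older:
        return list(recent)
    summary = Message("system", "Summary of earlier conversation: " + summarise(older))
    return [summary] + recent

conversation = [Message("user" if i % 2 == 0 else "assistant", f"message {i}")
                for i in range(8)]
for m in build_history(conversation, keep_last=5):
    print(m.role, "->", m.content)
```

The value of keep_last is the product decision: a larger value preserves more detail and spends more tokens on every request.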

5. Tools

Tools are function definitions that the model can invoke. A tool description is a micro-prompt: name, description of what it does, parameter definitions, examples of use.

A tool description is product specification work. The difference between an agent that reliably calls the right tool and one that thrashes between wrong tools is often the clarity of the tool description.

Too many tools create noise. Forty-nine tools in an Atlassian MCP server consume 1,387 tokens just describing them. If your agent only needs the “create ticket” function, the other forty-eight tools are pure waste. Limiting tools to only what an agent needs is a core context engineering decision.
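The sketch below shows the idea of exposing only the tools an agent needs. The three tool definitions are hypothetical and use a simplified shape rather than any provider’s actual schema, and the token figures it prints are rough estimates from the four-characters-per-token heuristic.

```python
import json

# Hypothetical tool definitions in a simplified, provider-neutral shape.
ALL_TOOLS = [
    {"name": "create_ticket",
     "description": "Create a new support ticket for the current customer.",
     "parameters": {"title": "string", "body": "string", "priority": "low | medium | high"}},
    {"name": "search_tickets",
     "description": "Search existing tickets by keyword and status.",
     "parameters": {"query": "string", "status": "open | closed | any"}},
    {"name": "archive_project",
     "description": "Archive a project and all of its tickets.",
     "parameters": {"project_id": "string"}},
]

def estimate_tokens(obj) -> int:
    """Rough cost of describing tools: serialised length divided by 4."""
    return len(json.dumps(obj)) // 4

def select_tools(tools: list[dict], allowed: set[str]) -> list[dict]:
    """Return only the tool definitions whose names are on the allow-list."""
    return [tool for tool in tools if tool["name"] in allowed]

exposed = select_tools(ALL_TOOLS, {"create_ticket"})
print("all tools:", estimate_tokens(ALL_TOOLS), "tokens (rough)")
print("exposed subset:", estimate_tokens(exposed), "tokens (rough)")
```

An allow-list per agent also narrows the set of actions the model can take by mistake, which matters as much as the token saving.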

Information Retrieval Techniques

Before you assemble context, you need to retrieve it. This is where the first layer of intelligence happens.

Query rewriting: An LLM rephrases the user’s question before retrieval. “What’s your return policy for items bought in-store?” becomes “Return policy physical retail locations.” Better queries retrieve better documents.
Hybrid RAG: Combines dense vector search (semantic similarity) with sparse keyword search (exact matches). A question about “vehicle collisions” finds documents about “car accidents” through vectors, and finds documents that contain the exact phrase “vehicle collisions”, or an exact identifier such as a policy number, through keyword matching. Many production systems use hybrid retrieval.
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the user’s question. Turn that answer into an embedding. Use it to search for real documents similar to the hypothetical answer. Often retrieves more relevant chunks than the raw query.
Adaptive RAG: Make retrieval self-correcting. If the first retrieval round returns low-confidence results, try again with a rewritten query. If the model later indicates it needs more information, retrieve again.
Agentic RAG: Instead of one fixed retrieval step, let the agent decide when to retrieve, what to retrieve, and how to use it. This adds flexibility but also complexity and cost.
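To make the hybrid idea concrete, the sketch below merges a vector ranking and a keyword ranking with reciprocal rank fusion, which is one common way to combine ranked lists and not necessarily what any particular product uses. The document ids are invented, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one.

    Each document scores 1 / (k + rank) for every list it appears in, so
    documents found by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score, highest first; break ties alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "vehicle collisions".
vector_hits = ["car-accidents-faq", "collision-claims", "road-safety"]
keyword_hits = ["collision-claims", "vehicle-collisions-policy"]

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

In this example collision-claims ranks first because both retrievers found it, even though neither ranked it above everything else on its own list.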

For product managers, these are not just technical options. They are quality levers. Invest in smarter retrieval and you can reduce the volume of context you need to stuff into the prompt. Invest in query rewriting and you may not need a more expensive embedding model.
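The self-correcting loop described under Adaptive RAG is one such lever, and it can be sketched in a few lines. Everything here is an assumption made for illustration: the retrieve and rewrite callables, the confidence scores, the 0.6 threshold, and the fake retriever used in the demonstration.

```python
from typing import Callable

Hit = tuple[str, float]  # (document text, confidence score between 0 and 1)

def adaptive_retrieve(query: str,
                      retrieve: Callable[[str], list[Hit]],
                      rewrite: Callable[[str], str],
                      min_score: float = 0.6,
                      max_rounds: int = 3) -> tuple[str, list[Hit]]:
    """Retrieve, and if the best hit is weak, rewrite the query and try again.

    Returns the query that was finally used together with its hits.
    """
    current = query
    for round_no in range(max_rounds):
        hits = retrieve(current)
        confident = bool(hits) and max(score for _, score in hits) >= min_score
        if confident or round_no == max_rounds - 1:
            return current, hits
        current = rewrite(current)
    return current, []  # only reached when max_rounds is zero or negative

# Stand-ins for a real retriever and a real query rewriter.
def fake_retrieve(q: str) -> list[Hit]:
    if "return policy" in q.lower():
        return [("Returns are accepted within 30 days.", 0.9)]
    return [("Shipping takes 3 to 5 days.", 0.2)]

def fake_rewrite(q: str) -> str:
    return "return policy for in-store purchases"

used_query, hits = adaptive_retrieve("can I bring stuff back?", fake_retrieve, fake_rewrite)
print(used_query, hits)
```

The max_rounds cap is where cost and latency enter: each extra round is another retrieval call and, in a real system, probably another model call for the rewrite.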

Context Assembly Techniques

Retrieval finds candidates. Assembly turns candidates into the context that actually reaches the model. This is where the second layer of intelligence happens.

Filtering: Only include documents matching criteria. Last 30 days. Relevant department. High confidence. Filters reduce noise.
Deduplication: Remove near-duplicates. Embedding search tends to return several near-identical chunks for the same query. Deduping saves tokens.
Re-ranking: Sort retrieved results by true relevance, not just embedding similarity. A cross-encoder model scores each chunk. You retrieve top 20, re-rank, then use top 5. Re-ranking is expensive but often worth it; it often improves quality more than retrieving more documents does.
Scoring: Assign importance or priority to each chunk. Use a scoring function to decide whether to include, compress, or skip a document.
Compressing: Shorten documents while preserving key meaning. A 1000-token document becomes a 200-token summary. Saves context window space.
Combining: Merge results from multiple sources. User profile from CRM. Support history from Zendesk. Product docs from Confluence. All assembled into one coherent context.
Splitting: Distribute context across specialised models. One model for legal review. Another for market analysis. Another for technical evaluation. Each gets only what it needs.
Chunking: Break documents into pieces before storage (this happens at ingestion, not assembly, but it affects everything downstream). Chunks that are too small miss context. Chunks that are too large waste tokens and lose precision.
Chunk stitching: Reconstruct logical flow when you retrieve fragmented pieces. The model sees coherent narrative, not disjointed blurbs.
Templating: Wrap selected context in structured format. Add metadata headers: “Date: 2026-01-15, Author: Jane Smith, Confidence: High.” Makes the input easier to parse.
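Because chunking shapes everything downstream, a small sketch may help. It splits text into fixed-size word windows with some overlap between neighbours, so that a sentence cut at a boundary still appears whole in one chunk. Overlap is an assumption added for this example, as are the function name and the default sizes; real systems often split on headings or sentences instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `size` words, each sharing `overlap`
    words with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Smaller values of size give more precise retrieval at the risk of losing surrounding context; larger values do the opposite.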

The product insight is this: your retrieval might find the right documents, but if assembly is poor, the model sees noise. Each assembly technique is a quality lever. Filtering reduces false positives. Re-ranking improves precision. Compression saves tokens. Your job is to sequence them intentionally.
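Sequencing can be sketched as a small pipeline: filter, re-rank, deduplicate, fit to a token budget, then template. This is a simplified illustration under stated assumptions. The Chunk fields and the sample data are hypothetical, the score is assumed to come from a re-ranker that is not shown, and deduplication here only catches copies that match after whitespace and case are normalised.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    source: str
    updated: date
    score: float  # relevance score, assumed to come from a re-ranker

def assemble(chunks: list[Chunk], newer_than: date, budget_tokens: int) -> str:
    # 1. Filter: drop anything older than the cut-off date.
    fresh = [c for c in chunks if c.updated >= newer_than]
    # 2. Re-rank: highest score first.
    ranked = sorted(fresh, key=lambda c: c.score, reverse=True)
    # 3. Deduplicate: keep the best-scoring copy of each normalised text.
    seen: set[str] = set()
    blocks: list[str] = []
    used = 0
    for c in ranked:
        key = " ".join(c.text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        # 4. Template: add a metadata header to each chunk.
        block = f"[Source: {c.source} | Updated: {c.updated.isoformat()}]\n{c.text}"
        # 5. Budget: skip chunks that no longer fit (~4 characters per token).
        cost = len(block) // 4
        if used + cost > budget_tokens:
            continue
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)

chunks = [
    Chunk("Items can be returned within 90 days.", "policy-v1", date(2023, 1, 10), 0.95),
    Chunk("Items can be returned within 30 days.", "policy-v2", date(2025, 6, 1), 0.90),
    Chunk("Items can be returned  within 30 days.", "faq", date(2025, 7, 1), 0.80),
    Chunk("Gift cards are non-refundable.", "policy-v2", date(2025, 6, 1), 0.40),
]
context = assemble(chunks, newer_than=date(2025, 1, 1), budget_tokens=200)
print(context)
```

The order matters. Filtering first removes the outdated 90-day policy even though it has the highest score, and ranking before deduplicating means the copy that survives is the best-scoring one.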

The Surface Area You Now Own

Step back and look at what this article has covered. Instructions, examples, knowledge, memory, tools: five types of context competing for the same token budget. Retrieval techniques that determine which information even reaches the model. Assembly techniques that determine how it is structured when it arrives. Tool descriptions that determine which actions the model can take.

Every one of these decisions shapes what your user experiences. The model is the same for everyone. Frontier models are a commodity; your competitors can access the same ones you can. The context is what makes your product yours. It is where your domain knowledge, your user understanding, and your product judgment translate into quality that a competitor cannot replicate by switching to the same API.

This is why context engineering is a PM skill. Not because you need to write the retrieval code, but because you need to define what “relevant” means, decide which trade-offs to make when tokens are scarce, and know enough about the mechanics to have a useful opinion when someone suggests “we should just load the whole document in.”

Test Your Understanding

You have a 128k-token context window. Your system prompt is 1,000 tokens. You have budgeted 10,000 tokens for retrieved content. Should you retrieve 10 × 1,000-token documents or 100 × 100-token chunks, and why?
Your retrieval system returns semantically similar documents but the model still gives wrong answers. Name two assembly techniques that could help, and explain how.
You’re designing a tool set for an agent. You have access to 15 functions but the agent only needs 3. What is the context cost of exposing all 15, and what should you do?
A colleague argues that with a 200k-token window, you should forget about RAG and just stuff entire documents into the prompt. What are three counter-arguments?

The shift from prompt engineering to context engineering is a shift in where you invest your attention. Prompt engineering optimises the last thing the model sees. Context engineering optimises everything that reaches it first.

As models improve, they become better at extracting value from context. The ceiling on quality is no longer “can the model understand this prompt,” but “does the model have the right information to begin with.”

That’s your leverage now.

What’s Next

If you’ve followed the series to this point, you now have three layers of the framework in place.

Layer 1 gave you the mechanics: tokens, context windows, attention, the training/inference distinction. Layer 2 gave you measurement: golden datasets, eval hierarchies, regression suites, the eval flywheel. This article, the first half of Layer 3, gave you the application layer: the five types of context, retrieval and assembly as quality levers, and the recognition that these are product decisions, not engineering details.

What you don’t yet have is what to do when it breaks. The returns policy failure from the opening of this article is still undiagnosed. You know the context was wrong, but you don’t yet have a repeatable method for tracing the failure to a specific component and routing the fix to the right team.

That is the next article: Layer 3, Part 2. A diagnostic discipline for figuring out whether a failure lives in the model or your pipeline, and how to turn that diagnosis into a credible stakeholder conversation.