Triage: The Diagnostic Discipline That Separates Fixable From Fundamental

Layer 3 of the AI Fluency framework, Part 2: the diagnostic discipline. When your AI product breaks, how to figure out whether the model or your application is at fault, route the fix to the right team, and give stakeholders a credible timeline.

Steve James

Remember the returns policy failure from the previous article? Your AI support agent told a customer they had 90 days to return an item. Your actual policy is 30 days. The customer tried on day 45, got rejected by a human agent, and is now furious. You now know how the context pipeline should have been built: the five types of context, retrieval techniques, assembly decisions. What you don’t yet have is a way to use that knowledge when things go wrong. A repeatable method for figuring out where a failure lives, who owns the fix, and how long it will take.

Your engineering lead asks in standup: “Should we file this as a model limitation with the provider, or is this something we can fix?” Your instinct might be to defer. After this article, you won’t need to.

This is Layer 3, Part 2: diagnosis. Part 1 taught you how to design the application layer. This article teaches you what to do when it breaks. Together, they complete Layer 3 of the AI Fluency framework: Product Architecture.

The Reproduction Test

Here is the single most useful diagnostic question in AI product work:

Does this failure reproduce when you send the same prompt directly to the raw model, with no retrieval, no orchestration, no guardrails?

Open your provider’s API playground: Anthropic’s Workbench, OpenAI’s Playground, or the equivalent for whichever model you use. Strip out your RAG pipeline. Strip out your system prompt. Give the model the same question the user asked, with just the relevant context pasted in by hand. See what happens.

If the failure reproduces, it’s a model-layer problem. The model genuinely cannot do this task reliably, even with perfect context. Your options are limited: stronger prompt engineering, a more capable model, or escalation to the provider. These are slow fixes, measured in weeks to months.

If the failure disappears, it’s an application-layer problem. Something in your pipeline is breaking the interaction between the model and the user. Your retrieval is pulling the wrong documents. Your context assembly is burying the answer. Your output processing is mangling the response. These are fast fixes, measured in hours to days.

    User reports: "The AI got this wrong"
                    │
                    ▼
    Reproduce on the raw model?
    (Same prompt, no pipeline,
     direct API call)
                    │
           ┌────────┴────────┐
           │                 │
          YES                NO
           │                 │
           ▼                 ▼
     MODEL LAYER        APP LAYER
           │                 │
     Slow fix:          Fast fix:
     prompt iteration,  debug retrieval,
     switch model,      context assembly,
     escalate to        guardrails,
     provider           output processing

This is not a theoretical framework. It’s a practical test you can run in ten minutes. And the result changes everything about who works on the problem, how long it takes, and what you tell the customer.

What the Test Actually Looks Like

The reproduction test sounds simple, but doing it well requires some care. Here’s what to actually do.

First, grab the exact user query that failed. Not a paraphrase of it, the actual input. Failures are often sensitive to phrasing, and testing with a tidied-up version defeats the purpose.

Second, open the provider’s API playground: Anthropic’s Workbench or OpenAI’s Playground, whichever you use. The point is to bypass your entire application stack.

Third, paste in the user’s query along with the context the model should have had: the relevant policy document, the product specs, whatever the retrieval pipeline was supposed to provide. Paste it manually so you’ll know exactly what the model is seeing.

Fourth, run it. If the model gets it right with clean context and no pipeline, the pipeline is the problem. If it gets it wrong even with perfect context handed to it directly, the model is the problem.
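
If you run this test often, the four steps above can be scripted. What follows is a minimal sketch, not a definitive implementation. The names reproduction_test, call_raw_model, is_correct and fake_model are hypothetical: call_raw_model stands in for whatever direct provider call you use, and fake_model is a stub so the example runs offline.

```python
from typing import Callable


def reproduction_test(
    query: str,
    clean_context: str,
    call_raw_model: Callable[[str], str],
    is_correct: Callable[[str], bool],
) -> str:
    """Classify a failure as model-layer or application-layer."""
    # Hand the model the exact user query plus hand-picked context,
    # bypassing retrieval, orchestration, and guardrails.
    prompt = f"{clean_context}\n\nQuestion: {query}"
    answer = call_raw_model(prompt)
    # Right answer on the raw model means the failure disappeared,
    # so the pipeline is at fault. Wrong answer means it reproduced.
    return "application layer" if is_correct(answer) else "model layer"


# Hypothetical stub standing in for a real provider API call.
def fake_model(prompt: str) -> str:
    if "30 days" in prompt:
        return "You have 30 days to return the item."
    return "You have 90 days to return the item."


policy = "Returns policy: items may be returned within 30 days of purchase."
layer = reproduction_test(
    "What's the return window for this product?",
    policy,
    fake_model,
    lambda answer: "30 days" in answer,
)
print(layer)  # prints "application layer"
```

The correctness check is deliberately passed in as a function. What counts as "right" depends on the failure you are chasing, and in many cases a human reading the answer is the check.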

This test has a secondary benefit: it builds your intuition for what the model can and cannot do. After running a dozen of these, you start to develop a sense for which failures are likely to be application-layer before you even test. That intuition is worth developing deliberately.

The Investigation: A Customer Support Agent Gone Wrong

Let’s walk through the returns policy example. Your AI support agent is telling customers they have 90 days to return items. Your actual policy is 30 days. A customer tried to return something on day 45, was told by a human agent that it was too late, and is now furious because “your AI told me I could.”

You run the reproduction test. You paste the customer’s question (“What’s the return window for this product?”) into the API console along with your returns policy document. The model answers correctly: 30 days. So the pipeline is broken somewhere. Now you need to find where.

Step 1: Is the returns policy in your knowledge base at all?

This sounds obvious, but check. Teams add product information, troubleshooting guides, and FAQ content to their knowledge bases, and sometimes miss the mundane operational documents. If your returns policy was never ingested, the model has no source of truth to draw from. It will fall back on its training data, which might include outdated versions of your policy, a competitor’s policy, or a generic “most retailers offer 90 days” inference. Fix: ingest the document. This takes hours.

Step 2: Is it being retrieved when customers ask about returns?

The document exists, but is it surfacing? You need to see what the retrieval pipeline actually returns for the customer’s query. How you do this depends on your stack. If you’re using a managed platform like LangSmith, Langfuse, or Arize, open the trace for the failed request and look at the retrieval step: it will show you exactly which chunks were pulled and in what order. If your team has built a custom pipeline, ask your engineer to point you to the retrieval logs or to run the query against the vector store directly so you can see the ranked results. Most vector databases (Pinecone, Weaviate, Qdrant, pgvector) have a console or API endpoint where you can run a similarity search and inspect what comes back. The key question is simple: when a customer asks about returns, does the returns policy appear in the top results? If it isn’t ranking in the top chunks, your retrieval is the bottleneck. Maybe the document’s title is “Operational Policies Q1 2026” and the query “can I return this?” doesn’t match it semantically. Maybe your chunking split the 30-day clause away from the surrounding context. Fix: adjust chunking boundaries, add metadata tags, or switch to hybrid retrieval. This takes days.
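
Once you have the ranked results in hand, the check itself is small. The sketch below assumes the results have been exported as a list of dictionaries with a doc_id field. That shape, the document IDs, and the top_k cut-off of 3 are all assumptions for illustration, not the format of any particular vector database.

```python
from typing import Dict, List, Optional


def rank_of_expected(ranked_chunks: List[Dict], expected_doc_id: str) -> Optional[int]:
    """Return the 1-based rank of the first chunk from the expected document."""
    for position, chunk in enumerate(ranked_chunks, start=1):
        if chunk["doc_id"] == expected_doc_id:
            return position
    return None


def retrieval_verdict(ranked_chunks: List[Dict], expected_doc_id: str, top_k: int = 3) -> str:
    """Say whether the expected document made it into the top_k chunks."""
    rank = rank_of_expected(ranked_chunks, expected_doc_id)
    if rank is None:
        return "not retrieved"
    if rank > top_k:
        return f"retrieved at rank {rank}, outside the top {top_k}"
    return f"retrieved at rank {rank}"


# Hypothetical ranked results for the query "can I return this?"
results = [
    {"doc_id": "shipping-faq"},
    {"doc_id": "warranty-terms"},
    {"doc_id": "size-guide"},
    {"doc_id": "returns-policy"},
]
print(retrieval_verdict(results, "returns-policy"))
```

Set top_k to however many chunks your pipeline actually passes to the model. A document that ranks just outside that cut-off is, from the model's point of view, not there at all.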

Step 3: Is the model ignoring the retrieved content?

The policy is being retrieved and it’s in the context, but the model is still saying 90 days. This is a grounding failure. The model is preferring what it learned in training over the document you gave it. Your grounding instructions aren’t assertive enough. Fix: strengthen the relevant part of your system prompt. Be explicit: “When answering questions about company policies, use only the provided documents. Do not infer or supplement with general knowledge.” This takes hours.
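
One rough way to confirm a grounding failure, as opposed to a retrieval miss, is to compare the figures in the answer against the retrieved text. The sketch below only looks at day counts, which suits the returns example and little else. The function name and the regular expression are illustrative assumptions.

```python
import re
from typing import List

# Matches figures such as "30 days", "90 day" or "30-day".
DAY_CLAIM = re.compile(r"\b(\d+)[\s-]*days?\b", re.IGNORECASE)


def unsupported_day_claims(answer: str, retrieved_context: str) -> List[str]:
    """Return day counts stated in the answer that never appear in the context."""
    supported = set(DAY_CLAIM.findall(retrieved_context))
    claimed = DAY_CLAIM.findall(answer)
    return [days for days in claimed if days not in supported]


context = "Returns policy: items may be returned within 30 days of purchase."
print(unsupported_day_claims("You have 90 days to return the item.", context))
print(unsupported_day_claims("There is a 30-day return window.", context))
```

If the first call flags a figure while the correct one is sitting in the context, the model had the right document and did not use it. That points at grounding, not retrieval.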

Step 4: Is the content in your knowledge base stale?

Everything is working correctly, and the model is faithfully using the retrieved document. But the document itself says 90 days because it’s from last year, before you changed the policy. This is a content management problem, not an AI problem. Fix: update the document and establish a refresh cadence for policy content. This takes days to fix and weeks to systematise.
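
A refresh cadence can start as a simple age check over the knowledge base. This sketch assumes each document record carries a last_reviewed date. The field name, the document IDs, and the 180-day threshold are illustrative assumptions, so pick a cadence that matches how often your policies actually change.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_documents(documents: List[Dict], today: date, max_age_days: int = 180) -> List[str]:
    """Return the IDs of documents not reviewed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc["doc_id"] for doc in documents if doc["last_reviewed"] < cutoff]


# Hypothetical knowledge-base records.
knowledge_base = [
    {"doc_id": "returns-policy", "last_reviewed": date(2024, 1, 10)},
    {"doc_id": "shipping-faq", "last_reviewed": date(2025, 5, 1)},
]
print(stale_documents(knowledge_base, today=date(2025, 6, 1)))
```

An age check will not tell you a document is wrong, only that nobody has confirmed it recently. That is usually enough to prompt the owner to look.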

Step 5: Is the model reasoning incorrectly even with correct, current content?

You’ve confirmed the right document is being retrieved, it’s current, and the grounding instructions are strong. But the model is still getting the answer wrong. Maybe the policy has a complex conditional structure (“30 days for electronics, 60 days for clothing, 14 days for perishables”) and the model is applying the wrong condition. This is a genuine model-layer limitation. Your options: add structured reasoning instructions to force step-by-step evaluation of which condition applies, switch to a more capable model, or restructure the policy document to make conditions unambiguous. This is the rarest outcome.

To recap the sequence: missing document, retrieval miss, grounding failure, stale content, genuine model limitation. In practice, steps 1 through 4 account for roughly 90% of failures that initially look like “the model is wrong.” Step 5 is the residual. Diagnosing it requires ruling out everything else first, which is why the sequence matters.
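
The sequence can be written down as an ordered checklist where the first failing check names the diagnosis. This is a sketch of the walk-through above, not a tool. The check names are hypothetical, and the fixes and timelines are the ones given in the steps.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Diagnosis:
    failure: str
    fix: str
    timeline: str


# (check name, failure it indicates, fix, rough timeline), in the order of steps 1 to 4.
TRIAGE_STEPS: List[Tuple[str, str, str, str]] = [
    ("document_in_knowledge_base", "missing document", "ingest the document", "hours"),
    ("document_retrieved", "retrieval miss",
     "adjust chunking, add metadata tags, or switch to hybrid retrieval", "days"),
    ("model_uses_retrieved_content", "grounding failure",
     "strengthen the grounding instructions in the system prompt", "hours"),
    ("content_is_current", "stale content",
     "update the document and set a refresh cadence", "days"),
]

# Step 5 is the residual: reached only when every earlier check passes.
MODEL_LIMITATION = Diagnosis(
    "genuine model limitation",
    "add structured reasoning instructions, switch model, or restructure the document",
    "weeks to months",
)


def triage(checks: Dict[str, bool]) -> Diagnosis:
    """Walk the checks in order and return the first one that fails."""
    for name, failure, fix, timeline in TRIAGE_STEPS:
        # A check that has not been run is treated as not yet passed.
        if not checks.get(name, False):
            return Diagnosis(failure, fix, timeline)
    return MODEL_LIMITATION


result = triage({"document_in_knowledge_base": True, "document_retrieved": False})
print(f"{result.failure}: {result.fix} ({result.timeline})")
```

Treating an unrun check as a failure is a deliberate choice here. It stops you from reaching the model-limitation verdict without having ruled out everything before it.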

Why This Is a PM Skill, Not an Engineering Task

You might be thinking: isn’t this debugging? Shouldn’t the engineers handle it?

They should handle the fix. But the triage belongs to you, for three reasons.

First, you’re the one fielding the customer complaint or the stakeholder question. If you can’t diagnose the layer in the meeting where the problem is raised, you lose a day to “let me check with the team.” If you can, you give a credible answer on the spot. “This looks like a retrieval issue. We can investigate today and likely have a fix this week.” That sentence changes the temperature of the conversation.

Second, you own prioritisation. Knowing the layer tells you the cost of the fix, which determines where it sits in the backlog. Application-layer fixes are cheap and fast, which means they should rarely be deferred: if you can fix a retrieval miss in a day, there is no good reason to leave it in the backlog for a sprint. Model-layer problems are expensive and slow, which means deferral is sometimes the right call: you might work around the limitation, accept it as a known constraint, or invest in a model switch when the roadmap allows. That asymmetry is a product decision, and you’re the one making it.

Third, you’re the person who spots patterns. A single retrieval failure is a bug. Ten retrieval failures across different document types are an architectural problem that needs investment. You can only see that pattern if you’re doing triage consistently, not delegating every quality report to engineering with a “please investigate.”
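
Spotting the pattern only requires that each triage result is recorded somewhere you can count. A minimal sketch, assuming a log of past reports with a failure field. The field names, the sample tickets, and the threshold of three are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def recurring_failures(reports: List[Dict], threshold: int = 3) -> Dict[str, int]:
    """Return failure types seen at least `threshold` times, with their counts."""
    counts = Counter(report["failure"] for report in reports)
    return {failure: n for failure, n in counts.items() if n >= threshold}


# Hypothetical triage log.
triage_log = [
    {"ticket": "A-101", "failure": "retrieval miss"},
    {"ticket": "A-107", "failure": "stale content"},
    {"ticket": "A-112", "failure": "retrieval miss"},
    {"ticket": "A-118", "failure": "retrieval miss"},
]
print(recurring_failures(triage_log))
```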

The Stakeholder Conversation

The practical payoff of layer diagnosis is the conversation it enables.

Without it, you say: “We’re looking into the issue and will get back to you.” This is the default, and it’s weak. It communicates nothing about severity, timeline, or whether you understand the problem.

With it, you say: “We’ve traced this to a retrieval issue. Our pipeline isn’t surfacing the returns policy when customers ask about returns. The document is in the knowledge base, but the semantic match is poor because of how it’s titled and chunked. We can fix the chunking and add metadata tags this week. I’ll confirm when it’s deployed and we’ll monitor the same query class for a few days after.”

That is a different conversation. It names the layer, identifies the specific failure, scopes the fix, commits to a timeline, and describes the verification step. It’s the difference between forwarding a ticket and owning the problem. The diagnosis that makes it possible takes ten minutes.

When the Layer Isn’t Obvious

Not every failure falls cleanly into one bucket. Some common ambiguities and how to handle them:

The model gets it right sometimes and wrong sometimes. This is almost always application-layer. The usual causes are inconsistent retrieval (the right document surfaces on some queries but not others), a temperature set too high (the randomness parameter from Article 2), or context ordering issues (the answer is in the context but buried in the middle, where the model attends poorly). Run the reproduction test multiple times with the exact same input. If the raw model is consistent but your app isn’t, the pipeline is introducing variance.

The model gets it mostly right but adds incorrect details. This is a grounding problem. The model is using the retrieved content as a starting point and then embellishing with training knowledge. Strengthen your grounding instructions and consider adding an output validation step that checks claims against the source documents.

The model refuses to answer a question it should answer. Check your guardrails. Overly aggressive input filters or safety instructions can cause false refusals. This is application-layer: your constraints are too tight, not too loose.

The failure only appears at scale. Some issues only surface with high concurrency, long conversations, or large context windows. These are infrastructure-layer or application-layer issues (rate limiting, context truncation, session management), not model issues. Don’t blame the model for your orchestration.
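
For the first ambiguity above (right sometimes, wrong sometimes), the repeated-run comparison can be sketched as follows. The two callables are hypothetical stand-ins: one for a direct call to the raw model, one for a call through your full application. Answers are compared as exact strings, which is an assumption, since real answers usually need normalising or a human read first.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def consistency(ask: Callable[[], str], trials: int = 10) -> float:
    """Fraction of trials that produced the most common answer."""
    answers = Counter(ask() for _ in range(trials))
    return answers.most_common(1)[0][1] / trials


# Hypothetical stubs: the raw model is stable, the full app alternates.
def raw_model() -> str:
    return "30 days"


_flaky_answers = cycle(["30 days", "90 days"])


def full_app() -> str:
    return next(_flaky_answers)


raw_score = consistency(raw_model)
app_score = consistency(full_app)
if raw_score > app_score:
    print("The pipeline is introducing variance.")
```

Ten trials is an arbitrary number chosen for the sketch. The point is to run the same input through both paths enough times that a difference in consistency is visible.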

Test Your Understanding

Here are three scenarios. For each one, identify the layer and draft the first two sentences of the message you’d send to the stakeholder who raised the issue.

Scenario 1. Your e-commerce product recommendation agent keeps suggesting items that are out of stock. Customers are clicking through and finding “unavailable” pages. When you test the same query on the raw model with current inventory data pasted in, it correctly excludes out-of-stock items.

Scenario 2. Your internal knowledge assistant is asked “What’s our parental leave policy?” and responds with a policy from 2019. Your HR team updated the policy in 2024. You test the raw model with the 2024 policy pasted in and it answers correctly.

Scenario 3. Your AI coding assistant is asked to refactor a complex recursive function into an iterative one. It produces code that compiles but has a subtle off-by-one error in the loop termination condition. You test the same prompt on the raw model with the same code context and it makes the same mistake.

The answers are less important than the discipline. In each case, the reproduction test tells you the layer. The layer tells you the fix. The fix tells you the timeline. And the timeline is what your stakeholder actually needs to hear.

What’s Next

This completes Layer 3: Product Architecture. Part 1 taught you how to design the application layer. This article taught you how to diagnose it when it breaks. Between the two, you can build an application layer that performs and debug it when it doesn’t. The core insight across both: most failures that look like model problems are actually application problems, and application problems are fixable, fast, and within your control.

Step back for a moment and consider what you now have. Layer 1 gave you enough mechanical understanding to hold your own in a technical conversation. Layer 2 gave you the measurement discipline to make evidence-based product decisions. Layer 3 gave you the design knowledge and diagnostic method to build and debug the application layer. Three layers in, you can sit in a room where someone reports “the AI got this wrong,” diagnose the layer in ten minutes, name the fix, estimate the timeline, and communicate all of that credibly. Six months ago, most of us would have said “let me check with the team.”

The final article in this series is Layer 4: Safety and Governance. You’ve built the product and you know how to debug it. Now, what happens when it fails in ways that matter beyond your backlog? When a regulator asks how your model was validated, when a customer discovers bias in your outputs, when an incident lands and your response determines whether the customer stays or leaves. That is what Layer 4 prepares you for.
