The Four-Layer Model: A PM's Framework for AI Product Quality
A structured framework for understanding and diagnosing AI product failures. Learn the four layers of technical literacy PMs need, and how real incidents cross all of them at once.
A legal AI tool summarises a Share Purchase Agreement. The summary is crisp and accurate. It flags an unusual warranty clause. But when it cites Hedley Byrne v Heller [2019] UKSC 14 as the governing authority, there’s a problem: Hedley Byrne was decided in 1964, and there is no 2019 Supreme Court case at that citation. The model invented it. The customer’s general counsel reads it back to you and asks a simple, devastating question: “Can we trust any of your tool’s output?”
You have one hour to know what went wrong.
This is where most PMs get stuck. They see the problem and they know it’s bad. But they don’t have the language to move between the different layers of causation fast enough to diagnose it, fix it, and explain it.
About six months ago, I decided to fix that gap for myself.
How this series came about
I’m a Product Manager working in AI, and I reached a point where I realised that surface-level familiarity with the technology wasn’t enough. I could talk about AI at a strategy level, but when a technical conversation got specific, when someone mentioned attention mechanisms, eval rubrics, or grounding failures, I didn’t always have the depth to push back or ask the right follow-up question. That bothered me.
So I started reading everything I could get my hands on. Pawel Huryn’s Product Compass became one of my most valuable sources, particularly his work on context engineering and AI agent architectures. His writing has a rare quality: it’s technically rigorous without losing the product lens. If you’re a PM working anywhere near AI and you’re not subscribed, fix that. Aakash Gupta’s Product Growth has been another consistent source of sharp thinking on where the PM role is heading as AI reshapes the discipline. Both of them are doing genuinely important work making this knowledge accessible to product people.
Beyond those two, I consumed research papers, Anthropic’s and OpenAI’s developer documentation, Simon Willison’s blog (consistently one of the sharpest voices on the practical realities of building with LLMs), Hamel Husain’s writing on evals (if you read one thing on AI evaluation, make it “Your AI Product Needs Evals”), and anything else I could find that helped me build a more complete picture. I captured all of it in an Obsidian vault that grew, over months, into a fairly comprehensive AI knowledge repository structured around a layered learning curriculum.
This series is that curriculum, rewritten for a wider audience. I’m publishing it because I think every PM working with AI needs this knowledge, and too much of it is scattered across technical papers, engineering blogs, and paywalled courses that assume you’re building models rather than building products with them.
The framework I landed on organises everything into four layers. It’s not the only way to structure this knowledge, but it’s the one that keeps proving useful in practice, because real AI product problems don’t stay in one layer. They cross all of them at once.
Why four layers?
Most PMs specialise in one layer and miss the others. An engineer might understand Layer 1 (how models work) deeply and miss Layer 2 (how to evaluate them). A quality leader might own Layer 2 without seeing Layer 3 (how architecture amplifies or suppresses failures). A compliance officer might own Layer 4 without understanding Layers 1 or 2, and so end up writing governance policies that don’t address the root causes.
The real power comes from moving between them. The fabricated case citation above won’t make sense if you only look at one layer. It requires understanding how next-token prediction works (Layer 1), why the eval suite missed it (Layer 2), what the prompt architecture did wrong (Layer 3), and what the customer’s GC actually needs to hear (Layer 4).
Here are the four layers, stacked from mechanics to governance:
┌────────────────────────────────────┐
│ Layer 4: Safety and Governance │
│ Trust, regulation, incidents │
├────────────────────────────────────┤
│ Layer 3: Product Architecture │
│ RAG, prompts, guardrails │
├────────────────────────────────────┤
│ Layer 2: Evaluation and Quality │
│ Evals, regression, benchmarks │
├────────────────────────────────────┤
│ Layer 1: How Models Work │
│ Tokens, attention, inference │
└────────────────────────────────────┘
Layer 1: How Models Actually Work
This is the foundation layer. You don’t need a PhD in transformers, but you need to understand the basic mechanics: what tokens are, why context matters, how attention works, how models generate text, and why the distinction between training and inference shapes product timelines.
Layer 1 is where you learn concepts like temperature (which controls whether a model produces the same output every time or samples from multiple possibilities), context windows (which limit how much information you can feed the model), fine-tuning (expensive but powerful) versus RAG (retrieval-augmented generation: asking the model to answer based on specific documents you feed it), and embeddings (the mathematical representation that makes semantic search possible).
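Temperature is easier to grasp with a toy sketch. The snippet below (illustrative only; the token strings and logit values are invented, not from any real model) shows how temperature reshapes a model's next-token scores into a sampling distribution: at temperature 0 the model always picks the highest-scoring token, while higher temperatures leave real probability mass on less likely tokens, which is one reason a fabricated citation can slip out on some runs and not others.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw next-token scores into a sampling distribution.
    Lower temperature sharpens the distribution; higher flattens it."""
    if temperature <= 0:  # treat T=0 as greedy decoding: always the top token
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token scores for the year in a citation; values are made up.
tokens = ["1964", "2019", "1932"]
logits = [2.0, 1.5, 0.5]

greedy = softmax_with_temperature(logits, 0)     # deterministic: always "1964"
sampled = softmax_with_temperature(logits, 1.0)  # "2019" keeps real probability
print(dict(zip(tokens, greedy)))
print({t: round(p, 2) for t, p in zip(tokens, sampled)})
```

The practical takeaway: the same prompt can produce different outputs run to run unless sampling is pinned down, which matters when you're trying to reproduce a customer's bug.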
When someone says “the model is hallucinating”, that’s Layer 1 language. It doesn’t explain why, but it names what happened at the token level: the model’s next-token prediction led it down a path that confabulated information not present in its input.
In this incident, the model’s Layer 1 failure was straightforward. When it drafted the words “the seller’s tortious liability is governed by”, the next token it predicted was a case citation. This is statistically probable: English legal prose trains models to expect citations in this position. The model had seen “Hedley Byrne v Heller” thousands of times in training data associated with tortious liability. It sampled that citation, then confabulated the year and court to match the format pattern it had learned: [YEAR] COURT ABBREVIATION NUMBER.
The grounding instruction (“never cite cases not in the source material”) was present in the prompt. But because of how attention works in transformers, an instruction buried in a long context has weaker weight than a deep prior learned across thousands of training examples. The model chose the prior.
Layer 2: Evaluation and Quality
This is the layer that separates PMs who control their own destiny from those who get surprised by customers.
Layer 2 is about defining quality rigorously enough that you can measure whether your model is actually doing what you promised. It covers precision, recall, and F1 scores; hallucination types (intrinsic: contradicting the source material; extrinsic: introducing information the source can’t verify); the difference between faithfulness and factuality; how to build regression suites so you catch silent model degradation; and how to read benchmark claims without getting hoodwinked.
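Precision, recall, and F1 are simple enough to compute by hand. Here's a minimal sketch with hypothetical clause IDs (the names and numbers are invented for illustration): the model flags clauses in a contract, a lawyer's review is the ground truth, and the three metrics tell you different things about the gap between them.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 over sets of flagged items,
    e.g. clauses an AI reviewer flagged vs. clauses a lawyer flagged."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # true positives: correctly flagged
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical run: the model flags 4 clauses; 3 of the lawyer's 5 are among them.
model_flags = {"clause_2", "clause_5", "clause_7", "clause_9"}
lawyer_flags = {"clause_2", "clause_5", "clause_7", "clause_8", "clause_11"}
p, r, f = precision_recall_f1(model_flags, lawyer_flags)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```

Notice the asymmetry: high precision with mediocre recall means the model rarely cries wolf but misses real issues, a very different product problem from the reverse.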
A faithfulness eval tests whether the model’s answer agrees with the source material it was given. The summary passed this test: the warranty clauses it described were accurate. The citation hallucination was not detected by a faithfulness eval because the eval doesn’t measure “did the model introduce an entity that wasn’t in the source”.
This is the biggest gap most PMs have. They focus on whether the output is good, not on whether the eval is measuring what matters. They measure happy paths and miss the failure modes. In this case, the eval suite detected accuracy, but it didn’t have a slice that tested “citation introduced outside source material”. That’s a rubric design failure.
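The missing eval slice is not complicated to build. A rough sketch, assuming a simplified citation pattern (a production rubric would cover many more citation formats, and the case data here is invented for illustration): extract citation-shaped strings from the output and fail any test case where one doesn't appear in the source.

```python
import re

# Rough pattern for neutral UK citations like "[2019] UKSC 14".
# Assumption: real citation formats are far more varied than this.
CITATION_RE = re.compile(r"\[\d{4}\]\s+[A-Z]+\s+\d+")

def introduced_citations(summary, source):
    """Return citations that appear in the model output but not in the source."""
    return [c for c in CITATION_RE.findall(summary) if c not in source]

def run_slice(cases):
    """Eval slice: fail any case whose summary introduces a citation."""
    failures = []
    for case in cases:
        extra = introduced_citations(case["summary"], case["source"])
        if extra:
            failures.append((case["id"], extra))
    return failures

cases = [
    {"id": "spa-001",
     "source": "Warranty claims are limited per clause 9.3.",
     "summary": "Liability is governed by Hedley Byrne v Heller [2019] UKSC 14."},
    {"id": "spa-002",
     "source": "See Smith v Jones [2011] EWCA 42 on indemnities.",
     "summary": "Indemnities follow Smith v Jones [2011] EWCA 42."},
]
print(run_slice(cases))  # flags spa-001: the citation isn't in the source
```

This is what a rubric slice looks like in practice: a narrow, named failure class with its own pass/fail logic, not a vague "is the summary good?" score.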
Layer 2 is also where you track regression. Did the model work last month and break this week? If so, the provider silently updated the checkpoint, and that’s a governance conversation with your customer, not just a bug fix.
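Regression tracking can start as something this simple. A minimal sketch, assuming you record per-slice pass rates on each eval run (the slice names and rates below are hypothetical): compare the current run against the last known-good baseline and flag drops beyond a tolerance.

```python
def detect_regression(baseline, current, tolerance=0.02):
    """Flag eval slices whose pass rate dropped beyond tolerance
    since the last known-good model checkpoint."""
    regressions = {}
    for slice_name, base_rate in baseline.items():
        cur_rate = current.get(slice_name, 0.0)
        if base_rate - cur_rate > tolerance:
            regressions[slice_name] = (base_rate, cur_rate)
    return regressions

# Hypothetical pass rates per eval slice, before and after a provider update.
baseline = {"faithfulness": 0.97, "citation_grounding": 0.95, "tone": 0.99}
current  = {"faithfulness": 0.96, "citation_grounding": 0.81, "tone": 0.99}
print(detect_regression(baseline, current))  # {'citation_grounding': (0.95, 0.81)}
```

A 14-point drop on one slice with everything else stable is the signature of a silent checkpoint change, and it's evidence you can put in front of both your provider and your customer.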
Layer 3: Product Architecture and Design Patterns
Layer 3 is how the actual product is built. It covers RAG pipelines (retrieval, re-ranking, prompt assembly), agents and tool use, the distinction between model-layer problems (the provider’s job) and app-layer problems (your job), trade-offs between latency, cost, and quality, and guardrails and output validation.
This is where you learn that a prompt rule isn’t an enforcement mechanism. The system prompt said “do NOT cite cases”. That’s Layer 3 work. But to a model, rules are suggestions. The model followed this one most of the time; in this incident, it cited a case anyway. A rule without enforcement is a hope, not an architecture.
The fix is a guardrail: a post-generation step that detects citation patterns and verifies each one against a legal citator before returning the response. That’s enforcement. No citation, no output.
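Structurally, a guardrail like this is a gate, not a prompt. A minimal sketch (the citation pattern is simplified, and the verified-citation set stands in for a call to a real legal citator service, which is assumed here, not named by the source): detect citation-shaped strings in the response and refuse to return anything containing one that fails verification.

```python
import re

CITATION_RE = re.compile(r"\[\d{4}\]\s+[A-Z]+\s+\d+")

# Stand-in for a real citator lookup. In production this set would be
# replaced by a verification call to a case-law database (assumption).
KNOWN_CITATIONS = {"[1964] AC 465"}  # the real Hedley Byrne citation

class CitationGuardrailError(Exception):
    pass

def enforce_citations(response: str) -> str:
    """Post-generation guardrail: no verified citation, no output."""
    for citation in CITATION_RE.findall(response):
        if citation not in KNOWN_CITATIONS:
            raise CitationGuardrailError(f"Unverified citation: {citation}")
    return response  # every citation checked out; safe to return

try:
    enforce_citations("Governed by Hedley Byrne v Heller [2019] UKSC 14.")
except CitationGuardrailError as e:
    print(e)  # the fabricated citation never reaches the customer
```

The key design choice: the guardrail fails closed. An unverifiable citation blocks the response rather than shipping with a warning, which is exactly the "no citation, no output" posture described above.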
Layer 3 also teaches you how problems move through the pipeline. Retrieval gives you the right chunks? Good: the problem isn’t in the retrieval logic. The chunks are promoted correctly by re-ranking? Move on. The grounding instruction is in the prompt? Check. But is it at the top of the prompt (high attention) or buried in the middle (low attention)? That’s architecture and it matters.
Layer 4: Safety, Ethics, and Governance
Layer 4 is what you tell the customer’s general counsel, the regulator, and your board.
This layer covers alignment and constitutional AI (how models are made safe), failure mode classification (is this jailbreaking, prompt injection, sycophancy, or distributional shift?), bias and fairness, the regulatory landscape (EU AI Act, UK GDPR, US fragmentation), the difference between AI safety failures (a model making a mistake) and AI security failures (an attacker exploiting a model), and incident response.
In this scenario, the Layer 4 conversation is: “We’ve confirmed the issue. The model generated a citation not present in your document. This is a known failure mode. We mitigate it through three defences: a prompt rule, a post-generation verifier, and a regression eval suite. The verifier wasn’t yet enabled for this feature. We’re enabling it this week and pausing the feature in the meantime. You’ll have a full post-mortem in five working days.”
That’s not “we have a hallucination problem”. It’s not “we’re so sorry”. It’s: “Here’s where it failed, here’s how you can check we’ve fixed it, here’s our timeline, and here’s the evidence that we think about these failures systematically.”
Under the EU AI Act, a contract-review tool used by qualified lawyers sits close to the “high-risk” line. The regulator doesn’t expect perfection. It expects documented mitigation: a governance framework that shows you’ve thought about failure modes, tested for them, and have a response plan. The Layer 4 conversation is what saves the relationship.
Why It’s All Four Layers at Once
Here’s the critical insight, traced through the incident:
The Fabricated Citation: One Bug, Four Layers
Layer 1 │ Next-token prediction produced a statistically
(Model) │ plausible case citation from training priors.
│ Grounding instruction lost to stronger prior.
│
─────┼──────────────────────────────────────────────
│
Layer 2 │ Eval suite tested faithfulness but NOT
(Eval) │ "introduced entities." Rubric gap.
│
─────┼──────────────────────────────────────────────
│
Layer 3 │ Prompt rule said "do NOT cite cases."
(Arch) │ No enforcement guardrail behind it.
│
─────┼──────────────────────────────────────────────
│
Layer 4 │ Client's GC needs a credible incident response,
(Gov) │ not "it's a known issue with LLMs."
Real production problems are cross-layer. You can’t diagnose this incident with Layer 1 knowledge alone. Yes, the model produced a confabulation, but so what? Why did it reach the customer? Because Layer 2 missed it. Why did Layer 2 miss it? Because the eval rubric had a gap. Why did the gap prove so costly? Because Layer 3 had no enforcement guardrail behind the prompt rule. Why didn’t the customer drop you? Because Layer 4 had a governance response ready.
No single layer owns the fix. No single layer owns the blame.
The PM’s job is to move between layers fast enough that by the time you’re in the room with the customer, you’ve already diagnosed which layer each fix belongs to, what the timeline is for each fix, and which fix gets shipped first.
Consider a few other escalations:
“The AI is slower than it was last week.” Is the context window longer (Layer 1 attention cost)? Is there a regression that shows when the slowdown started (Layer 2)? Did the provider update the model (Layer 3 model-layer vs app-layer)? If it’s a silent update, check the customer contract for a notice requirement (Layer 4 governance).
“The eval suite passes but users say quality dropped.” Your eval has a gap (Layer 2). Interview users, extract the failure class, build a new test slice. Did the provider silently update the checkpoint (Layer 1)? Did any app-layer code change: a new prompt, a re-ranker threshold, a new guardrail (Layer 3)?
“We’re being attacked about hallucination rates on a public benchmark.” This is Layer 2 entirely: benchmark literacy. What dataset? What prompt? Are their numbers even comparable to yours? And Layer 4: the marketing response matters. Publish your own hallucination rate, defined precisely, on a held-out domain-specific benchmark.
Every one of these incidents touches multiple layers. If your trace doesn’t, you’re missing something.
Fluency as Movement
The differentiator isn’t knowing everything. It’s knowing what you know, knowing what you don’t, and being honest about the boundary. It’s being able to say: “I’d need to pull the eval logs to give you a precise number, but here’s how I’d think about whether precision is even the right metric for this use case.”
It’s moving from Layer 1 to Layer 2 to Layer 3 to Layer 4 and back again, fast enough that you can diagnose a problem before it becomes a customer crisis, fix it before your user has to call their lawyer, and communicate the fix in terms that matter to the people who depend on your product.
This is the fluency the series will build. It’s what I set out to learn when I started filling that Obsidian vault, and it’s what I want to make available to every PM who’s feeling the same gap I felt. Each subsequent article goes deep on one layer, plus a dedicated piece on context engineering, the cross-cutting discipline that determines what information reaches the model in the first place. This is the frame you’ll use to connect them.
Test Your Understanding
Before moving to the deeper layers, check yourself on these questions:
- Layer 1 vs Layer 2: A model produces a grammatically perfect answer that’s factually wrong. Is this a Layer 1 problem, a Layer 2 problem, or both? What would you measure to tell the difference?
- Layer 3 diagnosis: You have a system prompt that says “cite only sources from the provided context”. The model cites external sources anyway. Is this a prompt-writing problem or an architectural problem? What would you add to fix it?
- Cross-layer incident: A feature works perfectly in the lab, passes all evals, and then fails on a customer’s data. Which layers would you check first and why?
- Layer 4 communication: A customer asks, “Is this a bug or is this what LLMs just do?” What’s the difference you’re communicating in your answer, and which layers justify your answer?
The series ahead:
- Layer 1: How Models Work — tokens, context windows, attention, and the mechanics that explain why models behave the way they do
- Layer 2: The Eval Gap — why evaluation is the biggest gap most PMs have, and how to close it
- Layer 3: Model-Layer vs App-Layer — the diagnostic question that determines who owns the fix
- Context Engineering — the cross-layer discipline of controlling what information reaches the model
- Layer 4: Safety and Governance — the conversations PMs keep dodging, and why they determine whether customers stay