The Eval Gap: Why Most AI Products Ship Blind
Why evaluation is the biggest genuine gap most product managers have. Teams ship AI features without knowing whether they work, discover regressions from customer complaints, and can't measure impact. This is Layer 2 of the AI Fluency framework.
Two days after shipping the latest model update to production, the complaints start trickling in. A customer’s workflow that was working fine last week is now returning hallucinated case citations. A legal summary that used to be precise is now adding unsupported caveats. A contract extraction task is occasionally returning fields in the wrong order.
You had no warning. No eval suite caught it. The model provider had silently updated their checkpoint and your product quietly degraded.
This is the norm, not the exception.
You shipped blind.
This is Layer 2 of the AI Fluency framework: evaluation. And it is the biggest gap most product managers have. Not because evals are mysterious or require a PhD in statistics. Because most teams don’t have them at all, or have them in fragmented, ad hoc forms that don’t catch real failures.
If Layer 1 (How Models Work) is about knowing what you’re building with, Layer 2 is about knowing whether it works.
You cannot make evidence-based product decisions without it. You cannot measure the impact of a model upgrade, a prompt change, or a retrieval improvement. You cannot set realistic expectations for stakeholders. And you will learn about regressions from angry users, not from a test that runs before deployment.
Why Evals Are a PM Skill, Not Just an Engineering Problem
The scarcest skill in AI product management right now is not prompt engineering. It is evaluation. And most PMs do not have it.
You are accountable for quality. If you don’t have evals, you are flying blind. That should make you uncomfortable.
When engineering says “the new retrieval pipeline looks better,” you need to ask: better according to what measurement? If they can’t answer quickly, you don’t have evals. When an executive asks whether upgrading the model is worth the cost, you need data. If you don’t have historical eval runs on your golden dataset, you’re guessing.
Evals are how you:
- Prioritise: should I invest in improving the prompt, the retrieval, or the model?
- Set expectations: “we’re 85% accurate on the golden dataset, with these specific failure modes”
- Make evidence-based decisions: this model change improved faithfulness by 3 percentage points, and here’s what that means for users
- Catch regressions before shipping: run evals before deployment, not after complaints arrive
- Compound quality over time: every bug that reaches production becomes a regression test, preventing it from happening again
Without evals, every claim about quality is opinion. With them, you have the data to defend or challenge it.
The Evaluation Hierarchy: Four Layers from Cheap to Reliable
Not all evals are equal. Think of them as a hierarchy from fast and cheap to slow and reliable:
                 ▲  high reliability, high cost
                ╱ ╲
               ╱   ╲        Expert Review
              ╱     ╲       (days · £££ · gold standard)
             ╱───────╲
            ╱         ╲     Human Spot Check
           ╱           ╲    (hours · ££ · catches edge cases)
          ╱─────────────╲
         ╱               ╲  LLM-as-Judge
        ╱                 ╲ (minutes · £ · scalable, needs calibration)
       ╱───────────────────╲
      ╱                     ╲  Automated Unit Tests
     ╱                       ╲ (seconds · ~free · narrow, brittle)
                 ▼  low reliability, low cost
You don’t pick one. A healthy eval system uses all four, with expensive methods calibrating cheap ones.
- Level 4: Automated Unit Tests. Seconds to run, nearly free. Test narrow, deterministic cases. Catch obvious format violations: does the output have the required JSON fields? Is the extracted date in the right format? These are your canary tests: they detect catastrophic failure quickly.
- Level 3: LLM-as-Judge. Minutes to run, cheap at scale. A second language model scores the outputs from your production model. The judge reads the input, the model’s response, and optionally a reference answer, then scores on a specific dimension: faithfulness, relevance, tone, whatever you specify. Scales to thousands of examples. Requires calibration.
- Level 2: Human Spot Checks. Hours to days to run, expensive. Humans read through a sample of outputs and score them. This is where you catch what automation misses: edge cases, subjective judgments, and domain-specific context that an LLM judge might overlook.
- Level 1: Expert Review. Days to weeks, most expensive, gold standard. For high-stakes failures or major releases, you bring in domain experts (lawyers, doctors, etc.) to comprehensively assess the output.
Expert review is also the rarest. Most companies simply cannot afford to keep domain specialists on retainer for ongoing eval work. A medical AI product needs clinicians to review outputs; a legal AI product needs practising lawyers. These people are expensive, scarce, and have day jobs that aren’t reviewing your model’s outputs. The companies that do manage to build sustained expert review into their eval process tend to develop a substantial quality moat. Their golden datasets are richer, their failure modes are better understood, and their products improve faster because they have ground truth that competitors are guessing at. If you can find a way to get even occasional expert review, even a few hours a quarter on your hardest cases, it is worth disproportionate investment.
A healthy production eval system runs Level 4 continuously (every commit, every model update). Level 3 runs at every deployment candidate. Level 2 runs periodically, maybe weekly or monthly. Level 1 runs for major releases or after quality incidents.
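To make Level 4 concrete, here is a minimal sketch of the kind of deterministic format check described above: required JSON fields and a date format. The field names and the YYYY-MM-DD convention are assumptions for illustration, not a schema from any particular product.

```python
import json
import re
from datetime import datetime

# Hypothetical schema for a contract-extraction output.
REQUIRED_FIELDS = {"party_a", "party_b", "effective_date"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value) -> bool:
    """True only for strings shaped like YYYY-MM-DD that are real calendar dates."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_output(raw: str) -> list[str]:
    """Return a list of format violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    missing = sorted(REQUIRED_FIELDS - set(data))
    problems = [f"missing field: {name}" for name in missing]

    # Only check the date format when the field is actually present.
    if "effective_date" in data and not is_iso_date(data["effective_date"]):
        problems.append(f"effective_date is not YYYY-MM-DD: {data['effective_date']!r}")
    return problems
```

Checks like this say nothing about whether the extracted values are correct. They only catch the cheap, catastrophic failures, which is why they sit at the bottom of the hierarchy.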
The Golden Dataset: Your Foundation
I learned this one the hard way. I built a tool that monitors Reddit threads across subreddits relevant to my product space, tracking commentary about my product and its competitors and scoring sentiment across four dimensions: usability, pricing, accuracy, and reliability. The scores feed an interactive dashboard and trigger alerts when sentiment shifts sharply.
The problem was consistency. The same type of comment would score differently across runs. A frustrated post about pricing might register as moderate negativity one day and severe the next. I spent time tuning the prompt, adjusting the scoring rubric, tweaking the model parameters. Nothing stuck.
The fix was embarrassingly simple: I didn’t have a golden dataset. The model had no baseline for what a “strongly negative usability comment” actually looked like versus a “mildly negative” one. I spent a few hours manually curating a set of real Reddit comments with hand-scored classifications across all four dimensions. Once the model had concrete examples to calibrate against, consistency improved dramatically.
That is the golden dataset in miniature. Without it, you are asking the model to invent a standard. With it, you are giving it one.
The golden dataset is the spine of your eval system. It is a curated set of (input, expected output) pairs representing the core tasks your product performs.
A good golden dataset has these properties:
- Representative: covers the distribution of real user queries, not just the easy cases that your model already handles well
- Diverse: includes edge cases, adversarial inputs, underrepresented scenarios that are likely to trip up the model
- High quality: expected outputs were produced or verified by domain experts, not pulled from model outputs
- Versioned: tracked in version control; every change is documented
- Sized right: large enough to be statistically meaningful (100-500 examples for most tasks) but small enough that you can manually review it
How do you build one? Start with production: collect real user queries (with privacy protections). Don’t use the model’s previous outputs as ground truth; that is circular reasoning. Instead, have domain experts produce or verify the correct outputs. Then deliberately add hard cases: queries designed to provoke hallucinations, ambiguous inputs, edge cases. Make sure you cover different task types, user segments, domains.
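One way to keep those properties honest is to store each example as a structured record and audit the set before relying on it. The sketch below is illustrative only: the field names, the "hard" tag, the "model" marker, and the 15% floor for hard cases are all assumptions, not a standard.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    input_text: str
    expected_output: str
    verified_by: str              # who produced or verified the expected output
    task_type: str
    tags: tuple[str, ...] = ()    # e.g. ("hard", "adversarial")


def audit(dataset: list[GoldenExample], min_hard_share: float = 0.15) -> list[str]:
    """Return warnings about the dataset; an empty list means no issues were found."""
    if not dataset:
        return ["dataset is empty"]
    warnings = []

    # Duplicate ids make regressions hard to trace back to a specific example.
    counts = Counter(example.example_id for example in dataset)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        warnings.append(f"duplicate ids: {duplicates}")

    # Expected outputs copied from a model are circular ground truth.
    model_sourced = sorted(e.example_id for e in dataset if e.verified_by.lower() == "model")
    if model_sourced:
        warnings.append(f"expected output not human-verified: {model_sourced}")

    # Too few deliberately hard cases makes the eval optimistic.
    hard_share = sum("hard" in e.tags for e in dataset) / len(dataset)
    if hard_share < min_hard_share:
        warnings.append(f"only {hard_share:.0%} of examples are tagged hard")
    return warnings
```

Stored as one record per line in a file under version control, a dataset like this gets a documented history for free.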
The anti-patterns are worth naming because teams fall into them constantly:
- Sampling only from easy cases: if your golden dataset over-represents queries the model already handles well, your evals will be optimistic and your failure modes invisible
- Using model outputs as ground truth: this is circular reasoning; the model confirms its own correctness
- Freezing the dataset: as user behaviour evolves, a static eval set becomes less representative; review and refresh it regularly
Once you have a golden dataset, it becomes your regression baseline. Every model update, every prompt change, every retrieval improvement gets evaluated against it before shipping.
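When you compare a candidate against that baseline, dataset size determines how much a difference in pass rate actually means. A rough sketch, using the Wilson score interval for a pass rate: the function names are mine, and treating non-overlapping intervals as a "clear" regression is a deliberately conservative assumption rather than a formal significance test.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (Wilson score interval)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need 0 <= passes <= total and total > 0")
    rate = passes / total
    denominator = 1 + z**2 / total
    centre = (rate + z**2 / (2 * total)) / denominator
    half_width = z * math.sqrt(rate * (1 - rate) / total + z**2 / (4 * total**2)) / denominator
    return centre - half_width, centre + half_width


def is_clear_regression(baseline_passes: int, candidate_passes: int, total: int) -> bool:
    """True when the candidate's whole interval sits below the baseline's interval."""
    baseline_low, _ = wilson_interval(baseline_passes, total)
    _, candidate_high = wilson_interval(candidate_passes, total)
    return candidate_high < baseline_low
```

On 100 examples, a pass rate of 85% carries an interval of roughly 77% to 91%, so a drop from 85 to 83 passes is well within noise. Smaller drops than this rule detects can still matter; inspect the individual failing examples rather than relying on the aggregate alone.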
LLM-as-Judge: How It Works and Why It Drifts
LLM-as-judge is the most important scalability mechanism in AI evaluation. Here is the pattern:
- Your production model generates a response to an input.
- A judge prompt sends the input, the response, and optionally the expected output or retrieved context to a judge model (typically a frontier model, something more capable than the model under evaluation).
- The judge returns a score, a classification, or a rationale.
That’s it. And it scales to thousands of examples at a fraction of the cost of human review.
But judge quality depends entirely on the judge prompt. Don’t ask for general “quality”; ask for specific dimensions. A good judge prompt:
- Evaluates one dimension at a time: faithfulness, completeness, schema compliance, or whatever matters most. Combining dimensions in a single prompt muddies the signal.
- Provides a rubric: the judge needs to know what constitutes a 1 vs. a 4, with concrete examples of each score level.
- Asks for chain-of-thought reasoning before scoring: this forces the judge to justify its rating, which reduces arbitrary scoring.
- Includes a reference answer when available: giving the judge the expected output makes comparison explicit rather than leaving it to infer what “correct” means.
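As a rough illustration of those four properties, the sketch below assembles a single-dimension judge prompt and parses the score from the reply. The rubric wording, the prompt layout, and the "SCORE:" convention are all assumptions of mine. It deliberately makes no API call, since how you send the prompt depends on your provider. A real rubric would also carry concrete examples for each score level.

```python
import re
from typing import Optional

# Hypothetical 1-4 rubric for one dimension (faithfulness).
FAITHFULNESS_RUBRIC = {
    1: "Most claims are unsupported by, or contradict, the provided context.",
    2: "Several claims are unsupported by the context.",
    3: "One minor claim is unsupported; the rest is grounded in the context.",
    4: "Every claim is directly supported by the context.",
}


def build_judge_prompt(user_input: str, response: str, rubric: dict[int, str],
                       dimension: str = "faithfulness",
                       reference: Optional[str] = None) -> str:
    """Assemble a judge prompt: one dimension, a rubric, reasoning before the score."""
    rubric_lines = "\n".join(f"{score}: {text}" for score, text in sorted(rubric.items()))
    parts = [
        f"You are grading one dimension only: {dimension}.",
        f"Rubric:\n{rubric_lines}",
        f"Input:\n{user_input}",
        f"Response to grade:\n{response}",
    ]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    parts.append("Explain your reasoning step by step, then finish with a final line "
                 "in the form 'SCORE: <number>'.")
    return "\n\n".join(parts)


def parse_score(judge_reply: str, rubric: dict[int, str]) -> Optional[int]:
    """Return the last 'SCORE: n' in the reply, or None if absent or off the rubric."""
    matches = re.findall(r"SCORE:\s*(\d+)", judge_reply)
    if not matches:
        return None
    score = int(matches[-1])
    return score if score in rubric else None
```

Returning None for unparseable replies, rather than guessing, lets you count and inspect them instead of silently folding them into the average.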
Judge models also have known biases:
- Verbosity bias: rating longer outputs higher, independent of quality
- Position bias: favouring the first option presented in a comparison
- Self-enhancement bias: rating outputs produced by the judge’s own model (or model family) higher than they deserve
- Sycophancy: agreeing with the framing in the eval prompt without scrutiny
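Some of these biases can be probed mechanically. For position bias in pairwise comparisons, a common mitigation is to ask the judge twice with the order swapped and accept only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; the two stub judges exist only to show the behaviour.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_output, second_output) -> verdict,
# where the verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, question: str, output_a: str, output_b: str) -> str:
    """Return "a", "b", or "tie" when both orderings agree, otherwise "inconsistent"."""
    pass_1_labels = {"first": "a", "second": "b", "tie": "tie"}
    pass_2_labels = {"first": "b", "second": "a", "tie": "tie"}
    verdict_1 = pass_1_labels[judge(question, output_a, output_b)]
    verdict_2 = pass_2_labels[judge(question, output_b, output_a)]
    return verdict_1 if verdict_1 == verdict_2 else "inconsistent"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: it always prefers whatever came first."""
    return "first"


def prefers_shorter(question: str, first: str, second: str) -> str:
    """Stub judge whose preference depends on content (length), not position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) < len(second) else "second"
```

With the stubs, `compare_both_orders(always_first, ...)` comes back "inconsistent", while `prefers_shorter` gives the same answer in both orders. Note that swapping only exposes position bias: a judge with a consistent verbosity bias would pass this check, which is one reason human calibration is still needed.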
So you must calibrate. Compare judge scores to human scores on a sample of outputs. Measure agreement using Cohen’s Kappa or Spearman correlation. Identify systematic biases. Document the margin of error. For teams ready to go further, Parloa Labs’ research on Bayesian A/B testing for AI agents shows how hierarchical statistical models can account for variation across conversation types and reduce the risk of overinterpreting small samples.
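For a sense of what the basic calibration step involves, here is a minimal sketch that computes unweighted Cohen's kappa and the judge's average offset from paired human and judge scores. It uses only the standard library; the function names are mine, and the sample scores in the note that follows are invented for illustration.

```python
from collections import Counter


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa: agreement between two raters, corrected for chance."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: probability both raters pick the same category independently.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


def mean_offset(human: list[int], judge: list[int]) -> float:
    """Average of (judge - human); positive means the judge is more generous."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty score lists of equal length")
    return sum(j - h for h, j in zip(human, judge)) / len(human)
```

With human scores `[4, 3, 2, 4, 1, 3]` and judge scores `[4, 4, 3, 4, 2, 3]`, kappa is 0.28 and the mean offset is 0.5: the judge agrees exactly on only half the items and, when it differs, always scores higher. Unweighted kappa treats a one-point miss the same as a three-point miss; for ordinal rubrics, a weighted kappa or Spearman correlation is often a better fit.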
The trap many teams fall into is treating LLM-as-judge as the entire eval system rather than one layer of it. It is tempting because it scales so easily: you can score thousands of outputs overnight for a few dollars’ worth of API calls. But scale is not the same as reliability. An LLM judge is still a language model, with all the failure modes that implies. It can confidently score an output as faithful when a domain expert would immediately spot the hallucination. It can miss subtle reasoning errors that a human reviewer catches in seconds. It can systematically overlook an entire class of failure because nothing in the judge prompt draws attention to it.
LLM-as-judge works best as a screening layer: fast, cheap, and good enough to flag candidates for closer inspection. It does not replace human review; it reduces how much human review you need. Teams that skip the calibration step, or that never run human spot checks against their judge scores, end up with an eval system that tells them everything is fine while quality quietly degrades. A calibrated LLM judge is powerful. An uncalibrated one is a false sense of security.
The Eval Flywheel: How Quality Compounds
This is the pattern that separates mature AI teams from ones that ship blind:
Ship feature ──► Collect feedback ──► Add hard cases
      ▲           & production          to golden
      │           samples               dataset
      │                                    │
      │                                    ▼
Ship fix with ◄── Identify & fix ◄──── Run evals
new regression    failures             on golden
test                                   dataset
Every time you complete this cycle, your product gets better and your eval suite gets more comprehensive. The golden dataset grows. The regression suite grows. Your understanding of failure modes deepens. Quality compounds over time.
Teams that skip this cycle ship features once and hope for the best. Teams that follow it ship features, learn from real-world behaviour, add the hard cases that tripped them up to the eval suite, and prevent the same failure from happening again. After six months, the difference in product quality is dramatic.
But this only works if evals are fast enough to run on a cadence. If running evals takes three days and requires booking someone’s calendar, it won’t happen. Build for speed. Your regression suite should run in under 30 minutes. It should be runnable as a pre-deployment gate.
Regression Suites: Treating Every Bug as a Test
This is not a new idea. High-performing software teams have used automated regression suites for years: every bug that reaches production becomes a test case, and the suite runs before every deployment to make sure you never ship the same failure twice. The principle is identical for AI products. The difference is what you are testing: not whether code executes correctly, but whether model outputs meet a quality bar.
Every major bug that reaches production should become a regression test.
A regression test is a specific example from your golden dataset (or a new example you create) designed to catch a particular failure mode. When a user reports a quality issue, your team does root cause analysis. Was it a prompt problem? A retrieval problem? A model problem? Once you know the cause, you create a test case that reproduces the failure and add it to the regression suite.
Then you fix the issue, run the test to confirm the fix works, and leave the test in the suite permanently. Next time you update the model, change the prompt, or modify the retrieval pipeline, that test runs automatically. If the regression suite fails, the deployment is blocked. This is your CI/CD gate for AI quality.
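A regression case can be as small as a record of what went wrong plus a check that would have caught it. The sketch below is one possible shape, not a prescribed one: the field names, the example case, and the two stub models are hypothetical, and `generate` stands in for however your product actually calls the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    failure_mode: str               # what went wrong in production
    root_cause: str                 # e.g. "prompt", "retrieval", or "model"
    input_text: str
    passes: Callable[[str], bool]   # check applied to the model's output


def fields_in_order(output: str) -> bool:
    """True when all three fields appear, in the expected order."""
    positions = [output.find(name) for name in ("party_a", "party_b", "effective_date")]
    return -1 not in positions and positions == sorted(positions)


CASES = [
    RegressionCase(
        case_id="fields-out-of-order",
        failure_mode="contract extraction returned fields in the wrong order",
        root_cause="model",
        input_text="Extract the parties and the effective date from the contract.",
        passes=fields_in_order,
    ),
]


def run_suite(cases: list[RegressionCase], generate: Callable[[str], str]) -> list[str]:
    """Return the ids of failing cases; an empty list means the gate is open."""
    return [case.case_id for case in cases if not case.passes(generate(case.input_text))]


def stub_good_model(input_text: str) -> str:
    """Stand-in for a model call that returns fields in the expected order."""
    return "party_a: Acme, party_b: Globex, effective_date: 2024-03-01"


def stub_regressed_model(input_text: str) -> str:
    """Stand-in for a model call that reproduces the original failure."""
    return "effective_date: 2024-03-01, party_a: Acme, party_b: Globex"
```

Recording the failure mode and root cause alongside the check keeps the suite readable months later, when nobody remembers why a particular case exists.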
This is critical because AI regression happens silently. Model providers update their models without notice. A checkpoint update under the same name can degrade your product without any code change on your side. A prompt modification intended to improve one dimension might accidentally hurt another. A retrieval pipeline change might start returning worse documents.
If you run evals only at release time, you will miss degradations that happen between releases. Run the regression suite on a schedule: weekly, fortnightly, monthly, depending on your risk tolerance. If a regression suite fails, treat it as a production incident. Figure out what changed, decide whether to roll back or fix forward, and communicate clearly to stakeholders: which capability degraded, how severe it is, who is affected, and what you are doing about it.
How to Start: The Minimum Viable Eval Setup
You do not need perfection to get started. You need enough signal to catch obvious regressions.
Pick your core task. For a legal AI product, it might be contract clause extraction. For a research assistant, it might be answer quality on a specific domain. For a customer service bot, it might be whether the response is safe and on-topic.
Create a golden dataset of 50-100 examples. Real user queries where possible. Have one or two domain experts produce or verify the correct outputs. Include at least 10-15 deliberately hard cases: adversarial inputs, edge cases, things the model might hallucinate about.
Write a judge prompt for one key evaluation dimension: faithfulness, correctness, format compliance, whatever is most critical for your use case. Run the judge on your golden dataset. Compare the judge’s scores to a sample of human scores. Document the calibration.
Create a regression suite of 20-30 cases: your most important examples plus any cases where the model previously failed.
Add this to your CI/CD pipeline. Run it before every deployment. If any dimension drops below your threshold, block the deployment and investigate.
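The gate itself can be very small. A minimal sketch, assuming per-example scores between 0 and 1 grouped by dimension; the dimension names and threshold values are placeholders to replace with your own.

```python
from statistics import mean

# Hypothetical minimum average score per dimension.
THRESHOLDS = {"faithfulness": 0.85, "format_compliance": 0.98}


def gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> list[str]:
    """Return reasons to block the deployment; an empty list means it may proceed."""
    failures = []
    for dimension, minimum in thresholds.items():
        values = scores.get(dimension)
        if not values:
            # A missing dimension is treated as a failure, not silently skipped.
            failures.append(f"{dimension}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{dimension}: {average:.2f} is below threshold {minimum:.2f}")
    return failures
```

In a pipeline, the calling script would print the returned reasons and exit with a non-zero status whenever the list is non-empty, which is what most CI systems interpret as a blocked step.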
You also don’t need to build all of this from scratch. Several platforms exist specifically for LLM evaluation and can accelerate setup significantly. Braintrust offers managed eval pipelines with deployment gating and a generous free tier. Langfuse is open source and self-hostable, with built-in support for LLM-as-judge scoring, dataset versioning, and human annotation queues. Arize Phoenix is strong on agent-level tracing and observability. For teams that want a CLI-first approach, Promptfoo is lightweight and good for red-teaming. The right choice depends on your stack and whether you need managed infrastructure or prefer to self-host, but any of them will get you running evals faster than building a bespoke pipeline from zero. This space moves quickly; these recommendations are accurate as of April 2026.
That is enough to start. From there, the flywheel kicks in. Every failure that reaches production becomes a new regression test. Your golden dataset grows. Your evals get richer. Within a few months, you can realistically have a system that catches most regressions before they reach users.
Test Your Understanding
- Your team just upgraded to a newer version of your production model. What should happen before that change goes to production, and why?
- You have an LLM judge that scores your outputs on faithfulness. Human review of a sample shows the judge is consistently 15 percentage points more generous than humans. What does this tell you, and what do you do?
- Your golden dataset includes 500 examples. You run evals and see a 94% pass rate. Is that good? What additional information do you need before concluding the model is production-ready?
- A product manager on another team argues they don’t need to invest in evals because “the model seems to be working fine.” How do you make the case for eval infrastructure? What specific risks does skipping evals introduce?
Connection to the Framework
This is Layer 2 of the four-layer AI Fluency framework introduced in the first article in this series. Layer 1 gave you the mental model for how inference, context windows, and training shape what a model can and cannot do. Layer 2 is how you turn that understanding into measurement. If Layer 1 tells you that a model’s behaviour is probabilistic and sensitive to context, Layer 2 is the discipline that detects when that behaviour drifts in your product.
Together, they form the foundation for evidence-based decisions about AI products: you understand the machinery, and you can measure whether it is working.
The next layer covers the architecture decisions that actually deliver quality. I’m planning to split this layer across two articles: first, how you engineer context to get reliable results in your specific domain; then, how you diagnose whether a problem lives in the model or in your application.