
Why AI agents pass evals but still fail users.

AI agents can score 95% on evals and still lose user trust in production. Here is why eval suites miss silent failure, shadow rework, and intent drift, and what to measure instead.

Amadin Ahmed · 8 min read · updated May 4, 2026

Your eval suite passes. Your latency is green. Your agent completes the run. And your users are quietly re-doing the work somewhere else. This is the gap between model quality and user value that [product analytics for AI agents](/blog/what-is-locus) was built to close.

Evals are the most important quality gate a production AI agent has. Tools like Braintrust, OpenAI Evals, Ragas, LangSmith, and Langfuse scoring let a team run hundreds of test cases on every deploy and catch regressions before they hit users. That is necessary. It is also insufficient. A team that treats a passing eval suite as proof of user value is making the same mistake as a team that treats a passing unit-test suite as proof that the product shipped the right feature. Evals test the model. Users test the product. The gap between those two tests is where silent failure lives, and it is the same gap that makes the sample of twenty dangerous. For a complete map of where every eval, observability, and product analytics tool stops short of measuring user value, see the LLM agent observability and product analytics landscape.

What do evals actually test?

An eval is a prepared test case. It has a prompt, an expected output, and a scoring function. The scoring function compares the model's response to the expected output and produces a number. Run a hundred of these, aggregate the scores, and you have a regression gate. If the score drops after a model swap or a prompt change, the deploy is blocked. This is the standard workflow in Braintrust, LangSmith, and Langfuse.
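In code, the whole workflow is small. Here is a minimal sketch, with a hypothetical call_model stub, a toy exact-match scorer, and an illustrative 90% gate; hosted tools like Braintrust and LangSmith wrap this same loop in versioning, dashboards, and CI hooks:

```python
# Minimal eval harness sketch. call_model, the case data, and the 0.90
# threshold are hypothetical stand-ins, not any specific tool's API.

def call_model(prompt: str) -> str:
    """Stand-in for the real model call; swap in your client here."""
    return "30 days"  # canned response so the sketch runs end to end

def score(response: str, expected: str) -> float:
    """Toy scoring function: exact match. Real suites use rubric or
    LLM-as-judge scorers, but the shape is the same."""
    return 1.0 if response.strip() == expected.strip() else 0.0

cases = [
    {"prompt": "What is your refund window?", "expected": "30 days"},
    # ...hundreds more cases, each one imagined and written by the team
]

def run_suite(cases: list[dict]) -> float:
    """Score every case and aggregate into one number."""
    results = [score(call_model(c["prompt"]), c["expected"]) for c in cases]
    return sum(results) / len(results)

# Regression gate: block the deploy if the aggregate score drops.
if run_suite(cases) < 0.90:
    raise SystemExit("eval regression: deploy blocked")
```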

The problem is in the word prepared. Every eval case was written by someone who imagined a user need, wrote a prompt that represents it, and wrote an answer that counts as correct. That is a test of the model's ability to handle the cases the team thought to test. It is not a test of the model's ability to handle what real users are actually asking for. In Anthropic's model card documentation, the team notes that eval suites are a necessary baseline but do not substitute for production monitoring. The distinction matters more than most teams realize.

What failure modes do evals miss in production?

There are five categories of failure that a passing eval suite cannot catch. Each one is invisible to the eval harness because it requires reading production conversations, not test cases.

  1. Intent drift. Your users start asking for things your eval set never covered. A coding agent's eval suite tests code generation. Users start asking for deployment help. The eval score stays at 96%. The deployment cohort's satisfaction drops to 40%. Evals cannot see the new intent because no one wrote a test case for it.
  2. [Shadow rework](/blog/what-is-shadow-rework). The user accepts the output. Then they open another tool and rewrite 60% of it. The agent's completion rate is 100%. The user got roughly 40% of the value they needed. In early Locus snapshot data, around 31% of completed runs ended with the user editing more than half the output (illustrative). The eval suite counts those as passes.
  3. Cohort-specific failure. The agent works well for one group and poorly for another. Writers get clean outputs. Researchers get shallow summaries. The aggregate eval score stays high because the Writers' test cases dominate the set. The Researcher cohort churns, and the aggregate metric barely moves. A toy calculation after this list shows how the masking works.
  4. Trust erosion. A user who once accepted outputs without edits starts editing every response. Their trust has dropped. Their usage has not. Evals measure whether the model can produce the right output. Trust erosion measures whether the user believes it will.
  5. Context-dependent failure. The model produces a correct answer that the user cannot use because it misread the context of the conversation. A technically correct SQL query against the wrong table. A well-written email in the wrong tone. The eval's scoring function would pass it. The user closes the tab.
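The third failure mode is worth seeing as arithmetic once, because the masking is mechanical. A toy calculation, with hypothetical cohort shares and satisfaction scores:

```python
# Hypothetical numbers: Writers dominate the traffic, so their score
# dominates any traffic-weighted aggregate.
writers     = {"share": 0.85, "satisfaction": 0.96}
researchers = {"share": 0.15, "satisfaction": 0.40}

aggregate = (writers["share"] * writers["satisfaction"]
             + researchers["share"] * researchers["satisfaction"])

print(f"{aggregate:.0%}")  # 88% -- the dashboard still looks green
```

Any metric weighted by traffic behaves this way: the smaller cohort can collapse while the headline number moves by single digits.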

None of these show up in a Braintrust dashboard or a LangSmith eval run. They show up in the conversations themselves, which is why the data is already there but nobody is reading it.

Why is there a structural gap between eval scores and user value?

The gap is structural, not accidental. Evals hold the world fixed. Users do not. An eval suite freezes the prompt, the expected output, the scoring rubric, and the context. A real user brings a goal the team never anticipated, phrases it in a way the prompt template never saw, and judges the output against a standard the scoring function does not know about.

This is the same reason unit tests do not replace user research in traditional software. Tests confirm the code does what the developer intended. User research confirms the developer intended the right thing. For AI agents, the equivalent of user research is reading production conversations at scale. Not twenty a week. All of them. That is what product analytics for AI agents does.

There is also a coverage problem. Most eval suites have between 50 and 500 test cases. A production agent handles thousands of distinct intents per month. Even a well-maintained eval suite covers a small fraction of what users actually ask for. The uncovered fraction is where the failures accumulate.

What should you measure after the eval passes?

Once the eval confirms the model has not regressed, the next layer of measurement is the user-value layer. This is the layer that reads production conversations, not test cases, and extracts signals the eval cannot see. A sketch of how these signals might be computed follows the list.

  • Acceptance rate. What percentage of agent outputs does the user accept without modification? A drop from 72% to 58% over four weeks is a leading indicator of trust erosion.
  • Edit rate. When the user does modify the output, how much do they change? Editing a typo is different from rewriting the whole thing. The ratio matters.
  • Shadow rework. Does the user accept the output and then redo the work in another tool? This is the hardest signal to catch and the most valuable. It requires correlating the agent conversation with downstream actions.
  • Per-cohort trust drift. Is one behavioural group losing trust while the aggregate looks stable? A 6-point drop in the Researcher cohort, masked by steady Writer engagement, is the most common pattern Locus surfaces in early snapshots. For the mechanics of how to measure trust in AI agent outputs, see the trust signals post.
  • Intent coverage gap. What are users asking for that no eval case covers? This is the list of intents the team did not anticipate. It grows every week and nobody notices until a cohort churns.
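Once conversations have been read and labelled, the signals themselves are simple aggregations. A sketch, assuming a hypothetical run record that already carries accepted, cohort, and intent fields; deriving those labels from raw conversation content is the hard part, and this sketch takes it as given:

```python
from collections import defaultdict

# Hypothetical record shape: each production run has already been read
# and labelled with acceptance, cohort, and classified intent.
runs = [
    {"accepted": True,  "cohort": "writer",     "intent": "draft email"},
    {"accepted": False, "cohort": "researcher", "intent": "summarise paper"},
    # ...every production conversation, not a sample
]

def acceptance_rate(rs: list[dict]) -> float:
    """Share of outputs the user accepted without modification."""
    return sum(r["accepted"] for r in rs) / len(rs)

def per_cohort_acceptance(rs: list[dict]) -> dict[str, float]:
    """Acceptance per cohort, so one group's drop cannot hide in the aggregate."""
    by_cohort: dict[str, list[dict]] = defaultdict(list)
    for r in rs:
        by_cohort[r["cohort"]].append(r)
    return {c: acceptance_rate(g) for c, g in by_cohort.items()}

def intent_coverage_gap(rs: list[dict], eval_intents: set[str]) -> set[str]:
    """Intents users actually bring that no eval case covers."""
    return {r["intent"] for r in rs} - eval_intents
```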

These five signals live in the conversation content. They are not in spans, not in traces, not in latency histograms. They require a layer that reads conversations at scale. For a full breakdown of which tool does what, see AI observability vs evals vs product analytics.

What does a passing-eval, failing-user scenario look like?

Here is a concrete example. A team ships a customer-support agent. The eval suite covers 200 test cases: refund requests, shipping inquiries, password resets, account changes. The model scores 94% on every deploy. Latency averages 1.8 seconds. Error rate is 0.3%.

Three weeks after launch, users start asking the agent to explain billing discrepancies. The eval set has no billing-discrepancy cases. The model improvises. It produces responses that are polite, grammatically correct, and factually wrong about the billing logic. Users read the response, realize it is wrong, and open a support ticket with a human. The agent's completion rate stays at 100%. The eval score stays at 94%. The human support queue grows 22% in two weeks. Nobody connects the queue growth to the agent because the agent's dashboard says healthy. For the full breakdown of what to measure instead, see how to measure AI support agent success.

A system that reads every conversation would have caught this in week one. A new intent cluster, billing discrepancy, would have appeared. The acceptance rate for that cluster would have been near zero. The team would have known to add billing logic or route those queries to a human before the queue exploded.
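The shape of that week-one check is simple enough to sketch. Assuming the same hypothetical labelled-run records as above, and thresholds that are illustrative rather than recommended:

```python
from collections import defaultdict

def flag_uncovered_intents(runs, eval_intents, min_runs=25, max_acceptance=0.2):
    """Flag intent clusters that no eval case covers and that users
    are rejecting -- the billing-discrepancy pattern, caught early."""
    clusters = defaultdict(list)
    for r in runs:
        clusters[r["intent"]].append(r["accepted"])

    flags = []
    for intent, accepts in clusters.items():
        if intent in eval_intents or len(accepts) < min_runs:
            continue  # covered by an eval, or too little volume to trust
        rate = sum(accepts) / len(accepts)
        if rate <= max_acceptance:
            flags.append({"intent": intent, "acceptance": rate, "runs": len(accepts)})
    return flags

# e.g. flag_uncovered_intents(runs, {"refund request", "password reset"})
```

Run weekly over every conversation, a check like this surfaces the billing-discrepancy cluster as soon as it has enough volume to matter, weeks before the support queue reflects it.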

How do you close the gap between evals and user value?

Keep running evals. They catch regressions and prevent known failures from shipping. But add the conversation layer on top. Here is the minimum stack for a production AI agent that wants to know both whether the model works and whether users get value:

  1. Evals on every deploy. Braintrust, LangSmith, Langfuse, OpenAI Evals, or Ragas. This is the regression gate. It stays.
  2. Observability for system health. Datadog, OpenTelemetry, or your existing APM. Latency, errors, throughput. Non-negotiable.
  3. Conversation-layer product analytics. A tool that reads every production conversation, classifies intent, measures acceptance, flags shadow rework, and tracks trust per cohort. This is the layer that tells you whether the eval's passing score translates to user value. This is what Locus does.

The first two layers tell you the model works and the system is up. The third layer tells you the user got what they came for. Without the third layer, a team can ship for months with green dashboards and declining user trust.

Frequently asked questions.

Why do AI agents pass evals but still fail users?

Evals test a model on prepared inputs with expected outputs. Real users bring intents the eval set never covered, phrase them in unexpected ways, and judge the output against standards the scoring function does not know about. The gap between the prepared world of evals and the shifting world of production is where user-facing failures accumulate silently.

What is shadow rework in AI products?

Shadow rework is when a user accepts an agent's output and then re-does the work in another tool. The agent counts the run as completed. The user got partial value at best. In early Locus data, around 31% of completed runs showed this pattern (illustrative). It is invisible to evals, observability, and traditional product analytics.

What is the difference between an AI agent completing a run and the user getting value?

A completed run means the system executed without error. User value means the person got what they came for and acted on it. A 100% completion rate and a 31% shadow-rework rate can coexist. Completed run is a system metric. User value is a product metric. Product analytics for AI agents measures the second one.

How can an AI agent succeed technically but fail the user?

When the response is grammatically correct, factually plausible, and delivered within SLA, but misses the user's actual intent. A well-written email in the wrong tone. A correct SQL query against the wrong table. Technical success and user value are measured at different layers. Evals and observability measure technical success. The conversation layer measures user value.

Are evals enough to measure user value from AI agents?

No. Evals are a necessary regression gate. They catch known failure modes before they ship. They cannot catch unknown failure modes, intent drift, shadow rework, or cohort-specific trust erosion. Those require reading production conversations at scale, which is what product analytics for AI agents provides on top of the eval layer.

Why are my AI agent metrics green while users are unhappy?

Because the metrics on the dashboard measure the system layer: latency, error rate, completion rate, eval score. The user layer (acceptance, trust, rework, cohort drift) is not captured by any of those tools. The two layers can diverge for weeks before retention reflects the problem. Reading the conversations is the only way to see the divergence early.

Tagged
why AI agents pass evals but fail users · AI agent evals · AI agent silent failure · ai agent drift detection · production agent user value · ai agent trust metrics · shadow rework AI agents · AI agent acceptance rate · ai agent metrics · ai agent evaluation · measure AI agent value · completed run vs user value · ai product analytics · agent value visibility
Done reading? Try Locus on your own runs

See what every user of your agent does.

Pick a time. We'll walk through what a snapshot would look like for your product, on your terms.