
What is shadow rework in AI products?

Shadow rework is when a user accepts an AI agent's output and then redoes the work elsewhere. It is invisible to evals, traces, and dashboards. Here is how to detect it.

Amadin Ahmed · 8 min read · updated May 4, 2026

Your agent completed the run. The user clicked accept. Then they opened another tool and rewrote half of it. That pattern has a name: shadow rework. It is the most common silent failure mode in production AI agents, and it is invisible to every tool in your stack except the one that reads conversations. This is one of the core patterns [product analytics for AI agents](/blog/what-is-locus) was built to surface.

Shadow rework is one of those problems that only exists because AI agents changed the shape of failure. In traditional software, failure is visible. The button does not work. The page throws an error. The user calls support. With AI agents, failure can look exactly like success. The agent responds. The user reads it. The user clicks accept. Every metric in your stack says the interaction went well. Then the user opens a text editor, a spreadsheet, or a Slack thread and rewrites the thing the agent just wrote. That rewrite is the rework. The shadow part is that nobody on the product team sees it happen. This gap between system completion and user value is the same structural problem behind why AI agents pass evals but still fail users.

What exactly is shadow rework?

Shadow rework is a post-acceptance failure pattern in AI products. The user receives the agent's output, signals acceptance (clicking a button, copying the text, or simply moving on), and then performs additional work outside the agent to fix, rewrite, or redo what the agent produced. The agent's logs show a completed run. The user's actual workflow shows partial failure.

The name matters because the pattern is distinct from rejection. When a user rejects an output, you see it in your acceptance-rate metric. When a user reworks an output in shadow, your acceptance rate stays high and your actual value delivery drops. That disconnect is what makes it dangerous. A team tracking only completion and acceptance can run for months believing their agent works while 30% of users are doing double the work. In support agents, the equivalent pattern is a deflected ticket that triggers a re-contact. For the support-specific version, see how to measure AI support agent success.

Why is shadow rework invisible to most tools?

Every layer of the standard AI stack stops at the moment the user accepts the output. Observability tools like Datadog and OpenTelemetry record that the system responded within SLA. Trace stores like Langfuse, Braintrust, and LangSmith log the prompt and the response. Eval tools like Braintrust evals and OpenAI Evals score the model on prepared cases. Product analytics tools like Mixpanel and Amplitude count that the session happened. None of them follow the user past the accept button.

That boundary exists because each tool was built for a world where the product's output is an action, not a paragraph. When a SaaS app processes a payment, the payment either succeeds or fails. There is no shadow version. When an AI agent writes an email draft, the draft can be technically correct, tonally wrong, and rewritten by hand. The system sees success. The user experienced failure. The evidence of that failure lives in what the user did after the conversation ended, and in most cases that data already exists. It is just not being read.

What does shadow rework look like in production?

Here are three concrete patterns from early Locus snapshot data (illustrative, based on anonymized production runs).

  1. The accepted-then-edited draft. A writing agent produces a customer email. The user clicks accept. Within four minutes, the same user pastes the email into Gmail and rewrites the opening paragraph, changes the tone from formal to conversational, and removes a sentence the agent hallucinated. The agent's log says completed. The user did 40% of the work again.
  2. The code that compiled but did not ship. A coding agent generates a database migration. The tests pass. The user commits it. Two hours later, the user rewrites the migration by hand because the agent's version did not handle a null-column edge case the eval set never tested. The agent's completion rate: 100%. The code that shipped: 0% agent-written.
  3. The summary nobody used. A research agent produces a five-paragraph summary of a competitor's pricing page. The user reads it, opens the pricing page directly, and writes their own notes. The agent saved zero time. The session counted as a completed run.

In each case, the standard metrics say the agent worked. Completion rate, latency, eval score, even thumbs-up rating would all look green. The rework is visible only if you track what the user did with the output after the conversation ended.

How do you detect shadow rework?

Detecting shadow rework requires correlating the agent conversation with the user's downstream actions. This is harder than counting clicks, but possible with the signals most production agents already log. There are four approaches, in order from simplest to most complete.

  1. Edit-rate tracking. If your agent has an inline editor (a code agent with a diff view, a writing agent with a text box), measure how much the user changes the output before committing. An edit rate above 50% on a "completed" run is a strong shadow-rework signal. In early Locus data, 31% of completed runs showed edit rates above 50% (illustrative). A minimal sketch of this signal and the next one follows this list.
  2. Time-to-next-action. If the user accepts the output and then immediately starts a new session on the same topic, the first output likely did not land. A gap under 5 minutes between accept and retry correlates with rework at roughly 0.6 (illustrative).
  3. Downstream-tool correlation. If you can see what the user does in adjacent tools (email sent, code committed, document saved), compare the agent's output to the final artifact. A semantic similarity below 0.7 between the agent output and the shipped artifact is a rework flag. This is the most reliable signal and the hardest to instrument manually.
  4. Conversation-layer analysis. Read every conversation the agent had, classify the user's post-output behaviour (accepted as-is, edited lightly, edited heavily, abandoned, retried), and roll it up per cohort. This is what product analytics for AI agents does. It is the only approach that scales past a hundred conversations a week.
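
Here is a minimal sketch of approaches 1 and 2, assuming you already log the agent's output, the text the user eventually committed, and session timestamps. The field names (`agent_output`, `committed_text`, `next_session_on_topic_at`) and the thresholds are hypothetical, not a Locus API; adapt them to whatever your trace store actually records.

```python
# Minimal sketch of approaches 1 and 2. Field names and thresholds are
# hypothetical -- adapt them to whatever your trace store actually logs.
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher
from typing import Optional

EDIT_RATE_FLAG = 0.50     # more than half the output changed before commit
RETRY_WINDOW_MIN = 5      # new session on the same topic within 5 minutes

@dataclass
class CompletedRun:
    agent_output: str
    committed_text: str                              # what the user actually shipped
    accepted_at: datetime
    next_session_on_topic_at: Optional[datetime] = None

def edit_rate(run: CompletedRun) -> float:
    """Fraction of the agent's output the user changed before committing."""
    similarity = SequenceMatcher(None, run.agent_output, run.committed_text).ratio()
    return 1.0 - similarity

def quick_retry(run: CompletedRun) -> bool:
    """True if the user restarted on the same topic inside the retry window."""
    if run.next_session_on_topic_at is None:
        return False
    gap_minutes = (run.next_session_on_topic_at - run.accepted_at).total_seconds() / 60
    return gap_minutes < RETRY_WINDOW_MIN

def is_shadow_rework(run: CompletedRun) -> bool:
    return edit_rate(run) > EDIT_RATE_FLAG or quick_retry(run)
```

SequenceMatcher gives a cheap character-level similarity. For approach 3 you would swap in an embedding-based comparison between the agent's output and the shipped artifact and flag anything below roughly the 0.7 mark mentioned above.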

Most teams start with approach 1 or 2 and graduate to approach 4 when the volume makes manual correlation impossible. The important thing is to start measuring it at all. A team that does not track shadow rework is flying with a broken altimeter.

Why does shadow rework matter for product decisions?

Shadow rework distorts every metric a product team uses to make decisions. Completion rate looks high. Usage looks healthy. The agent appears to be working. But the user is doing double the work, and the product team is building a roadmap on false signal. When retention eventually drops, the team has no leading indicator to explain why.

Here is the arithmetic. If 31% of completed runs involve shadow rework, and the average rework replaces 55% of the agent's output (illustrative), then the agent is delivering about 83% of the value the completion rate implies. That 17% gap is real time users spend redoing work the agent was supposed to handle. Over a month with 10,000 completed runs, that is roughly 1,700 runs' worth of work the users redid by hand, a cost the product team never saw.
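
The same arithmetic as a back-of-the-envelope script, with every input taken from the illustrative figures above:

```python
# Back-of-the-envelope value estimate. All inputs are the illustrative
# figures from the paragraph above, not measured constants.
completed_runs = 10_000
rework_rate = 0.31      # share of completed runs that involve shadow rework
rework_share = 0.55     # average share of the output redone in those runs

value_delivered = 1 - rework_rate * rework_share                      # ~0.83
redone_run_equivalents = completed_runs * rework_rate * rework_share  # ~1,700

print(f"Value delivered vs. completion rate: {value_delivered:.0%}")
print(f"Hidden rework: ~{redone_run_equivalents:,.0f} runs' worth of redone work")
```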

Shadow rework also concentrates in specific behavioural cohorts. Writers might rework 45% of outputs while code-first users rework 18%. The aggregate rate masks the cohort-level problem. The writers churn. The product team blames retention. They never trace it back to the fact that the agent's writing quality for that specific cohort was not good enough to use without rework.
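
A quick illustration of how the aggregate hides that split, with hypothetical cohort sizes:

```python
# How an aggregate rework rate masks a cohort-level problem.
# Cohort sizes and rates are hypothetical, matching the example above.
cohorts = {
    "writers":    {"runs": 2_000, "rework_rate": 0.45},
    "code_first": {"runs": 8_000, "rework_rate": 0.18},
}

total_runs = sum(c["runs"] for c in cohorts.values())
aggregate = sum(c["runs"] * c["rework_rate"] for c in cohorts.values()) / total_runs

print(f"Aggregate rework rate: {aggregate:.0%}")  # ~23%, looks tolerable
# ...while the writers cohort is reworking nearly half of what it accepts.
```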

What should a product team do about shadow rework?

The first step is to measure it. You cannot fix a problem you cannot see. Add at least one rework signal (edit rate, time-to-retry, or downstream comparison) to the post-deploy dashboard. If you already have a trace store like Langfuse, Braintrust, or LangSmith, the conversation data is there. It just needs a read layer on top.

The second step is to segment by cohort. Shadow rework is rarely uniform. Some groups of users rework heavily. Others accept outputs with minimal edits. The improvement that helps the high-rework cohort is usually different from the improvement that helps the low-rework cohort. Building for the aggregate means building for neither.

The third step is to treat rework rate as a product metric, not an engineering metric. Engineering owns the eval score and the latency budget. Product owns whether the user got value. Shadow rework sits squarely on the product side. It belongs in the weekly product review, next to retention and engagement. For a full picture of which tool measures what layer, see AI observability vs evals vs product analytics.

Frequently asked questions

What is shadow rework in AI products?

Shadow rework is when a user accepts an AI agent's output and then redoes the work in another tool. The agent records a successful completion. The user got partial value at best. It is called shadow rework because it is invisible to the agent's own metrics, to observability tools, and to evals.

How common is shadow rework in production AI agents?

In early Locus snapshot data, around 31% of completed runs showed the user editing more than half the output (illustrative). The rate varies by cohort and use case. Writing-heavy cohorts tend to rework more than code-heavy cohorts. Teams that do not measure it typically underestimate the rate by a factor of 2 to 3.

How is shadow rework different from rejection?

Rejection is visible. The user clicks thumbs-down, regenerates, or abandons the conversation. Shadow rework is invisible. The user accepts the output and reworks it elsewhere. Rejection shows up in acceptance-rate dashboards. Shadow rework does not. Both reduce user value, but only rejection appears in standard metrics.

Can evals detect shadow rework?

No. Evals test model output against a prepared expected answer. They do not track what the user does after the output is delivered. An eval can confirm the model produced a correct response. It cannot confirm the user used that response without modification. Shadow rework is a post-output phenomenon that requires a different measurement layer. For more on what evals miss, see why AI agents pass evals but still fail users.

What metrics should I track to catch shadow rework?

Start with edit rate (how much the user changes the output before committing) and time-to-next-action (how quickly the user starts a new session on the same topic). If you can access downstream artifacts, add output-to-shipped similarity. For a complete conversation-layer view that catches shadow rework, product analytics for AI agents reads every conversation and flags rework patterns per cohort.

Does shadow rework affect retention?

Yes, but with a delay. Users who shadow-rework consistently are getting less value per session than their completion rate suggests. Trust erodes over weeks. When they stop using the agent, the retention curve drops, but the product team has no signal linking the drop to the rework that preceded it. Tracking rework rate gives the team a leading indicator that arrives weeks before the retention signal. For a full breakdown of how to measure trust in AI agent outputs, see the companion post on trust signals.

Done reading? Try Locus on your own runs

See what every user of your agent does.

Pick a time. We'll walk through what a snapshot would look like for your product, on your terms.