
Observability vs evals vs product analytics.

AI agent observability, evals, and product analytics answer different questions. Here is what each layer measures, where it stops, and which one tells you if users got value.

Amadin Ahmed · 8 min read · updated May 4, 2026

Your AI agent has three layers of tooling watching it. Observability confirms the system ran. Evals confirm the model passed prepared cases. Product analytics confirms someone opened the app. None of these confirms the user got value. This is the gap that [product analytics for AI agents](/blog/what-is-locus) was built to close.

Most teams running a production AI agent have at least three monitoring tools in their stack. A typical setup looks like Datadog or OpenTelemetry for infrastructure, Langfuse or Braintrust or LangSmith for traces and evals, and Mixpanel or Amplitude for user-facing dashboards. Each tool does something well. The problem is the question none of them answers: what did the user actually get out of this? The data is already there, sitting in your trace store as free text nobody reads, and the user groups that matter are behavioural ones, not plan tiers, because a plan tier is not a behaviour. This post maps exactly what each layer does, where it stops, and which gap costs product teams the most. For the full vendor-by-vendor breakdown — every observability tool, every eval platform, and every product analytics extension in 2026 — see the LLM agent observability and product analytics landscape.

What does AI agent observability actually measure?

AI agent observability is the practice of collecting telemetry from a running system to confirm it is operating within defined bounds. Tools like Datadog, New Relic, Honeycomb, and the OpenTelemetry standard give you spans, traces, latencies, error rates, and throughput. If your agent makes a tool call that takes 4,200ms when the SLA is 3,000ms, observability catches it. If the LLM returns a 500, observability logs it. If the agent retries three times and eventually succeeds, observability shows you the retry chain.
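To make that concrete, here is a minimal sketch of what the observability layer records for a single tool call, using the OpenTelemetry Python API. The span name, attributes, and the stand-in tool are illustrative assumptions, not a prescribed schema.

```python
# A minimal sketch of what observability records for one agent tool call.
# Span name, attributes, and the stand-in tool are illustrative, not a standard schema.
import time
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def search_tool(query: str) -> dict:
    # Hypothetical stand-in for a real tool call.
    time.sleep(0.1)
    return {"status": "ok", "results": 3}

def run_tool_call(query: str) -> dict:
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("tool.name", "search")
        result = search_tool(query)
        span.set_attribute("tool.status", result["status"])
        return result
```

Everything the span captures is about the system: which tool ran, how long it took, whether it errored. What the user actually asked for, and whether the result helped them, never enters the picture.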

What observability does not do is read the conversation itself. It knows the request happened. It does not know what the user typed, what they were trying to accomplish, or whether the output was useful. A trace can show you that an agent run completed in 2.1 seconds across four spans. It cannot show you that the user then re-did the work manually in Google Docs. Teams running LLM applications in production routinely name the lack of user-level insight as one of the biggest gaps in their observability setup.

Observability answers: did the system stay up and respond within bounds? That is necessary. It is not sufficient.

What do evals actually measure?

An eval is a test case run against a model to confirm it produces the expected output on a known input. Tools like Braintrust, OpenAI Evals, Ragas, LangSmith evaluations, and Langfuse scoring let you build a fixed dataset of prompt-response pairs and measure how the model performs against them on every deploy. If the model used to produce the right SQL query for a known question and now it does not, the eval catches the regression.
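Stripped of vendor tooling, an eval is just a fixed dataset and a scoring loop. The sketch below is a generic version under that assumption; `generate_sql` and the dataset are hypothetical stand-ins, not any specific platform's API.

```python
# A generic eval loop: known inputs, expected outputs, a pass rate on every deploy.
# generate_sql and the dataset are hypothetical stand-ins, not a vendor API.

EVAL_SET = [
    {
        "prompt": "monthly active users by month",
        "expected": "SELECT date_trunc('month', ts) AS m, count(DISTINCT user_id) FROM events GROUP BY 1",
    },
    # ...more prompt/expected pairs you thought to write down
]

def generate_sql(prompt: str) -> str:
    """Stand-in for the model call under test."""
    return "SELECT 1"  # replace with your real model call

def eval_pass_rate() -> float:
    passed = sum(
        1
        for case in EVAL_SET
        if generate_sql(case["prompt"]).strip() == case["expected"].strip()
        # real suites usually use fuzzier scoring or an LLM judge instead of exact match
    )
    return passed / len(EVAL_SET)

print(f"pass rate: {eval_pass_rate():.0%}")
```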

The limit of evals is that they measure a prepared world. They tell you the model performs on the cases you thought to test. They do not tell you what real users are asking for. A team at Anthropic noted in their model card documentation that eval suites are a necessary baseline but do not substitute for production behavioural monitoring. In practice, a passing eval suite and a 12% retention drop can coexist. The model did not regress. The users just started asking for something the eval set never covered.

Evals answer: did the model regress on prepared cases? That is necessary. It is not sufficient.

What does traditional product analytics measure for AI agents?

Traditional product analytics counts events: clicks, page-views, sessions, conversions. Mixpanel, Amplitude, and PostHog were built for SaaS apps where every interaction fires a named event. A user clicks a button. The button fires an event. The event goes into a warehouse. A chart counts it. Funnels, cohorts, retention curves, and A/B tests all sit on top of that event stream.
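As a point of reference, the sketch below shows what that instrumentation typically looks like, here with the Mixpanel Python client; the event name and properties are invented for the example.

```python
# Classic SaaS instrumentation: one named event per interaction, with a known schema.
# The event name and properties are invented for the example.
from mixpanel import Mixpanel

mp = Mixpanel("YOUR_PROJECT_TOKEN")

# A button click maps cleanly onto a single event.
mp.track("user_123", "Report Exported", {"format": "csv", "plan": "pro"})
```

The closest equivalent for an agent is an event like Message Sent, which says nothing about what the message asked for or whether the reply was any good.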

AI agents do not produce events in the same way. The user types a sentence. The agent reads it, thinks, maybe calls a tool, and writes back. There is no button called writing and no button called code. There is no funnel with four steps. The whole interaction is one paragraph of free text followed by one paragraph of response. Amplitude can tell you the user opened the app. It cannot tell you the user asked for a deployment script, got a working one, and deployed it. Counting active users on an AI agent is like counting page-views on a phone call. The number goes up. It does not mean what you think it means.

Traditional product analytics answers: did they open the app? That is necessary. It is not sufficient.

Which question is nobody answering?

The question nobody answers is: did the user get value from what the agent produced? Not did the system run. Not did the model pass a test. Not did the user open the app. Did the user actually get the thing they came for, and did they act on it, and did they come back?

That question requires reading the conversation itself. It requires classifying what the user was trying to do. It requires noticing when a user accepts the output but then re-edits half of it in another tool. It requires tracking whether trust is growing or eroding week over week. No tool in the traditional stack was built for this. The structured telemetry layer (observability) does not read text. The model-quality layer (evals) does not read production traffic. The user-behaviour layer (product analytics) does not read conversations.
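Closing that gap means running something like the sketch below over every production conversation: a classification step that labels what the user was trying to do. The categories, prompt, and model are illustrative assumptions, not a description of how any specific product implements it.

```python
# Illustrative sketch: classifying the intent of one production conversation with an LLM.
# The categories, prompt, and model choice are assumptions made for the example.
from openai import OpenAI

client = OpenAI()

INTENTS = ["code_generation", "deployment_help", "debugging", "writing", "other"]

def classify_intent(conversation_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify what the user was trying to accomplish. "
                           f"Reply with exactly one of: {', '.join(INTENTS)}.",
            },
            {"role": "user", "content": conversation_text},
        ],
    )
    label = (response.choices[0].message.content or "").strip()
    return label if label in INTENTS else "other"
```

Run across every conversation rather than a hand-picked sample, labels like these are what make behavioural groups and intent drift visible at all.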

In a typical production agent, around 31% of completed runs end with the user editing more than half the output (illustrative, based on early Locus snapshot data). Those runs count as completed in your observability tool and as active sessions in your product analytics dashboard. They are not failures by any metric in the traditional stack. They are failures the user absorbed silently.

How do the three layers compare side by side?

  • Observability (Datadog, OpenTelemetry, New Relic, Honeycomb): measures system health. Tells you the agent ran, latency was in-band, tool calls succeeded. Does not tell you what the conversation was about or whether the user got value.
  • Evals (Braintrust, OpenAI Evals, Ragas, LangSmith, Langfuse scoring): measures model quality on a fixed test set. Tells you the model has not regressed on prepared cases. Does not tell you what real users are asking for in production.
  • Traditional product analytics (Mixpanel, Amplitude, PostHog): measures engagement events. Tells you users opened the app and how many sessions they had. Does not tell you what they used the agent for or whether they got what they came for.
  • Product analytics for AI agents (Locus): measures user value at the conversation layer. Reads every conversation, classifies intent, groups users by behaviour, tracks value and drift over time. Tells you what your users are actually doing and whether the agent is creating value.

The four layers do not compete with one another. Each watches a different thing. The problem is that most teams have three of the four and assume the picture is complete. It is not. The conversation layer, where user value actually lives, is the one that was missing until recently.

Is observability enough to measure AI agent success?

No. Observability is a necessary floor. Without it, you do not know if the system is up. With it alone, you know the system is up but have no idea whether users are getting value. A 99.9% uptime and a 40% silent-failure rate can coexist. The agent completes the run successfully from a system perspective while the user copies the output, opens a blank document, and rewrites it from scratch. Your APM dashboard says healthy. The user says otherwise, or more often, just stops showing up.

Are evals enough to measure user value from AI agents?

No. Evals are a necessary baseline. They catch regressions. They do not catch intent drift. If your users start asking for deployment help and your eval set only covers code generation, the model can score 98% on evals while an entire cohort of users leaves unsatisfied. The gap between what your eval set covers and what your users actually ask for grows silently over time. For a deeper look at why this happens, see why agents pass evals but still fail users. Only a layer that reads production conversations can see it.

How is product analytics different for AI agents than for SaaS?

In a traditional SaaS app, the user clicks buttons. Each click fires a named event. The analytics tool counts events and builds funnels. In an AI agent, the user types a sentence and the agent responds with a paragraph. There is no button to name. There is no funnel with four steps. The unit of behaviour is the conversation, not the event. This means the entire architecture of traditional product analytics, which is built on event schemas, does not apply. You need a tool that reads the conversation itself and extracts intent, value signals, and behavioural groups from the text. That is product analytics for AI agents.

What should a production AI team actually run?

A production AI team that wants the full picture needs all four layers. Here is the minimum viable stack:

  1. Observability for system health. Datadog, OpenTelemetry, or your existing APM. This is non-negotiable for any production service.
  2. Evals for model regression. Braintrust, LangSmith, or Langfuse. Run on every deploy. Catch regressions before they hit users.
  3. Traditional product analytics for engagement basics. Mixpanel, Amplitude, or PostHog. Know how many users are active and what their retention curve looks like.
  4. Product analytics for AI agents for the user-value layer. Locus reads from your existing trace store, classifies every conversation, groups users by behaviour, and tells you where value is being created and where it is silently eroding.

Most teams already have the first three. The fourth is what they are missing. It is the one that answers the question their CEO keeps asking: is the product actually working for users?

Frequently asked questions.

What is the difference between AI observability and AI product analytics?

AI observability watches the system. It confirms the agent ran, the tool call returned, and the latency was within SLA. AI product analytics reads the conversation itself and tells you what the user was trying to do, whether they got value, and how behaviour is changing. They sit at different layers of the stack and most production teams need both.

Can I use Langfuse or LangSmith as my product analytics tool?

Langfuse and LangSmith are trace stores built for engineering. They let you debug one response, inspect token usage, and score individual outputs. They do not group users by behaviour, classify intent across thousands of conversations, or track value drift over time. Locus reads from Langfuse and LangSmith and adds the product analytics layer on top.

Do I need evals if I have product analytics for AI agents?

Yes. Evals catch model regressions on every deploy. Product analytics for AI agents reads what real users are doing in production. They answer different questions. Evals prevent known failures from shipping. Product analytics shows you unknown failures you never thought to test for.

Why can Mixpanel or Amplitude not do this for AI agents?

Mixpanel, Amplitude, and PostHog count structured events. AI agents produce conversations, not clicks. One sentence from a user can encode a goal that would have been ten button clicks in a SaaS app. No event schema captures the difference between a user asking for a deployment script and a user asking for a code review. You need a tool that reads the text itself.

How does Locus fit into the stack alongside observability and evals?

Locus reads from the trace store you already use (Langfuse, Braintrust, LangSmith, OpenTelemetry, Datadog, OpenAI, or Anthropic). It does not replace any existing tool. It adds the conversation-level product view that none of the others provide. The data is already there. Locus is the layer that reads it.

What does it cost to add the product analytics layer?

The first memo, on a sample of up to 500 sanitized production runs, is free. No integration is required. You share a sample, Locus reads it, and you receive a product-facing memo within a week. After that, teams that want continuous reads move to a four-week pilot. Pricing depends on conversation volume.

Tagged
AI agent observability vs analytics · AI agent observability · AI agent evals · product analytics for AI agents · LLM observability · ai agent metrics · Langfuse alternative · Braintrust alternative · LangSmith alternative · ai agent monitoring · llm evaluation tools · ai product analytics · measure AI agent value · agent evaluation tools
Done reading? Try Locus on your own runs

See what every user of your agent does.

Pick a time. We'll walk through what a snapshot would look like for your product, on your terms.