Cluster A · The data layer

The data is already there. Your team just can't read it.

Every AI agent logs every conversation. The volume is unreadable by hand and no traditional analytics tool can parse free text. Here's what that costs product teams and how to fix it with product analytics for AI agents.

Amadin Ahmed · 5 min read · updated May 4, 2026

Your AI agent had 94,000 conversations last month. Your team read 212. That gap is not a reading problem — it is a category problem. Almost no tool in your stack is built to read free text. This is the missing layer that [product analytics for AI agents](/blog/what-is-locus) was built to provide.

Every team running an AI agent in production has the same blind spot. The agent runs. The conversations log. Latency stays in band, evals pass, retention looks fine. The team still cannot answer the one question their CEO keeps asking: what are users actually doing with this thing? The reason is not effort. It is that the answer lives in plain text, and plain text does not fit any of the tools they already pay for. We covered the user side of this gap in "Why the sample of twenty fails"; this post covers the system side.

What is product analytics for AI agents?

Product analytics for AI agents is the layer that reads every conversation a production agent has had, classifies it by intent, groups users by behaviour, and reports the picture at the product level rather than at the trace level. It is the layer between observability (which watches the system) and evals (which score the model) — the one that answers what the *user* did with the output, not what the *agent* did to produce it.

Tools like Datadog, OpenTelemetry, Langfuse, Braintrust, and LangSmith are excellent at the layers they were built for — telemetry, traces, debugging, regression. None of them are built to read what a conversation was *about*. Product analytics tools like Mixpanel, Amplitude, and PostHog count clicks. There are no clicks in an agent loop. Conversations are paragraphs with intent, hedging, and repair. They do not fit into a fact table.

Why are AI agent metrics green while users are unhappy?

Because the metrics are reading the wrong layer. Every AI agent writes two kinds of output. The structured part — tool calls, retries, latency, status codes — lands cleanly in your observability stack. The unstructured part — the conversation itself, the thing the user actually typed — lands in a trace store and stops there. Your analytics warehouse never sees it. So the dashboards stay green while the user gives up, edits half the output, or quietly stops coming back. That editing pattern is shadow rework: the user accepts, then redoes the work elsewhere. Value erodes weeks before any leading indicator your stack tracks reflects it.
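To make the split concrete, here is a minimal sketch in Python. The `TraceRecord` shape, its field names, and the 2,000 ms latency budget are assumptions for illustration, not the schema of any particular trace store. The health check only ever touches the structured fields, so it reports green even when the user's own words say otherwise.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    # Structured part: what an observability stack aggregates.
    latency_ms: int
    status_code: int
    tool_calls: int
    # Unstructured part: what the user actually typed.
    user_text: str


def dashboard_is_green(records, latency_budget_ms=2000):
    """Return True when every record is inside the latency budget and returned 200.

    Only structured fields are read; user_text is never inspected.
    """
    return all(
        r.latency_ms <= latency_budget_ms and r.status_code == 200
        for r in records
    )


# Hypothetical records: every structured field looks healthy.
records = [
    TraceRecord(840, 200, 2, "Thanks, that is exactly what I needed."),
    TraceRecord(910, 200, 1, "This is wrong again. I'll just rewrite it myself."),
    TraceRecord(780, 200, 3, "Never mind, I'll do it in a spreadsheet."),
]

print(dashboard_is_green(records))  # True
```

Two of the three users in this made-up sample have given up or redone the work, and nothing the check reads can show it.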

This is not a bug in any of those tools. Analytics tools are built to count events. A conversation is not an event. It is a paragraph, and an event schema has no column for what that paragraph means.

What does each layer of the AI stack actually measure?

  • Observability tools (Datadog, New Relic, OpenTelemetry, Honeycomb) check whether the system stayed up and responded inside the SLA. They cannot read what it said.
  • Trace stores (Langfuse, Braintrust, LangSmith, Helicone) let an engineer debug one response. They are not designed to summarize ten thousand conversations a month at the product layer.
  • Eval tools (OpenAI evals, Braintrust evals, Ragas) score model output on a fixed test set. They tell you the model passed prepared cases. They do not tell you what real users are asking for in production.
  • Product analytics dashboards (Mixpanel, Amplitude, PostHog) count clicks, sessions, conversions. An AI agent has none of these. Counting active users on an agent product is like counting page-views on a phone call.
  • [Product analytics for AI agents](/blog/what-is-locus) is the layer the others all skip — what the user was trying to do, whether they got it, and how that is changing week over week.

None of these tools is wrong. Each was built for a category of data that existed before AI agents. None of them was built for the thing AI agents produce most: plain-language conversations, at scale. For a vendor-by-vendor map of every tool in each category as of mid-2026, see the LLM agent observability and product analytics landscape.

What does a product team actually lose when free text is unreadable?

A product team with a traditional analytics stack can tell you what users clicked. A product team with an AI agent often cannot tell you what users asked for. The difference is everything: one is about the product's surface, the other is about what the product was used for. Without that read, the team is forced to make decisions on a sample of twenty conversations a week out of ten thousand. How biased that sample is, and what happens to product decisions made on it, is the subject of "Why the sample of twenty fails".
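A rough calculation shows how little a sample of twenty can see. The sketch below assumes the twenty conversations are drawn at random and independently from the week's ten thousand, which is an approximation rather than a description of how any team actually picks what to read. The function name and the intent shares are illustrative.

```python
def chance_sample_sees_intent(share, sample_size):
    """Probability that a random sample contains at least one conversation
    of an intent that makes up `share` of all traffic."""
    return 1 - (1 - share) ** sample_size


weekly_total = 10_000
weekly_read = 20

coverage = weekly_read / weekly_total
print(f"share of conversations read: {coverage:.2%}")

# Illustrative intent shares, from rare to fairly common.
for share in (0.01, 0.02, 0.05, 0.10):
    p = chance_sample_sees_intent(share, weekly_read)
    print(f"intent at {share:.0%} of traffic: seen with probability {p:.2f}")
```

Under these assumptions, an intent that accounts for 2% of traffic, roughly 200 conversations a week, appears in the sample only about one week in three.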

A second, quieter loss: the team cannot see behavioural groups. Plan tier and signup country tell you nothing about what users do. The groups that matter — Writers, Researchers, Code-first, Analysts — emerge only from the conversations themselves. If you cannot read the conversations, those groups are invisible.

"The output of an AI agent is a conversation. The output of our analytics stack is a number. There is no tool that turns one into the other."
A PM, last quarter

How do I instrument an AI agent for product insights?

You do not need new instrumentation. The data is already in your trace store — most teams have months of conversations they have never read. The work is reading the pile, classifying every conversation by intent, grouping users by behaviour, and rolling that up to the product layer. Locus does this on a sample of your existing traces. The first read is free.
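The three steps named above (classify by intent, group users by behaviour, roll up to the product layer) can be sketched in a few lines of Python. This is an illustration of the shape of the work, not Locus's implementation. The keyword rules, intent names, and sample conversations are all hypothetical, and a real read would likely replace the keyword matcher with a proper classifier.

```python
from collections import Counter, defaultdict

# Hypothetical keyword rules standing in for a real intent classifier.
INTENT_KEYWORDS = {
    "write": ("draft", "rewrite", "email"),
    "research": ("summarize", "sources", "compare"),
    "code": ("function", "bug", "stack trace"),
}


def classify_intent(text):
    """Return the first intent whose keyword appears in the text, else 'other'."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "other"


def group_users(conversations):
    """Map each user to their most frequent intent (their behavioural group)."""
    per_user = defaultdict(Counter)
    for user_id, text in conversations:
        per_user[user_id][classify_intent(text)] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


def product_rollup(conversations):
    """Count conversations per intent and users per behavioural group."""
    intents = Counter(classify_intent(text) for _, text in conversations)
    groups = Counter(group_users(conversations).values())
    return intents, groups


# Made-up (user_id, user_text) pairs, as they might be pulled from a trace store.
conversations = [
    ("u1", "Draft an email to the vendor about the delay"),
    ("u1", "Rewrite this paragraph to be shorter"),
    ("u2", "Why does this function throw on empty input?"),
    ("u2", "Here is the stack trace, find the bug"),
    ("u3", "Summarize these three sources and compare them"),
    ("u3", "Draft a reply to this thread"),
    ("u3", "Compare the two pricing pages"),
]

intents, groups = product_rollup(conversations)
print(dict(intents))
print(dict(groups))
```

The rollup is the part a product team reads: how many conversations fall under each intent, and how many users belong to each behavioural group, tracked week over week.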

If you want a concrete walkthrough of what the read looks like in practice, see the Agent Value Snapshot in the pillar post. To see your own data, book a thirty-minute call. We pull a sample from whatever trace store you already use — Langfuse, Braintrust, LangSmith, OpenTelemetry, Datadog, OpenAI, or Anthropic — and produce your first memo within a week.

Frequently asked questions.

What is the difference between AI observability and AI product analytics?

Observability watches the system. It tells you the agent ran, the tool call returned, and the latency was inside the SLA. Product analytics for AI agents reads the *content* of every conversation and tells you what the user was trying to do, whether they got value, and how their behaviour is changing. They sit at different layers. Most production teams need both. For a full breakdown, see "AI observability vs evals vs product analytics".

Are evals enough to measure user value from AI agents?

No. Evals score the model on a fixed test set. They confirm the model has not regressed on prepared cases. They do not tell you what real users are asking for or whether they are acting on the output. A passing eval suite and a 12% retention drop can coexist. You need a layer that reads production conversations.

Can Mixpanel or Amplitude be used for AI agent analytics?

Not effectively. Mixpanel, Amplitude, and PostHog were built for click and event analytics. AI agents do not produce clicks — they produce conversations, where one paragraph from the user can encode a goal that would have been ten clicks in a SaaS app. Counting events on an AI product captures the surface, not the use.

How much data do I need before product analytics for an AI agent is useful?

Around two thousand conversations a month is the practical floor. Below that, behavioural groups are not stable and reading by hand is more reliable. Above two thousand, the patterns cohere and a product analytics layer pays off. For vertical-specific guidance on support agents, see how to measure AI support agent success.

Where does Locus read from?

From the trace store you already use. Locus reads OpenTelemetry, Langfuse, Braintrust, LangSmith, Datadog, OpenAI, and Anthropic. There is no new SDK to install and no change to your application code. Engineering does not have to ship anything.

Tagged: ai product analytics, llm conversation analytics, ai agent metrics, ai agent kpis, ai agent observability, ai product intelligence, agent value visibility, production agent user value, ai agent silent failure, llm app metrics, AI PM tools, Langfuse alternative, Braintrust alternative