Method

Why the sample of twenty fails.

A human can read maybe twenty AI conversations a week. Most product decisions are made on that sample. Here's why the sample is biased, and what a team can do about it.

product decisions · AI evaluation · conversation sampling · 5 min read

Locus team · Apr 14, 2026

When a team needs to ship a change to a product built on an AI agent, it reads a few conversations and decides. Which conversations get read, and which don't, determines what gets shipped.

The default method is a sample

Ask any PM with an AI product how they know what their users are doing. The honest answer is: we read some. Twenty conversations a week. Maybe fifty during a big review. A support rep flags the worst three. A designer opens a handful they remember. That's the loop.

The loop is fine, as far as it goes. It's also the reason product decisions are biased toward loud users, recent users, and users whose problems look like the last problem.

Three biases baked into manual sampling

Escalation bias. A human sees conversations that got escalated. They don't see the 400 conversations where the user gave up, didn't complain, and churned quietly next month.
Recency bias. A human remembers what happened yesterday. They don't remember the shift that started six weeks ago and crossed a threshold today.
Vividness bias. A human reads the strange, funny, or infuriating conversation. They don't read the 40 boring ones that describe the real median experience.

These biases are not the product team's fault. They're the near-inevitable result of a human trying to read at a sample ratio of roughly a thousand to one.

What the machine sees

A system that reads every conversation isn't smarter than a human reader on any single one. It's just not biased by which ones it sees. A pattern that's visible in 400 conversations but invisible in the 20 a human happened to open shows up immediately.
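To put a rough number on that, here is a minimal sketch. It assumes 10,000 conversations in the period, 400 of them showing the pattern, and a reader who opens 20 uniformly at random. The random draw is an assumption, and a generous one: the biases above make a real manual sample less representative than this. The function name is illustrative only.

```python
from math import comb


def prob_at_most(hits, total, affected, sample_size):
    """Probability that a uniform random sample of `sample_size`
    conversations contains at most `hits` of the `affected` ones
    (hypergeometric distribution, sampling without replacement)."""
    all_samples = comb(total, sample_size)
    matching = sum(
        comb(affected, k) * comb(total - affected, sample_size - k)
        for k in range(hits + 1)
    )
    return matching / all_samples


# Assumed figures: 10,000 conversations, 400 with the pattern, 20 read.
p_sees_none = prob_at_most(0, total=10_000, affected=400, sample_size=20)
p_sees_one_or_fewer = prob_at_most(1, total=10_000, affected=400, sample_size=20)

print(f"Reader sees none of the 400: {p_sees_none:.0%}")
print(f"Reader sees at most one:     {p_sees_one_or_fewer:.0%}")
```

Under these assumptions the reader misses the pattern entirely roughly four times in ten, and sees at most one example about four times in five. One example rarely reads as a pattern.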

“You can read the five conversations that matter most if something else has read the other ten thousand first.”

The better method

Keep reading conversations. Humans are better than any model at reading one conversation deeply. Use the machine for the part humans are bad at: finding the five conversations that matter, out of the ten thousand that don't.

The machine reads everything, clusters by behaviour, flags what's shifting.
The human reads the five it surfaced, decides what to ship, and writes the memo.
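The "flags what's shifting" step can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how Locus works: it assumes every conversation has already been given a behaviour cluster label by some earlier step, and it flags any cluster whose share of conversations moved by at least a chosen threshold between two periods. The labels, the `flag_shifts` name, and the 5-point threshold are all hypothetical.

```python
from collections import Counter


def flag_shifts(previous, current, min_change=0.05):
    """Compare cluster shares across two periods.

    `previous` and `current` are lists of cluster labels, one per
    conversation. Returns (label, share_before, share_after) tuples for
    clusters whose share moved by at least `min_change`, biggest move first.
    """
    prev_counts, curr_counts = Counter(previous), Counter(current)
    prev_total = len(previous) or 1  # avoid dividing by zero on empty input
    curr_total = len(current) or 1
    shifts = []
    for label in set(prev_counts) | set(curr_counts):
        before = prev_counts[label] / prev_total
        after = curr_counts[label] / curr_total
        if abs(after - before) >= min_change:
            shifts.append((label, before, after))
    return sorted(shifts, key=lambda s: abs(s[2] - s[1]), reverse=True)


# Hypothetical labels for two weeks of 100 conversations each.
last_week = ["billing"] * 50 + ["login"] * 40 + ["gave_up"] * 10
this_week = ["billing"] * 48 + ["login"] * 32 + ["gave_up"] * 20

for label, before, after in flag_shifts(last_week, this_week):
    print(f"{label}: {before:.0%} -> {after:.0%}")
```

In this made-up data the quiet "gave_up" cluster doubles its share and is flagged first, while the two-point move in "billing" stays below the threshold. The human then reads a handful of conversations from the flagged clusters rather than a handful chosen by memory or escalation.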

That's the whole idea behind Locus. See it in the live sample or bring your own traces.
