How we measure fidelity.
This page documents the test we use to grade synthetic personas. We didn't invent it for marketing. We use it because synthetic respondents are unfalsifiable without it.
The headline numbers
Grounded persona: 17.80 / 20 in the first full run of this eval (one real person, 15 questions). The per-axis picture is under "Where the gap is" below; the caveats are under "What this doesn't yet prove."
What fidelity means here
A synthetic persona should sound like a specific real person. Fidelity is how close that imitation gets. We measure it across four axes, each scored 0 to 5 by a blinded judge that doesn't see which side it's grading. Sum is 0 to 20. (A scorecard sketch follows the list.)
- Speaking-style match. Does the persona use the same voice as the real source? Formal or casual. Terse or verbose. Helpful or sarcastic.
- Specific knowledge. Does the persona mention the same tools, projects, and references the real person actually uses?
- Voice signature. Does the persona use the same idioms, emoji, code-switching, and signature humor as the real source?
- Factual accuracy. Does the persona stay consistent with known facts about the real person? Job, holdings, opinions, stack.
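For concreteness, here is a minimal sketch of one judged response's scorecard under this rubric. It is illustrative only: the class and field names are ours, not versim's code.

```python
from dataclasses import dataclass

# Illustrative scorecard for one judged response. The axis names mirror
# the rubric above; the class itself is hypothetical.

AXES = ("speaking_style", "specific_knowledge", "voice_signature", "factual_accuracy")

@dataclass
class AxisScores:
    speaking_style: int      # 0-5
    specific_knowledge: int  # 0-5
    voice_signature: int     # 0-5
    factual_accuracy: int    # 0-5

    def __post_init__(self):
        # Enforce the 0-5 range on every axis.
        for axis in AXES:
            value = getattr(self, axis)
            if not 0 <= value <= 5:
                raise ValueError(f"{axis} must be 0-5, got {value}")

    def total(self) -> int:
        # The summed score: 0-20.
        return sum(getattr(self, axis) for axis in AXES)
```

A response scored 4 / 5 / 5 / 4, for example, totals 18 of 20.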
How we test
We pick a real person from the source community. We build a grounded persona for them from their distilled dossier. We also build a naive persona using the same model, told only: "You are a senior X based in Y with N years of experience." Both personas answer the same 15 questions. A blinded judge scores both.
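A minimal sketch of the two conditions, assuming the dossier is passed as the grounded persona's system prompt (this page doesn't spell out the mechanics). The function and parameter names are ours.

```python
# Hypothetical sketch of the two conditions, not versim's pipeline.
# The grounded persona gets the distilled dossier; the naive baseline
# gets only the one-line role prompt quoted above.

NAIVE_TEMPLATE = (
    "You are a senior {role} based in {location} "
    "with {years} years of experience."
)

def build_conditions(dossier_text: str, role: str, location: str, years: int) -> dict[str, str]:
    """Return the system prompt for each condition, keyed by label."""
    return {
        "grounded": dossier_text,
        "naive": NAIVE_TEMPLATE.format(role=role, location=location, years=years),
    }
```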
The 15 questions
- 5 set-aside real channel turns. Questions the real person actually answered in the source community. Their real answer is the reference. The personas never saw these during build.
- 5 domain-specific. Questions that touch the person's known interests beyond their day job (investing, expat life, gaming, etc.).
- 5 generic. Any-domain questions that any senior in the field could answer. (A sketch of the mix follows the list.)
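A minimal sketch of how such a mix could be assembled, assuming each bucket is a plain list of question strings. The shuffle and every name here are our additions, not versim's runner; the pairing of each held-out turn with the real reference answer is omitted.

```python
import random

# Hypothetical sketch of the 15-question mix.

def build_question_set(held_out, domain_specific, generic, seed=0):
    rng = random.Random(seed)
    questions = (
        rng.sample(held_out, 5)           # real turns set aside before the build
        + rng.sample(domain_specific, 5)  # known interests beyond the day job
        + rng.sample(generic, 5)          # any senior in the field could answer
    )
    rng.shuffle(questions)  # assumed: so bucket order doesn't leak to the judge
    return questions
```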
The judge
Same LLM family as the personas, with the labels hidden. It sees Response A and Response B only. It scores each response on the four axes. It picks a winner per question.
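A minimal sketch of the blinding step, assuming per-question randomization of which condition sits in slot A (the page says the labels are hidden, not how). `ask_judge` is a stand-in for one judge-model call, not a real API; all names are hypothetical.

```python
import random

# Hypothetical sketch of the blinding step. Which persona lands in
# slot A is re-randomized per question, so the judge never sees a
# stable slot-to-condition mapping. Assume ask_judge returns axis
# scores keyed "A" and "B".

def judge_blinded(question, grounded_answer, naive_answer, ask_judge, rng=None):
    rng = rng or random.Random()
    flipped = rng.random() < 0.5
    slot_a, slot_b = ((naive_answer, grounded_answer) if flipped
                      else (grounded_answer, naive_answer))
    scores = ask_judge(question, response_a=slot_a, response_b=slot_b)
    # Un-blind after judging: map the A/B slots back to conditions.
    return {
        "grounded": scores["B" if flipped else "A"],
        "naive": scores["A" if flipped else "B"],
    }
```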
The cost
$0.30 of API spend. About 8 minutes wall time. The eval is small and cheap on purpose. We want anyone to be able to reproduce it.
Where the gap is
Naive personas can fake the speaking style. They can guess at the domain. They can't fake the voice.
- speaking-style match · 0–5
- specific knowledge · 0–5
- voice signature · 0–5 (the biggest gap)
- factual accuracy · 0–5
One example, illustrative
The contrast pattern below is what we see consistently in the fair test. We don't reproduce real test responses on this page because they identify the source community. This example is invented, with the same shape as what the test actually surfaces.
The question
Why are you still on Docker Swarm in 2026?
Grounded persona
Because it does what we need without three days of yak-shaving every quarter. We're five people running about 40 containers. Not Netflix. Helm charts and operators are answers to problems we don't have. I've done two K8s migrations at previous jobs. Both took longer than our entire Swarm setup ever did. Yes, the ecosystem moved on. No, I don't care. 🤷
Naive persona
Most teams have moved to Kubernetes by now. Docker Swarm is essentially in maintenance mode and the broader ecosystem has coalesced around K8s. Managed offerings on every major cloud (EKS, GKE, AKS), a much larger pool of operators and tooling, and a deeper hiring pool. If you're starting fresh today, Kubernetes is the default.
What this doesn't yet prove
We're committed to publishing the eval. We're also committed to telling you what it doesn't yet cover.
- One persona, deeply tested. 17.80 / 20 is from a small first test against one real person across 15 questions. We separately built a much larger production group. The full test at that scale is queued.
- Same model family for the judge. The judge is the same model family as the personas. Cross-model judging, with a different model family as the judge, is on the roadmap. We expect the gap to hold. We won't claim it until we have the data.
- Voice signature isn't for every use case. The biggest grounded gain is on voice signature. If your use case only needs factually accurate answers about a domain, the gain is smaller. Read the axis breakdown above.
- One archive is one archive. The test is on one community archive. Replication across more archives is in progress. Each new archive gets its own fidelity number.
Reproduce it
We share the full test with serious customers: the questions, raw judge scores, prompts, and the runner. We'd rather you run it than take our word. Email hello@versim.ai for access.
Related
- How we build a persona — the end-to-end pipeline. Real archives, anonymized when we load them. A team of agents builds a structured dossier per person. The anonymizer is a one-way valve from grounded text to you.
- How we treat the data — the privacy posture. What stays private, what you ever see, how communities can opt out, and what we will never do.
- Frequently asked questions — plain answers to common questions about versim, the data, pricing, and the waitlist.
- Back to versim — the landing page, with the sample panel run and the demo.