How we measure fidelity.
This page documents the test we use to grade synthetic personas. We didn't invent it for marketing. We use it because synthetic respondents are unfalsifiable without it.
The headline numbers
Grounded persona: 17.80 / 20 in the first full run of this eval (one real person, 15 questions). The per-axis picture is under "Where the gap is" below; the caveats are under "What this doesn't yet prove."
What fidelity means here
A synthetic persona should sound like a specific real person. Fidelity is how close that imitation gets. We measure it across four axes, each scored 0 to 5 by a blinded judge that doesn't see which side it's grading. Sum is 0 to 20. (A scorecard sketch follows the list.)
- Speaking-style match. Does the persona use the same voice as the real source? Formal or casual. Terse or verbose. Helpful or sarcastic.
- Specific knowledge. Does the persona mention the same tools, projects, and references the real person actually uses?
- Voice signature. Does the persona use the same idioms, emoji, code-switching, and signature humor as the real source?
- Factual accuracy. Does the persona stay consistent with known facts about the real person? Job, holdings, opinions, stack.
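For concreteness, here is a minimal sketch of one judged response's scorecard under this rubric. It is illustrative only: the class and field names are ours, not versim's code.

```python
from dataclasses import dataclass

# Illustrative scorecard for one judged response. The axis names mirror
# the rubric above; the class itself is hypothetical.

AXES = ("speaking_style", "specific_knowledge", "voice_signature", "factual_accuracy")

@dataclass
class AxisScores:
    speaking_style: int      # 0-5
    specific_knowledge: int  # 0-5
    voice_signature: int     # 0-5
    factual_accuracy: int    # 0-5

    def __post_init__(self):
        # Enforce the 0-5 range on every axis.
        for axis in AXES:
            value = getattr(self, axis)
            if not 0 <= value <= 5:
                raise ValueError(f"{axis} must be 0-5, got {value}")

    def total(self) -> int:
        # The summed score: 0-20.
        return sum(getattr(self, axis) for axis in AXES)
```

A response scored 4 / 5 / 5 / 4, for example, totals 18 of 20.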
How we test
We pick a real person from the source community. We build a grounded persona for them from their distilled dossier. We also build a naive persona using the same model, told only: "You are a senior X based in Y with N years of experience." Both personas answer the same 15 questions. A blinded judge scores both.
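A minimal sketch of the two conditions, assuming the dossier is passed as the grounded persona's system prompt (this page doesn't spell out the mechanics). The function and parameter names are ours.

```python
# Hypothetical sketch of the two conditions, not versim's pipeline.
# The grounded persona gets the distilled dossier; the naive baseline
# gets only the one-line role prompt quoted above.

NAIVE_TEMPLATE = (
    "You are a senior {role} based in {location} "
    "with {years} years of experience."
)

def build_conditions(dossier_text: str, role: str, location: str, years: int) -> dict[str, str]:
    """Return the system prompt for each condition, keyed by label."""
    return {
        "grounded": dossier_text,
        "naive": NAIVE_TEMPLATE.format(role=role, location=location, years=years),
    }
```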
The 15 questions
- 5 set-aside real channel turns. Questions the real person actually answered in the source community. Their real answer is the reference. The personas never saw these during build.
- 5 domain-specific. Questions that touch the person's known interests beyond their day job (investing, expat life, gaming, etc.).
- 5 generic. Any-domain questions that any senior in the field could answer. (A sketch of the mix follows the list.)
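A minimal sketch of how such a mix could be assembled, assuming each bucket is a plain list of question strings. The shuffle and every name here are our additions, not versim's runner; the pairing of each held-out turn with the real reference answer is omitted.

```python
import random

# Hypothetical sketch of the 15-question mix.

def build_question_set(held_out, domain_specific, generic, seed=0):
    rng = random.Random(seed)
    questions = (
        rng.sample(held_out, 5)           # real turns set aside before the build
        + rng.sample(domain_specific, 5)  # known interests beyond the day job
        + rng.sample(generic, 5)          # any senior in the field could answer
    )
    rng.shuffle(questions)  # assumed: so bucket order doesn't leak to the judge
    return questions
```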
The judge
Same LLM family as the personas, with the labels hidden. It sees Response A and Response B only. It scores each response on the four axes. It picks a winner per question.
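A minimal sketch of the blinding step, assuming per-question randomization of which condition sits in slot A (the page says the labels are hidden, not how). `ask_judge` is a stand-in for one judge-model call, not a real API; all names are hypothetical.

```python
import random

# Hypothetical sketch of the blinding step. Which persona lands in
# slot A is re-randomized per question, so the judge never sees a
# stable slot-to-condition mapping. Assume ask_judge returns axis
# scores keyed "A" and "B".

def judge_blinded(question, grounded_answer, naive_answer, ask_judge, rng=None):
    rng = rng or random.Random()
    flipped = rng.random() < 0.5
    slot_a, slot_b = ((naive_answer, grounded_answer) if flipped
                      else (grounded_answer, naive_answer))
    scores = ask_judge(question, response_a=slot_a, response_b=slot_b)
    # Un-blind after judging: map the A/B slots back to conditions.
    return {
        "grounded": scores["B" if flipped else "A"],
        "naive": scores["A" if flipped else "B"],
    }
```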
The cost
$0.30 of API spend. About 8 minutes wall time. The eval is small and cheap on purpose. We want anyone to be able to reproduce it.
Where the gap is
Naive personas can fake the speaking style. They can guess at the domain. They can't fake the voice.
- speaking-style match · 0–5
- specific knowledge · 0–5
- voice signature · 0–5 (the biggest gap)
- factual accuracy · 0–5
One example, illustrative
The contrast pattern below is what we see consistently in the fair test. We don't reproduce real test responses on this page because they identify the source community. This example is invented, with the same shape as what the test actually surfaces.
The question
Why are you still on Docker Swarm in 2026?
Grounded persona
Because it does what we need without three days of yak-shaving every quarter. We're five people running about 40 containers. Not Netflix. Helm charts and operators are answers to problems we don't have. I've done two K8s migrations at previous jobs. Both took longer than our entire Swarm setup ever did. Yes, the ecosystem moved on. No, I don't care. 🤷
Naive persona
Most teams have moved to Kubernetes by now. Docker Swarm is essentially in maintenance mode and the broader ecosystem has coalesced around K8s. Managed offerings on every major cloud (EKS, GKE, AKS), a much larger pool of operators and tooling, and a deeper hiring pool. If you're starting fresh today, Kubernetes is the default.
What this doesn't yet prove
We're committed to publishing the eval. We're also committed to telling you what it doesn't yet cover.
- One persona, deeply tested. 17.80 / 20 is from a small first test against one real person across 15 questions. We separately built a much larger production group. The full test at that scale is queued.
- Same model family for the judge. The judge is the same model family as the personas. Cross-model judging, with a different model family as the judge, is on the roadmap. We expect the gap to hold. We won't claim it until we have the data.
- Voice signature isn't for every use case. The biggest grounded gain is on voice signature. If your use case only needs factually accurate answers about a domain, the gain is smaller. Read the axis breakdown above.
- One archive is one archive. The test is on one community archive. Replication across more archives is in progress. Each new archive gets its own fidelity number.
Reproduce it
We share the full test with serious customers: the questions, raw judge scores, prompts, and the runner. We'd rather you run it than take our word. Email hello@versim.ai for access.
Related
- How we build a persona — the end-to-end pipeline. Real archives, anonymized when we load them. A team of agents builds a structured dossier per person. The anonymizer is a one-way valve from grounded text to you.
- How we treat the data — the privacy posture. What stays private, what you ever see, how communities can opt out, and what we will never do.
- Frequently asked questions — plain answers to common questions about versim, the data, pricing, and the waitlist.
- Back to versim — the landing page, with the sample panel run and the demo.