methodology · pipeline
How we build a persona.
Most "AI persona" tools guess from a job title. Ours are distilled from real conversations someone actually had. This page walks through the pipeline.
What a persona actually is, here
A versim persona is a typed role-play of a specific real person from a community we have permission to model. Not a generic "senior X" stereotype. One person's career, tools, tones, opinions, and signature phrasing, compressed into a structured profile and made queryable.
the pipeline
Six stages, archive to answer
- Archive in, anonymized when we load it. We start with a community's archive: messages, threads, timestamps, social context. We only work with archives we have permission to use. Identifiers are stripped at load time, so the rest of the pipeline never sees raw names, handles, or contact information. Communities can request deletion at any time.
- A dossier per person. For each real person in the archive, we build one structured profile. It covers their career, the tools they use and how deeply, their speaking style, their tone in conversation, rough seniority and current goals, who they listen to, personal details, and a short summary. Every claim points to the exact message that supports it, checked against the source text at build time. Made-up content is a bug. We reject it in tests.
- A team of agents builds it, not one big agent. Different aspects of a person need different lenses. We use a team of small specialised agents: one for career, one for tools, one for speaking style, one for tone, and so on. Each agent reads the archive and returns its specific slice with sources. One big monolithic agent would lose specificity and make things up. Many small agents stay focused.
- The anonymiser is a one-way valve. Nothing crosses to you without going through it. The anonymiser strips real user IDs, message IDs, handles, display names, exact source-text spans of fifteen words or more, and any blocked phrases on the per-archive list. What passes through: anonymized response text, public labels like persona_7_of_8 that don't link back to anyone, and a count of how many spans were removed. You never see a dossier, a real name, or a long exact quote.
- A panel builder picks who answers. You don't ask one persona. You ask a panel. The builder takes typed filters (size, role tier, activity level, tone exclusions, tool requirements, location, current goals) and picks a balanced sample matching them. The build is repeatable: the same filters with the same starting point give you the same panel twice.
- The panel answers, a summarizer combines. Each persona in the panel reads the question and answers in their own voice, grounded in their dossier plus a balanced sample of their real messages for cadence. Each claim is linked to evidence. The per-persona answers go through the anonymiser, then to a summarizer agent that builds the panel's collective answer, surfaces themes, and counts the sentiment split. That's what reaches you.
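To make the "many small agents" idea in stage 3 concrete, here is a minimal sketch of how per-aspect slices might be merged into one dossier. Everything named here (`Slice`, `career_agent`, `tools_agent`, `build_dossier`) is hypothetical, the agents are keyword stubs standing in for LLM calls, and dropping unsourced slices is our assumption about how "made-up content is a bug" could be enforced.

```python
from dataclasses import dataclass
from typing import Callable

Archive = dict[str, str]  # hypothetical shape: anonymized message id -> text


@dataclass(frozen=True)
class Slice:
    """One agent's contribution: a dossier field, a value, and its sources."""
    field: str
    value: str
    sources: tuple[str, ...]  # ids of the messages that support the value


Agent = Callable[[Archive], Slice]


def career_agent(archive: Archive) -> Slice:
    # Stub: a real agent would be a model call, not a keyword match.
    hits = tuple(mid for mid, text in archive.items() if "years" in text)
    return Slice("career", "mentions tenure", hits)


def tools_agent(archive: Archive) -> Slice:
    hits = tuple(mid for mid, text in archive.items() if "Postgres" in text)
    return Slice("tools", "uses Postgres", hits)


def build_dossier(archive: Archive, agents: list[Agent]) -> dict[str, Slice]:
    """Run each focused agent and keep only slices that cite a source."""
    dossier: dict[str, Slice] = {}
    for agent in agents:
        piece = agent(archive)
        if not piece.sources:
            continue  # assumption: an unsourced slice is rejected, not kept
        dossier[piece.field] = piece
    return dossier
```

Each agent owns exactly one field, so a weak or unsupported slice can be dropped without touching the rest of the profile.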
stage 2 · zoomed in
What a dossier looks like
A dossier is the long-form output of the team of agents: a JSON document with typed fields. The current shape covers:
Career timeline
Roles, how long, transitions, role types.
Tools and depth
Which tools they actually use, and how deeply, based on how they talk about them.
Speaking style
Formal, casual, terse, verbose; how it shifts with topic.
Conversation tone
Friendly-casual, helpful-terse, sarcastic-but-helpful, complaining, self-deprecating, and so on.
Seniority + goals
Rough seniority and what they're currently trying to do: looking for info, switching jobs, hiring, investing, selling.
Who they listen to
Whose advice they take, who they push back on, who they ignore.
Personal details
Hobbies, opinions, location, signature humor, idioms, emoji.
Summary
A short summary that ties everything together.
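The build-time evidence check described for dossiers ("every claim points to the exact message that supports it") could look roughly like the sketch below. The type names, field names, and the verbatim-substring rule are all assumptions for illustration; the page does not specify the real schema or matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    message_id: str  # hypothetical key into the loaded archive
    quote: str       # exact span the claim relies on


@dataclass(frozen=True)
class Claim:
    field: str       # e.g. "tools" or "career"
    text: str        # the dossier statement
    evidence: Evidence


def unsupported_claims(claims: list[Claim], messages: dict[str, str]) -> list[Claim]:
    """Return claims whose quote is not found verbatim in the cited message."""
    bad = []
    for claim in claims:
        source = messages.get(claim.evidence.message_id, "")
        if not claim.evidence.quote or claim.evidence.quote not in source:
            bad.append(claim)
    return bad
```

A build would then fail whenever `unsupported_claims(...)` returns anything, which is one way to treat made-up content as a test failure.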
stage 5 · zoomed in
Building a panel
A panel is a typed sample drawn from the dossier set. The builder accepts filters like these:
- Size — number of personas (typically 5 to 30).
- Role tier — for example, only senior individual contributors.
- Activity level — light, standard, or heavy posters.
- Tone exclusions — drop toxic, gatekeeping, or other tones you don't want.
- Tool requirements — must use a specific tool, optionally at a given depth.
- Locations — markets or regions.
- Goals — looking for info, investing, hiring, selling, and so on.
The build is repeatable and inspectable. Two runs with the same filters and the same starting point give you the same panel. Two runs with the same filters and different starting points give you different but statistically matched panels.
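A repeatable build of this kind can be sketched with a seeded random generator, treating the "starting point" as a seed (our reading, not something the page states). `PersonaMeta`, `build_panel`, and the filter parameters are hypothetical names, only a few of the listed filters are shown, and the balancing step is left out to keep the example short.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaMeta:
    label: str
    role_tier: str
    activity: str
    tone: str
    tools: frozenset[str]


def build_panel(pool, *, size, seed, role_tier=None,
                exclude_tones=frozenset(), required_tool=None):
    """Filter the pool, then draw a seeded sample so reruns match exactly."""
    candidates = [
        p for p in pool
        if (role_tier is None or p.role_tier == role_tier)
        and p.tone not in exclude_tones
        and (required_tool is None or required_tool in p.tools)
    ]
    # Sort first so the result depends only on filters and seed,
    # not on the order the pool happened to arrive in.
    candidates.sort(key=lambda p: p.label)
    rng = random.Random(seed)
    return rng.sample(candidates, min(size, len(candidates)))
```

Calling `build_panel(pool, size=5, seed=42)` twice returns the same five personas; changing only `seed` draws a different sample from the same filtered candidates.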
stage 6 · zoomed in
How a persona answers
When a question comes in, each persona in the panel:
- Reads its full structured dossier (career, tools, tone, personal details, summary).
- Reads a balanced sample of its real messages, for voice cadence. The sample is picked for spread across topics, not concentrated on one thread.
- Receives the question exactly as you wrote it.
- Produces an answer in the source person's voice, grounded in dossier facts and message phrasings.
- Links each claim to evidence: which dossier entries and which message spans support the answer.
The persona answer goes through the anonymiser before reaching the summarizer. The summarizer never sees the dossier or the raw answer. You never see them either.
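One simple way to pick a message sample that is spread across topics rather than concentrated on one thread is a round-robin over topics. The sketch below assumes each message already carries a topic label, which the page does not say; `balanced_sample` is a hypothetical name.

```python
from collections import defaultdict
from itertools import zip_longest


def balanced_sample(messages, k):
    """Pick up to k texts, cycling through topics one message at a time.

    messages: list of (topic, text) pairs in archive order.
    """
    by_topic = defaultdict(list)
    for topic, text in messages:
        by_topic[topic].append(text)

    picked = []
    # zip_longest yields the first message of every topic, then the second
    # of every topic, and so on; exhausted topics are padded with None.
    for round_ in zip_longest(*by_topic.values()):
        for text in round_:
            if text is not None and len(picked) < k:
                picked.append(text)
    return picked
```

With three topics and `k=3`, a person who posted mostly about one subject still contributes one message per topic before any topic contributes a second.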
the one-way valve
The trust boundary, in detail
The anonymiser is the only path from real-person-grounded text to anything you see. It runs on every response, every time, with no toggle to turn it off.
Removed at the boundary
- User IDs and message IDs from the source archive
- Real names, display names, and handles
- Exact source-text spans of fifteen words or more
- Blocked phrases on the per-archive list
- Contact information of any form
Passes through to you
- Anonymized answer text in the source community's voice
- Public persona labels (e.g., persona_07_of_30) that don't link back to any real person
- A count of how many spans were removed
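A minimal sketch of the boundary rules listed above follows, covering handles, blocked phrases, and verbatim spans of fifteen words or more, and returning the removal count alongside the text. The function name, the `@name` handle pattern, the `[removed]` placeholder, and the exact-token matching are all assumptions; ID and contact-information stripping are omitted, and a real matcher would need to handle punctuation and casing differences.

```python
import re

MIN_SPAN = 15  # words; the threshold named in the text


def _ngrams(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def anonymise(text, source_texts, blocked_phrases=(), handle_pattern=r"@\w+"):
    """Return (clean_text, removed_count). Whitespace is normalised."""
    removed = 0

    # 1. Handles (the @name shape is an assumption about the archive).
    text, n = re.subn(handle_pattern, "[removed]", text)
    removed += n

    # 2. Blocked phrases from the per-archive list, case-insensitive.
    for phrase in blocked_phrases:
        if not phrase:
            continue
        text, n = re.subn(re.escape(phrase), "[removed]", text, flags=re.IGNORECASE)
        removed += n

    # 3. Verbatim source spans of MIN_SPAN words or more.
    source_grams = set()
    for source in source_texts:
        source_grams |= _ngrams(source.split(), MIN_SPAN)
    words = text.split()
    copied = [False] * len(words)
    for i in range(len(words) - MIN_SPAN + 1):
        if tuple(words[i:i + MIN_SPAN]) in source_grams:
            for j in range(i, i + MIN_SPAN):
                copied[j] = True

    # Collapse each run of copied words into one placeholder, counted once.
    out, in_run = [], False
    for word, is_copied in zip(words, copied):
        if is_copied:
            if not in_run:
                out.append("[removed]")
                removed += 1
            in_run = True
        else:
            out.append(word)
            in_run = False
    return " ".join(out), removed
```

Short quotes under the threshold pass through unchanged, which matches the "long exact quote" wording; only the count of removals, not the removed text, leaves the function.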
honest scope
What this doesn't yet cover
We're committed to publishing the methodology and to being clear about its current scope.
- One archive today, more in progress. The pipeline runs end-to-end on one curated community archive. Replication across more archives is in progress, and each new archive gets its own validated build.
- Same model family across stages. The team of agents that builds dossiers, the persona that answers, and the judge that scores all currently use the same LLM family. Cross-model checking (a different model as the judge) is on the roadmap.
- Static dossiers. Dossiers are built once and updated on a schedule. By design, they do not learn from your questions: that would be a leakage vector we don't want.
- Multi-turn behaviour is exploratory. Panel questions (one question, one answer) ship with a fidelity number. Multi-turn simulations (the Sim surface) are coming, but we don't yet have a fair test that maps cleanly to multi-turn behaviour.
Related
- How we measure fidelity — the fair test that scores a persona's match to the real source. 17.80/20 grounded vs 10.60/20 ungrounded on the same model.
- How we treat the data — the privacy posture. What stays private, what you ever see, how communities can opt out, and what we will never do.
- Frequently asked questions — plain answers to common questions about versim, the data, pricing, and the waitlist.