Voice AI agents are quickly moving from demo environments into real customer-facing products: banking, insurance, e-commerce, healthcare, contact centers, CRM systems, and customer support automation platforms.
But along with this shift comes a very uncomfortable question:
How do you test a voice solution if it does not have a single stable expected result?
The classic E2E approach works well when the system is deterministic: click a button, get a specific result, compare it with the expected value. But Generative AI works differently. It can answer correctly, but in different words. It can complete the scenario, but take too long to do it. It can sound polite, but still fail to solve the actual task. And in the voice domain, this is further complicated by latency, speech-to-text, text-to-speech, accents, interruptions, silence, emotions, audio quality, and the behavior of the telephony infrastructure.
This is the problem that led to the concept of an AI Role-Play Testing Framework – an approach where we do not simply check “whether the bot responded,” but simulate a full conversation between a user and the system, and then have an independent AI Judge evaluate the quality of the interaction against structured criteria.
Table of Contents
Key Takeaways
Who this article is for
This article will be useful for teams that build, test, or implement Conversational AI and voice AI solutions:
- QA / Automation Engineers who deal with non-deterministic LLM responses;
- Engineering Managers and Tech Leads who want to make voice agent testing repeatable;
- Product Managers responsible for the quality of customer-facing AI;
- Contact center, banking, insurance, healthcare, and support automation leaders where compliance, quality monitoring, and user experience matter;
- AI/ML and Conversational AI teams that want to move from manual transcript review to structured evaluation.
The main idea is simple: a voice agent cannot be tested properly only through a happy path or an exact expected result. It needs to be tested as a behavioral system: through scenarios, synthetic users, transcripts, structured judging, and repeatable metrics.
What the reader will understand after reading this article
After reading this article, the reader should understand:
- why classic E2E testing does not work well for generative voice/chat agents;
- what an AI Role-Play Testing Framework is;
- what roles SUT, SimUser, and Judge play;
- what a reference flow for voice testing looks like;
- which components are needed for a cloud-based implementation;
- what real artifacts look like: scenario config, Judge rubric, and JSON evaluation;
- how pre-production synthetic testing differs from post-production conversation evaluation;
- which limitations, risks, and lessons learned should be considered.
Article map
The article follows this structure:
- The problem: why voice AI is difficult to test using classic methods.
- The origin of the approach: from semantic SQL evaluation to role-play testing.
- Three key roles: SUT, SimUser, and Judge.
- Reference flow: how a single voice scenario test works.
- Architecture: which components are needed for a cloud-based implementation.
- Artifacts: examples of a scenario config, Judge rubric, and JSON evaluation.
- Context management: why role isolation is critical.
- Multi-judge evaluation: CX, OP, and Compliance.
- Two usage modes: pre-production testing and post-production evaluation.
- Limitations & Lessons Learned.
- Conclusion: how to turn voice testing from manual analysis into a measurable system.
Where It All Started: When the Expected Result Stopped Being a Line of Text
The first push toward a Role-Play Framework did not come from voice, but from text-based Conversational AI.
On one project, we needed to test a Generative AI-based chatbot that helped non-technical users query a database. The user would write a natural language request in plain English, and the bot had to generate an SQL query that could then be used to retrieve data from external systems.
At first glance, this looked like a regular automation task: send a request, receive a response, compare the result.
But in practice, it was more complicated.
Playwright could send a message either through the UI or directly through the API. It could receive the bot’s response, save it in the test execution flow, and pass it further. Technically, the validation could be built using pattern matching or regular expressions, and that even worked to some extent. But for LLM responses, this approach quickly ran into limitations: it was brittle, did not handle variation in wording well, and could not reliably evaluate the semantic correctness of the response.
An SQL query could be semantically correct, but differ in structure, formatting, condition order, alias names, or writing style. Two different SQL queries could return the same result, but look different as text.
Because of this, literal string comparison or a simple expected result in the form of “the answer must equal X” made almost no sense.
What was needed was not exact-match validation, but semantic similarity evaluation:
Did the bot actually understand the user request and generate the correct query logic?
That is when the first solution appeared: Playwright sent the test suite to the bot, received the response, and then passed that response to Azure OpenAI together with a prompt explaining what exactly needed to be checked.
As the output, we received a similarity score and could make a pass/fail decision using a threshold, for example 70%+.
This was not yet a full Role-Play Framework, but the core idea was already there:
An LLM can be not only the System Under Test, but also a tool for evaluating the result.
From there, the next logical step emerged:
If AI can evaluate the response, why not also make AI play the role of the user?
Key concept: SUT / SimUser / Judge
The concept of a Role-Play Framework is simple, but powerful. It is built around three entities:
- SUT
- SimUser
- Judge
It is important to separate these roles not only logically, but also technically. Each of them must have its own context, its own prompt, its own constraints, and a clearly defined input/output contract.
| Role | What it is | What it does | What it must not do |
|---|---|---|---|
| SUT | System Under Test | Conducts the conversation with the user, executes the business flow, responds to requests | Must not know the internal test scenario or Judge rubric |
| SimUser | Synthetic / simulated user | Imitates a real user with a specific goal, behavior, emotional state, and constraints | Must not evaluate the system or guide it toward the correct path |
| Judge | Independent evaluator | Analyzes the transcript, metadata, and rubric, then returns a structured evaluation | Must not participate in the dialogue or “help” the SUT |
SUT – System Under Test
SUT is the system we are testing.
It can be:
- a local LLM;
- a remote chatbot built using Generative AI;
- a voice assistant in a CRM/contact center platform;
- a custom API-based assistant;
- a platform-native conversational agent;
- or even a human agent, if we are analyzing real production transcripts.
For the Role-Play Framework, it does not matter that much what the SUT is under the hood. What matters is that it participates in the dialogue and must complete a specific user task.
SimUser – Synthetic User
SimUser is a simulated user – an AI persona that imitates a real user.
It has:
- a specific task/intent;
- a level of technical literacy;
- an emotional state;
- a communication style;
- constraints;
- a behavioral scenario;
- sometimes incomplete or unclear data.
In the context of voice testing, SimUser can be not just a text-based entity, but a full voice persona: with an accent, speaking pace, pauses, unclear responses, or emotional reactions.
For example, the same scenario can be run through several different personas:
- a neutral user;
- a frustrated customer;
- a person who speaks quickly;
- a user who does not provide all data on the first attempt;
- a customer who changes intent in the middle of the conversation;
- a user who interrupts the agent.
For classic automation testing, this is almost impossible to cover naturally. For AI Role-Play, it is simply a different persona prompt and voice profile.
Judge – Independent Evaluator
Judge does not participate in the dialogue.
Its task is to analyze the transcript or individual turn pairs and produce a structured evaluation:
- score;
- rationale;
- pass/fail;
- hard-fail conditions;
- qualitative feedback;
- recommendations;
- a list of problematic moments in the conversation.
Judge can evaluate the entire transcript after the conversation is complete, or perform turn-by-turn evaluation. In more advanced implementations, there may be not one Judge, but several specialized evaluators: for example, Judge-CX, Judge-OP, and Compliance Judge.
Reference Flow: What a Single Voice Role-Play Test Looks Like
To prevent the Role-Play Framework from remaining an abstract idea, it is useful to look at the typical flow of a one-voice scenario.
The basic test flow can be described as follows:
- Test Orchestrator receives a scenario config.
- The scenario config defines the user goal, persona, expected behavior, evaluation rubric, and stop conditions.
- SimUser creates the first user turn according to its role.
- The voice simulation layer converts the SimUser text into speech.
- Twilio or another telephony layer initiates a call to the target voice agent.
- SUT accepts the call and responds to the user.
- The STT / transcript service collects the text version of the conversation.
- The Orchestrator controls turn-by-turn progression, timeouts, call state, and stop conditions.
- After the conversation ends, the transcript, metadata, and scenario goal are passed to the Judge.
- Judge evaluates the interaction against the rubric.
- The evaluation result is stored as structured JSON.
- The reporting layer displays pass/fail, score, rationale, failed dimensions, and recommendations.
In simplified form, it can be described like this:

This flow is important because it shows that Role-Play testing is not just “asking an LLM to evaluate a transcript.” It is a full test execution lifecycle with a scenario, state, synthetic behavior, a voice layer, data collection, evaluation, and reporting.
Component Diagram: What It Looks Like Architecturally
For a cloud-based voice testing implementation, several main components can be identified.
Main Components
- Test Orchestrator: Controls the test launch, scenario, dialogue state, timeouts, retry logic, and run completion.
- Scenario Store: Stores scenarios, personas, rubrics, expected behavior, success criteria, and failure criteria.
- SimUser Engine: Generates synthetic user replies according to the persona, user goal, and scenario constraints.
- Voice Simulation Layer: Creates the user’s voice behavior: voice, accent, pauses, emotion, speaking speed.
- Telephony Layer: Handles outbound/inbound call routing, phone sessions, call state, and integration with the target voice agent.
- SUT / Voice Agent: The system being tested. This can be a voice agent in a CRM/contact center platform, a custom bot, or a platform-native Conversational AI.
- Transcript Collector: Collects the transcript, call events, latency, duration, STT confidence, interruption events, and technical metadata.
- Judge Engine: Evaluates the result using CX, OP, Compliance, or other rubrics.
- Result Store: Stores the transcript, scores, rationale, metadata, and test execution history.
- Reporting UI: Shows results for QA, product, and engineering teams: pass/fail, score trends, failed scenarios, weak intents, fallback loops, and recommendations.
Mermaid Version for an Article or Documentation

Cloud-Based Implementation Example
One practical implementation option for the Voice Role-Play approach can be a cloud-based architecture where the testing framework runs in AWS, initiates outbound calls to a voice agent in a customer support automation platform through Twilio, uses AI-generated voice personas through ElevenLabs, collects the conversation transcript, and generates structured reports for further analysis.
In this stack:
- Python is responsible for the core orchestration logic;
- AWS Lambda executes separate stages of the test flow;
- API Gateway provides the trigger endpoint;
- DynamoDB stores test status and score metadata;
- S3 stores prompts, transcripts, and reports;
- Twilio handles the telephony infrastructure;
- ElevenLabs creates the voice simulation layer;
- AWS Bedrock can serve as the LLM layer for Judge evaluation.
In other words, voice testing is no longer limited to manual checks of individual scenarios or reading transcripts after calls. It can be turned into a repeatable, measurable, and partially automated process where each dialogue has a scenario, a goal, a synthetic user persona, an evaluation, a score, a rationale, and a pass/fail result.
Example Artifact: Scenario Config
Below is an example of a test scenario config. This is not a production-ready schema, but a simplified example of how a voice role-play scenario can be formalized.
{
"scenarioId": "voice-banking-routing-number-001",
"title": "User asks for routing number with partial verification data",
"domain": "banking",
"mode": "pre-production-synthetic-testing",
"userGoal": "Find the bank routing number after completing required verification",
"persona": {
"type": "frustrated_customer",
"language": "en-US",
"speakingStyle": "fast, slightly impatient",
"constraints": [
"Does not provide all verification data in the first response",
"May interrupt the agent once",
"Wants a short and direct answer"
]
},
"successCriteria": [
"SUT correctly identifies the routing-number intent",
"SUT asks only for required verification data",
"SUT does not invent account-specific information",
"SUT provides clear next step or final answer",
"SUT avoids repeated fallback loops"
],
"failureCriteria": [
"SUT asks for unnecessary sensitive data",
"SUT fails to recognize the intent after two attempts",
"SUT provides hallucinated policy or account information",
"SUT ends the call without resolving or escalating"
]
}
This artifact is useful because it turns “let’s check whether the bot is good” into a specific test scenario. It includes:
- user goal;
- persona;
- constraints;
- success criteria;
- failure criteria;
- domain;
- execution mode.
This is no longer just a conversation prompt. It is a test case that can be run repeatedly, versioned, compared between releases, and analyzed in a regression suite.
Example Artifact: Judge Rubric
A Judge rubric is a set of rules by which the AI Judge evaluates the transcript. It is important that Judge does not simply say “good” or “bad.” It must evaluate specific dimensions and return a structured result. For example, Judge-OP can evaluate operational performance.
Judge-OP: Operational Performance
Evaluate the transcript against the user goal and scenario success criteria.
Score each dimension from 0.00 to 1.00.
Dimensions:
1. Intent Recognition
- 1.00: SUT clearly identifies the user intent early and maintains it across the flow.
- 0.50: SUT partially identifies the intent but requires unnecessary clarification or detours.
- 0.00: SUT fails to identify the user intent or routes to the wrong flow.
2. Flow Completion
- 1.00: SUT completes the expected flow or provides a valid escalation path.
- 0.50: SUT partially completes the flow but leaves the user without a clear resolution.
- 0.00: SUT does not complete the flow and provides no useful next step.
3. Fallback & Error Handling
- 1.00: SUT handles missing data, unclear input, silence or interruption gracefully.
- 0.50: SUT recovers after friction but creates avoidable confusion.
- 0.00: SUT repeats fallback loops, ignores input or breaks the conversation.
Such a rubric makes the evaluation more transparent. The team sees not only the overall score, but also the reason why the scenario passed or failed.
Example Artifact: JSON Evaluation
After the test run is completed, Judge can return structured JSON.
{
"runId": "run-2026-05-29-001",
"scenarioId": "voice-banking-routing-number-001",
"status": "completed",
"overallResult": "fail",
"overallScore": 0.62,
"judges": {
"cx": {
"status": "completed",
"score": 0.78,
"dimensions": {
"toneAndWarmth": 0.82,
"clarityAndNaturalness": 0.75,
"responsiveness": 0.77
},
"rationale": "The agent sounded polite and mostly clear, but required repeated clarification."
},
"op": {
"status": "completed",
"score": 0.46,
"dimensions": {
"intentRecognition": 0.65,
"flowCompletion": 0.35,
"fallbackHandling": 0.38
},
"rationale": "The agent identified the general intent but failed to complete the flow and repeated fallback prompts."
},
"compliance": {
"status": "completed",
"score": 0.72,
"hardFail": false,
"rationale": "No critical compliance violation was detected, but the agent asked for more verification data than needed."
}
},
"keyFindings": [
"Intent was recognized after the second user turn, not the first.",
"The agent entered a repeated clarification loop.",
"The user did not receive a clear final answer or escalation path."
],
"recommendations": [
"Improve routing-number intent examples.",
"Add fallback limit and escalation rule.",
"Review verification prompt wording."
]
}
The value of this format is that it can be used not only as a human-readable report, but also as a machine-readable artifact.
- It can be:
- stored in S3, DynamoDB, or another result store;
- displayed in a dashboard;
- compared between test runs;
- used for regression analysis;
- aggregated by scenarios, intents, or flows;
- connected to CI/CD quality gates.
Why Node.js Was a Good Starting Point, but Not an Ideal Core for Local Llms
The first PoC implementation was written in Node.js. Local LLMs were run through Ollama, and different models were used for different roles.
The idea was to take one conversational model for SimUser, another for SUT, and a stronger reasoning/evaluation model for Judge.
Node.js works well for integrations, APIs, orchestration, test runners, HTTP adapters, and quick PoCs. It is a great tool for quickly assembling a working prototype: send a request, receive a response, pass data between services, connect a remote bot or API.
But when it comes to native interaction with local LLMs, especially on an ARM MacBook, nuances start to appear.
Local LLMs are not just about “calling an API.” They involve working with large model weights, quantization, memory allocation, streaming tokens, GPU/Metal acceleration, lifecycle management, context windows, model loading/unloading, and runtime stability.
Historically, the Python ecosystem has had more tooling for this: ML libraries, inference SDKs, adapters, evaluation tools, observability, dataset processing, and prompt/evaluation pipelines. Node.js can work with LLMs, but often through additional binding layers or external servers.
This is where Ollama played an important role.
Ollama made it possible to run local models as a separate model server and communicate with them through an API. For the Role-Play Framework, this meant that the application code did not have to directly “hold” the model in its own process. Node.js or Python simply sends a chat request to the Ollama server, and Ollama handles model serving.
This was especially important on a MacBook Air with an ARM chip, where resources are limited. Thanks to Ollama, it became possible to run local LLMs, test different roles, experiment with Mistral/LLaMA3, and avoid paying for every test dialogue in the cloud.
free local Conversational AI testing through role-play simulations.
Migration to Python, Docker, and the Adapter Layer
Later, the solution was rewritten in Python. The reason was not that Node.js was “bad.” The reason was that Python better matched the nature of the task: LLM evaluation, adapters, Judge logic, working with transcripts, and integrations with observability and audit tools.
In the Python version, the framework received clearer boundaries:
agents/
sim_user.py
sut.py
judge.py
adapters/
local_ollama_adapter.py
http_adapter.py
platform_adapter.py
prompts/
personas/
scenarios/
rubrics/
reports/
markdown/
json/
integrations/
langfuse/
giskard/
cloud_storage/
docker/
Dockerfile
docker-compose.yml
One particularly important part is the adapter layer.
Across different projects, bots were implemented differently: different hosting, different APIs, different authorization, different dialogue start logic. Some bots were proactive and started the conversation themselves. Others waited for the first message from the user. In some cases, response pre-cleaning was needed; in others, a target user ID had to be passed; elsewhere, a specific payload had to be prepared.
That meant the framework could not be “one script for everything.” It had to become:
core framework + adapter layer
This is what made it possible to reuse the Role-Play Framework on other projects without rewriting it completely. The team could take the existing structure, add a new adapter, configure authorization, change prompts, add scenarios, and run a regression role-play suite against a new remote bot.
Docker became another important step.
The idea was simple: the solution should run locally, repeatably, and without breaking the machine. This is especially important on a MacBook Air, where every extra gigabyte of RAM noticeably affects performance.
That is why minimal container resources were selected, Ollama was moved into a separate service, prompts were mounted as read-only, reports were written into a dedicated output folder, and large model files were not pushed to git.
As a result, the framework could be run locally, adapted to new projects, use one local or remote LLM for both SimUser and Judge, or separate roles across different models depending on the task, budget, and available resources.
Observability and Evaluation Tooling
Another important stage was integration with observability and evaluation tooling. The Role-Play Framework itself can run dialogues and return scores. But for a real engineering team, that is not enough. The team needs to see:
- which prompts were used;
- which scenario version was run;
- which model responded;
- how many tokens were used;
- what the latency level was;
- where exactly the conversation flow broke;
- why Judge assigned a particular score;
- how results change between releases.
For this, tools such as Langfuse, LangSmith, Giskard, Promptfoo, or a custom reporting layer can be used. For example:
- Langfuse can provide tracing, prompt/version visibility, and token/cost/latency tracking.
- Giskard can help with risk/audit checks.
- Promptfoo can be useful for prompt assertions and batch evaluation.
- LangSmith can help with traces and evaluation pipelines.
- A custom dashboard can aggregate results by scenarios, intents, flows, and releases.
In practical terms, this turns the framework from “a script that runs dialogues” into a system where the team can analyze exactly why a scenario failed.
Context Management – The Main Risk Area
In Role-Play testing, it is very easy to break the system through poor context management. SimUser must not know the Judge rubric. Judge must not “play” the user. SUT must not receive internal scenario notes. The persona prompt must not leak into the evaluation prompt. If context starts moving between roles, we get hallucinations, role confusion, prompt leakage, and results that look good but are actually incorrect.
That is why one of the key engineering lessons is:
Each role must have an isolated context space, its own system message, its own memory policy, and a clearly defined input/output contract.
Judge should receive only what is needed for evaluation:
- transcript;
- scoring rubric;
- expected behavior;
- user goal;
- scenario metadata;
- call metadata, if this is voice.
But it must not receive unnecessary prompt soup that influences the evaluation.
The same applies to SimUser. It should know its role, goal, and behavioral constraints, but it must not know which rubric the dialogue will be evaluated against by the Judge. Otherwise, the synthetic user starts behaving not like a real user, but like a student who accidentally saw the answers to the test.
Why the Multi-Judge Approach Matters for Enterprise Voice Testing
Basic AI feedback can say: “the response was fine” or “the bot did not help.” But that is not enough for enterprise QA. The team needs structured results that can be compared between test runs, tracked over time, aggregated in a dashboard, and used for decision-making. That is why qualitative feedback must be transformed into quantitative scoring.
Judge-CX
Judge-CX evaluates conversational experience:
- Tone & Warmth;
- Clarity & Naturalness;
- Responsiveness.
It answers the question:
Was the interaction comfortable, clear, and natural for the user?
Judge-OP
Judge-OP evaluates operational performance:
- Intent Recognition;
- Flow Completion;
- Fallback & Error Handling.
It answers the question:
Did the agent actually understand the task and bring the flow to a useful outcome?
Compliance Judge
Compliance Judge can evaluate:
- privacy boundaries;
- sensitive data handling;
- regulatory language;
- escalation rules;
- hallucinated policy or account data.
It answers the question:
Was the interaction safe, correct, and compliant for the specific domain?
This separation helps avoid a situation where one general score hides the real problem. If the overall score is 0.72, it may look “fine.” But if the OP score is high and the CX score is low, it means the system technically completes the task, but does so in a dry, mechanical, or unclear way.
If the CX score is high and the OP score is low, the user gets a pleasant conversation that does not solve their problem. For the business, this can be even worse, because it creates an illusion of quality. The multi-judge approach gives a more honest picture. It makes it possible to see exactly which aspect of the interaction needs improvement:
- prompt;
- intent routing;
- fallback logic;
- voice persona;
- STT quality;
- knowledge base;
- tool integration;
- business flow;
- escalation policy.
How This Applies to Voice
At first glance, voice testing may seem like a separate category. But if we break down the voice agent pipeline, it still largely comes down to text:
- the user speaks;
- STT converts voice into a text transcript;
- the LLM or dialogue system makes a decision;
- the system calls tools/APIs or generates a response;
- TTS converts text into speech;
- the user hears the response.
So at the center of a voice solution, we still have turns, text, context, intent, business logic, and response quality. But voice adds a new layer of risks:
- audio quality;
- latency;
- interruptions;
- accent variance;
- speech recognition errors;
- emotional tone;
- silence handling;
- call drops;
- turn-taking issues;
- barge-in behavior;
- background noise;
- telephony instability.
That is why voice QA cannot be only transcript QA. But transcript QA remains the core.
If the transcript shows that the voice agent did not understand the intent, repeated the same fallback three times, asked for unnecessary personal data, or failed to complete the flow, this is already a signal of a problem.
If we add metadata about latency, call duration, interruption events, STT confidence, and user goal completion, we get a much fuller picture of voice experience quality.
The Role-Play Framework naturally extends into voice because SimUser can be not just a text persona, but a voice persona: with an accent, emotion, speaking speed, pauses, and unclear responses.
From Platform-Native Testing to Zero-Touch Voice Evaluation
The current platform-native implementation is important because it validates the core loop inside a real customer support platform:
- initiate a voice interaction;
- conduct a conversation between a synthetic user and the target voice agent;
- receive the transcript;
- evaluate the result;
- store the structured evaluation.
But the next logical step is to make the solution less dependent on a specific platform production configuration and more plug-and-play for different CRM, contact center, and support automation environments.
The goal of the next phase is a maximally decoupled architecture.
The ideal scenario:
- do not change the production configuration in the CRM or support automation platform;
- do not rely on platform-native surveys, applets, or internal AI engines;
- receive the audio stream or transcript externally;
- process it through AWS speech processing/STT;
- evaluate it through Bedrock;
- show the result in a separate reporting UI.
This opens several strong possibilities.
Parallel Execution
Many scenarios can be run simultaneously:
- different personas;
- different accents;
- different customer goals;
- edge cases;
- negative scenarios;
- compliance-sensitive scenarios.
This significantly shortens the testing lifecycle and allows the team to get regression feedback faster.
Dynamic Scenario Generation
An AI Consultant can create a custom scenario on the fly:
“Test a frustrated customer who wants a routing number but provides an incomplete SSN.”
Or:
“Test a user who starts with one intent and then changes it in the middle of the conversation.”
This allows teams not only to run predefined scenarios, but also to generate new test coverage based on business flows, historical transcripts, or known weak areas.
Goal Mapping
Evaluation is tied not just to the transcript, but to the user goal:
- what the user wanted to do;
- whether they achieved it;
- where the flow broke;
- whether there was a clean handoff;
- whether the system could offer a fallback;
- whether the user was left with a clear next step.
Holistic Reporting
Instead of a click-per-engagement view, the system should provide a bird’s-eye view:
- which intents fail;
- where users get stuck;
- which fallback repeats;
- which scenarios have low CX but normal OP;
- which flows need retraining;
- which prompts need to be rewritten;
- which knowledge gaps need to be closed.
This is what turns the framework from a QA automation tool into a conversation intelligence system.
Two Usage Modes: Pre-Production and Post-Production
The same Role-Play Framework can work in two different modes:
- Pre-production synthetic testing
- Post-production conversation evaluation
These modes are technologically similar, but they have different goals.
Pre-Production Synthetic Testing
Pre-production synthetic testing answers the question: Is the voice/chat agent ready for release?
In this mode, the system does not yet analyze real users. It runs synthetic scenarios against the bot or voice agent before the production release. Pipeline:

This mode is useful for:
- regression testing before release;
- checking happy paths, edge cases, and negative scenarios;
- testing new prompts, flows, intents, or knowledge base updates;
- comparing quality between agent versions;
- checking behavior with different personas, accents, emotions, and interruption patterns;
- early detection of fallback loops;
- validating business rules before real users see them.
In this mode, SimUser is a controlled synthetic actor. We define its behavior, constraints, and goal ourselves. It does not fully replace real users, but it makes it possible to find problems much earlier – problems that would otherwise appear only in production.
Post-Production Conversation Evaluation
Post-production conversation evaluation answers a different question: What is actually happening with users after release?
In this mode, Judge analyzes existing transcripts of real interactions. Here, the SUT can be not only an AI agent, but also a human agent. Pipeline:

This mode is useful for:
- quality monitoring;
- coaching human agents;
- compliance review;
- finding problematic intents;
- analyzing fallback loops;
- detecting hallucinated or incorrect information;
- comparing synthetic test coverage with real production issues;
- evaluating how agents adopted new training materials;
- finding interaction patterns that were not covered by pre-production tests.
For example, a team of agents completes training. After that, they start communicating with customers. The system takes call transcripts and runs them through a specialized Judge prompt that evaluates how well the agents absorbed the new material and how consistently they use the recommendations in practice.
Judge can evaluate:
- whether the agent correctly understood the customer’s intent;
- whether the agent brought the flow to completion;
- whether the agent violated compliance;
- whether the agent asked for unnecessary personal data;
- whether the agent provided hallucinated or incorrect information;
- whether the tone was professional;
- whether the customer received a clear next step;
- whether an additional coaching session is needed.
This should not replace a team lead or QA manager. But it can dramatically speed up the initial analysis. Instead of manually reading hundreds of transcripts, AI Judge can highlight the 10% most problematic interactions, explain the reason, and provide structured scoring.
Key Difference Between the Two Modes
| Mode | When it is used | Who the “user” is | What we evaluate |
|---|---|---|---|
| Pre-production synthetic testing | Before release | SimUser | Whether the agent is ready for production |
| Post-production conversation evaluation | After release | Real customer | What actually happens in production |
Synthetic role-play allows testing before release. Real transcript evaluation shows what happens after release. Together, they create a feedback loop:
Test → Release → Observe → Evaluate → Improve → Retest
Testing Not Only Bots, but Also People
One of the most interesting directions is using Judge not only for synthetic AI interactions, but also for evaluating real human agents. This is especially valuable for banking, insurance, healthcare, and any regulated domains where “the conversation was polite” is not enough. It is necessary to prove that the interaction was:
- correct;
- compliant;
- complete;
- human-friendly;
- aligned with internal policy;
- aligned with the customer goal.
Imagine that a company launches a new policy or training program. After that, managers need to understand whether agents are actually using the new recommendations in real conversations. The classic approach is to selectively listen to calls or read transcripts. The AI-assisted approach is to run transcripts through a Judge that evaluates specific criteria:
- whether the agent explained the new policy correctly;
- whether the agent missed a required disclosure;
- whether the agent created a compliance risk;
- whether the user received a clear next step;
- whether the interaction was closed correctly.
This does not replace human control. But it helps scale quality review and find problematic interactions faster.
Limitations & Lessons Learned
The Role-Play Framework is not a silver bullet. It does not fully replace manual QA, domain experts, compliance review, or production monitoring. But it significantly improves repeatability, visibility, and speed of evaluation.
1. Judge Is Not an Absolute Source of Truth
LLM-as-a-Judge can make mistakes, be unstable, or interpret complex situations differently. That is why it is important to have:
- clear rubrics;
- calibration examples;
- human review for critical cases;
- versioning of Judge prompts;
- regression checks for the Judge itself;
- hard-fail rules for important compliance conditions.
Judge should be part of the evaluation system, not a magical oracle.
2. Transcript Does Not Equal the Full Voice Experience
A transcript shows the logic of the dialogue, but does not always capture:
- intonation;
- pauses;
- interruptions;
- delays;
- noise;
- emotion;
- audio quality;
- barge-in behavior;
- awkward silence.
Voice testing also requires call metadata, latency, STT confidence, interruption events, and audio-level signals. Transcript QA is the core, but not the whole picture.
3. Context Isolation Is Critical
If SimUser, SUT, and Judge see unnecessary information, the results become artificial. Role leakage and prompt contamination can make a test look good, but make it incorrect. That is why it is necessary to strictly separate:
- persona prompt;
- scenario notes;
- SUT context;
- Judge rubric;
- evaluation metadata;
- internal test instructions.
4. Synthetic Users Do Not Cover All of Reality
Even well-designed personas do not replace real users. Synthetic testing helps test before release, but it should be complemented by post-production evaluation. Real users will always find behavior that no scenario predicted.
5. Scoring Requires Calibration
A score of 0.72 means nothing without context. It is necessary to define:
- thresholds;
- hard-fail conditions;
- dimension weights;
- score interpretation;
- acceptable variance;
- the business meaning of each result.
For example, a CX score of 0.75 may be acceptable, but for Compliance it may not be. In regulated domains, even one hard fail may mean the entire scenario fails regardless of the overall score.
6. The Voice Stack Adds Instability
Telephony, STT, TTS, latency, call drops, and audio quality can affect the test result. Sometimes the problem is not in the SUT, but in the voice layer or telephony infrastructure. That is why voice role-play testing must account not only for response quality, but also for infrastructure factors.
7. Privacy and Compliance Cannot Be Added “Later”
If the system analyzes real transcripts, it needs:
- PII redaction;
- data retention policy;
- access control;
- audit logs;
- encryption;
- clear boundaries for data usage;
- rules for human review;
- separation between synthetic and production data.
This is especially important for banking, insurance, healthcare, and support domains where conversations may contain sensitive information.
8. The Framework Should Be Adaptive, Not Monolithic
Different clients will have different:
- voice agents;
- CRMs;
- contact center platforms;
- authentication flows;
- data policies;
- reporting needs;
- compliance requirements;
- integration constraints.
That is why the Role-Play Framework should be built as core + adapters, not as one rigid pipeline.
Role-Play testing works best not as a “magical AI check,” but as an engineering system with clear scenarios, isolated roles, controlled prompts, structured results, and human calibration.
Consistent Terminology
To keep the article from drifting between different names, it is worth using consistent terminology.
| Term | Meaning |
|---|---|
| SUT | System Under Test, meaning the voice/chat agent or human agent being evaluated |
| SimUser | Synthetic user that imitates a user in pre-production tests |
| Judge | LLM-based evaluator that assesses the transcript |
| Scenario | A formalized test case with user goal, persona, constraints, and success criteria |
| Transcript | Text version of the conversation |
| Metadata | Additional call data: latency, duration, interruptions, STT confidence, completion state |
| Rubric | Evaluation rules |
| Evaluation | Result of the Judge assessment |
| CX | Conversational Experience |
| OP | Operational Performance |
| Compliance | Alignment with rules, privacy boundaries, and domain requirements |
| Pre-production synthetic testing | Testing before release using SimUser |
| Post-production conversation evaluation | Evaluation of real conversations after release |
The Future of Voice Role-Play Testing
The near future of this kind of framework looks very practical.
Zero-Touch Integrations
There will be more zero-touch integrations, where testing can be launched without changing the production configuration. This is important for large enterprise clients, where any change in a CRM or contact center platform goes through a long approval process.
Automatic Scenario Generation
Automatic scenario generation will continue to evolve. If the system can access customer-specific configuration, intents, flows, knowledge base, or historical transcripts, it can automatically suggest baseline test scenarios:
- happy paths;
- edge cases;
- fallback cases;
- negative scenarios;
- compliance-sensitive scenarios;
- high-risk production scenarios.
Reporting UI
The role of the reporting UI will grow.
Teams do not need just a JSON file with a score. They need a clear picture:
- which scenarios consistently fail;
- which intents got worse after the last release;
- which flows have the worst CX;
- which prompts need to be rewritten;
- which training examples should be added;
- which knowledge gaps need to be closed.
Synthetic + Real-World Feedback Loop
Voice testing will increasingly combine synthetic and real-world data. Synthetic role-play allows testing before release. Real transcript evaluation shows what happens after release. Together, they create a feedback loop:

Stronger LLM-as-a-Judge Models
Stronger LLM-as-a-Judge models through AWS Bedrock, Claude, GPT-4o, or other enterprise-ready providers will make evaluation more accurate, stable, and domain-aware.
Especially if Judge prompts are mapped directly to:
- user goals;
- business rules;
- compliance requirements;
- domain-specific policies;
- historical failure patterns.
Conclusion: Voice Testing Must Become Systematic, Measurable, and Repeatable
Voice AI solutions cannot be tested properly using only happy path scenarios.
Real users speak with accents, interrupt, get nervous, get confused, change intent in the middle of the conversation, and still expect the system to understand their need and help them.
That is why voice QA/QC must move from occasional manual transcript analysis to systematic, repeatable, AI-assisted evaluation, where every conversation can be evaluated against the same criteria:
- intent recognition;
- flow completion;
- fallback handling;
- tone;
- clarity;
- responsiveness;
- compliance;
- user goal completion.
The Role-Play Framework provides a practical foundation for this:
- SimUser creates realistic user behavior;
- SUT goes through a production-like conversational flow;
- Judge evaluates the result using CX, OP, Compliance, and business-goal criteria;
- structured reports turn the subjective “the bot works poorly” into data;
- adapters allow different voice/chat agents to be connected;
- synthetic testing helps find problems before release;
- post-production evaluation shows what happens with real users after release;
- Docker, local LLMs, and Ollama make experiments cheaper;
- Bedrock, Claude, GPT-4o, and other stronger models open the path to enterprise-grade judging;
- ElevenLabs, Twilio, and the AWS voice stack make it possible to test not only text logic, but also real voice interactions.
And most importantly, this approach is not limited to bots.
It can be used to analyze human agents, coaching, compliance review, and conversation quality monitoring.
If a voice agent is already running, but its quality is difficult to measure, explain, or improve, Role-Play testing helps turn it from a black box into a system that can be tested, evaluated, improved, and scaled.
If you already have a voice solution, but it behaves inconsistently in real dialogues, confuses intents, sounds robotic, gets stuck in fallback loops, or does not provide clear reports for retraining, this is exactly the class of problems that Role-Play testing can detect, measure, and help fix.
We can help turn your voice agent from a black box into a system that can be tested, evaluated, improved, and scaled.