Use accent testing when you want to know whether a voice agent understands and serves callers regardless of accent. Run the same agent, test set, and metrics across multiple accent personas, then compare the results in a multi-run report.

This workflow is for voice simulations. Chat simulations do not exercise speech recognition, so they cannot tell you whether your agent understands accented speech.

The goal is not to produce a leaderboard. The goal is to find which accents change outcomes, then decide whether the next fix belongs in your agent prompt, speech recognition setup, tool handling, tracing, or accent coverage.

In Testing Across Audio Qualities, the audio-quality personas ship built-in. Accent testing differs: there is no built-in persona for each accent. You create one persona per accent in your own organization, each using a distinct accent voice, and you make every accent persona mirror your baseline persona so the only variable is the accent.

Use An AI Agent

If you use Coval Agent Skills, an AI agent can handle both the setup and the follow-up analysis. Use the run-accent-testing skill to create the accent personas, launch the runs, and create the saved multi-run report (grouped by Persona) for you via the reports API. After the report exists, use the analyze-accent-report skill to turn the report into recommended agent fixes. To have an AI agent run this workflow for you, paste this prompt into your coding agent or local LLM:
Use the Coval `run-accent-testing` skill:
https://github.com/coval-ai/coval-external-skills/tree/main/skills/runs/run-accent-testing

I want to test this voice agent across speaker accents:
<paste Coval agent URL or agent ID>

Use this test set:
<paste Coval test set URL or test set ID>

Create one persona per accent — Standard Customer is the baseline, plus Indian (voice vidya), German-accented (marshal), Chinese-accented (ziyu), US Southern (cletus), Nigerian (kehinde), and Malaysian (darryl). Make each accent persona mirror Standard Customer's behavior so the only difference is the accent voice. These are lower-concurrency voices, so keep concurrency low and expect runs to take longer — that's fine. Run the same sampled cases and metrics across all personas, wait for them to finish, then create a multi-run report grouped by Persona so I can compare each accent against Standard Customer.

After the report exists, summarize the largest regressions and include the report URL plus representative simulation links.

1. Choose A Voice Agent

Pick one voice agent to test. For the cleanest comparison, keep the agent configuration fixed across all runs. For agents that emit traces, include trace-based speech-recognition and timing metrics such as STT Word Error Rate, time to first byte, or provider latency. If your agent is not sending traces yet, set up OpenTelemetry traces so Coval can measure speech recognition and agent-side timing alongside the recording. You can also have your coding agent help instrument traces using the Coval tracing skills.

2. Create The Accent Personas

Select a neutral baseline plus one persona per accent. Each accent persona uses a distinct accent voice; every other setting mirrors the baseline so differences come from the accent, not behavior.
| Coval Persona | Voice | Accent |
| --- | --- | --- |
| Standard Customer | (built-in baseline) | Neutral baseline |
| Indian Accent (Vidya) | vidya | Indian |
| German Accent (Marshal) | marshal | German-accented |
| Chinese Accent (Ziyu) | ziyu | Chinese-accented |
| US Southern Accent (Cletus) | cletus | US Southern |
| Nigerian Accent (Kehinde) | kehinde | Nigerian |
| Malaysian Accent (Darryl) | darryl | Malaysian |
The accent personas are not built-in, so create them in your organization. In the Coval app, open Personas → New Persona and, for each accent, set the matching voice while keeping the same behavior prompt as your Standard Customer baseline. See Personas for the full list of voice and persona options. The run-accent-testing skill automates this: it reads your Standard Customer persona, then creates any missing accent personas with the same behavior and the correct accent voice.
Accent voices are locale-bound, so they do not all accept the same language code. Most accent voices reject en-US, so use the base en for every accent persona rather than copying a regional code like en-US from your baseline. Keeping the language identical across personas means the accent voice stays the only variable.
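The mirroring rule can be sketched in a few lines of Python. This is only an illustration: the persona field names (`name`, `voice`, `language`, `behavior_prompt`) and the baseline values are hypothetical and may not match the real persona schema. Only the voice IDs and the `en` language code come from this page.

```python
from copy import deepcopy

# Persona names and voice IDs as listed in the persona table.
ACCENT_VOICES = {
    "Indian Accent (Vidya)": "vidya",
    "German Accent (Marshal)": "marshal",
    "Chinese Accent (Ziyu)": "ziyu",
    "US Southern Accent (Cletus)": "cletus",
    "Nigerian Accent (Kehinde)": "kehinde",
    "Malaysian Accent (Darryl)": "darryl",
}


def mirror_baseline(baseline: dict, name: str, voice: str) -> dict:
    """Copy the baseline persona, changing only name, voice, and language."""
    persona = deepcopy(baseline)
    persona["name"] = name
    persona["voice"] = voice
    persona["language"] = "en"  # base code; most accent voices reject en-US
    return persona


# Hypothetical baseline shape, for illustration only.
baseline = {
    "name": "Standard Customer",
    "voice": "default",
    "language": "en-US",
    "behavior_prompt": "You are a polite customer calling about an order.",
}

accent_personas = [
    mirror_baseline(baseline, name, voice) for name, voice in ACCENT_VOICES.items()
]
```

Copying the baseline and overriding only the fields that must differ makes it easy to confirm that the behavior prompt is byte-for-byte identical across personas.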
Use the same test set for every persona. If you subsample a test set, keep the same sampled cases across personas so differences come from the accent, not case selection.
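One way to keep the sampled cases identical across personas is to sample once with a fixed seed and reuse the result. The sketch below assumes you hold the test case IDs in a list; the `case-NNN` IDs and the `sample_cases` helper are hypothetical, not part of Coval.

```python
import random


def sample_cases(case_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick k cases reproducibly so every persona run sees the same subset."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(case_ids), k))


all_cases = [f"case-{i:03d}" for i in range(50)]  # hypothetical case IDs
shared_subset = sample_cases(all_cases, k=10, seed=42)
# Reuse shared_subset for every persona instead of resampling per run.
```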
The accent voices are lower-concurrency voices. Expect this suite to run at low concurrency, and expect simulations to queue and finish more slowly than a standard sweep. This is normal for accent testing — it is not a failure. Launch the runs at a low concurrency (2 is a safe default) rather than maximizing parallelism.

3. Select Metrics

Accent testing is primarily a speech-recognition stress test. Lead with recognition, then task success and call shape:
| Goal | Useful Metrics |
| --- | --- |
| Speech recognition | Transcription Error (word error rate from recorded audio, no traces required) is the reliable default. Add STT Word Error Rate only after confirming your agent emits STT trace spans and the metric actually scores; otherwise it silently fails. |
| Task outcome | Composite evaluation, task-completion LLM judges, or scenario-specific pass/fail metrics |
| Responsiveness | Latency, time to first audio, trace TTFB, or provider response-time metrics |
| Conversation flow | Turn count, repeated confirmation or clarification loops, early termination, silence, and turn-level timing metrics |
Do not use Percent Audio Above 300 Hz as a perceived audio-quality score. It measures pitch distribution, not accent comprehension, and accent testing is about whether your agent understands the caller, not how the audio sounds.
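For intuition about what a word error rate captures, here is a minimal sketch of the textbook definition: word-level edit distance divided by the number of reference words. It is not Coval's implementation, and the built-in metrics may normalize text differently (punctuation, numerals, casing).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word ("four" heard as "for") out of seven reference words.
wer = word_error_rate(
    "my account number is four two seven",
    "my account number is for two seven",
)
```

A single misheard digit yields a low error rate here but can still break the task, which is why recognition metrics are paired with task-outcome metrics.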

4. Launch The Runs

Launch voice simulations with:
  • one agent
  • one test set
  • the baseline plus accent personas listed above
  • the same metrics for every persona
  • a low concurrency, because these are lower-concurrency voices
Coval creates separate runs for each selected persona. In this workflow, each persona represents one accent. This keeps each accent comparable while still letting you analyze the set together.
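Before launching, it can help to check that the persona really is the only thing that varies. The sketch below models a launch plan as plain data and validates it. `PlannedRun`, `build_plan`, `check_plan`, and the example IDs and metric names are hypothetical helpers for illustration, not Coval API objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedRun:
    agent_id: str
    test_set_id: str
    persona: str
    metrics: tuple[str, ...]
    concurrency: int


def build_plan(agent_id, test_set_id, personas, metrics, concurrency=2):
    """One planned run per persona; everything else is held constant."""
    return [
        PlannedRun(agent_id, test_set_id, persona, tuple(metrics), concurrency)
        for persona in personas
    ]


def check_plan(plan):
    """Return a list of problems; an empty list means only the persona varies."""
    problems = []
    for field in ("agent_id", "test_set_id", "metrics", "concurrency"):
        if len({getattr(run, field) for run in plan}) > 1:
            problems.append(f"{field} differs across runs")
    personas = [run.persona for run in plan]
    if len(set(personas)) != len(personas):
        problems.append("duplicate persona")
    return problems


personas = [
    "Standard Customer",
    "Indian Accent (Vidya)",
    "German Accent (Marshal)",
    "Chinese Accent (Ziyu)",
    "US Southern Accent (Cletus)",
    "Nigerian Accent (Kehinde)",
    "Malaysian Accent (Darryl)",
]
plan = build_plan("agent-123", "testset-456", personas, ["Transcription Error", "Latency"])
```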

5. Compare Accents

After the runs finish, create a multi-run report grouped by Persona.
  1. Open the runs list.
  2. Select the completed runs from the accent persona set.
  3. Create a multi-run report.
  4. Set Compare by to Persona.
  5. Use the grouped view to compare aggregate scores, speech-recognition accuracy, and latency across accents.
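The report can also be created through the reports API, which this workflow names as POST /v1/reports with compare_by: "persona". The sketch below only assembles a request description and does not send it. The path and the `compare_by` value come from this page; the `name` and `run_ids` fields are assumptions, so check the API reference for the real request schema.

```python
import json


def build_report_request(run_ids: list[str], name: str) -> dict:
    """Describe a grouped multi-run report request without sending it."""
    return {
        "method": "POST",
        "path": "/v1/reports",
        "body": {
            "name": name,             # hypothetical field name
            "run_ids": run_ids,       # hypothetical field name
            "compare_by": "persona",  # value named in this workflow
        },
    }


request = build_report_request(["run-1", "run-2"], "Accent sweep")
payload = json.dumps(request["body"], indent=2)
```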
Look for regressions that appear only under specific accents. For example, a high task-success baseline with worse results and higher word error rate for one accent suggests a speech-recognition robustness issue for that accent rather than a general agent-quality issue.

Before attributing a regression to the accent, check whether Standard Customer regressed on the same metric and the same test case. A failure the neutral baseline also hits (a tool that times out, an over-strict judge, a hard scenario) is an agent, tool, or metric issue, not an accent issue. Accent testing often surfaces these baseline-wide problems first, so confirm the neutral voice passes the case before concluding the accent caused the drop.

When you build the report by hand in the app, the grouping is a manual step: the report builder defaults Compare by to None and there is no URL parameter to preset it, so set Compare by → Persona in the report view. Creating the report via POST /v1/reports with compare_by: "persona" skips this step because it saves the report already grouped.

Also scan for UNKNOWN, missing, or unscored metric results. Under heavy accent stress, a judge may be unable to evaluate the conversation because the call ended early, the transcript is too sparse, or the interaction became too anomalous. Treat that as a signal to inspect the recording, not just as missing data.

Because the accent voices are lower-concurrency, each accent run may have a smaller sample. Treat small-sample conclusions as tentative.
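The baseline check described above can be sketched as a per-case comparison. This assumes you have exported per-case results into simple records. The `CaseResult` shape and the label strings are hypothetical, and `None` stands in for an UNKNOWN or unscored result.

```python
from dataclasses import dataclass
from typing import Optional

BASELINE = "Standard Customer"


@dataclass
class CaseResult:
    persona: str
    case_id: str
    passed: Optional[bool]  # None means UNKNOWN, missing, or unscored


def classify(results, baseline=BASELINE):
    """Label each non-passing accent result against the baseline on the same case."""
    baseline_by_case = {r.case_id: r.passed for r in results if r.persona == baseline}
    labels = {}
    for r in results:
        if r.persona == baseline or r.passed is True:
            continue
        key = (r.persona, r.case_id)
        if r.passed is None:
            labels[key] = "inspect recording (unscored)"
        elif baseline_by_case.get(r.case_id) is True:
            labels[key] = "accent-specific candidate"
        else:
            # Baseline failed, was unscored, or never ran this case.
            labels[key] = "baseline-wide or unverified"
    return labels


results = [
    CaseResult("Standard Customer", "case-001", True),
    CaseResult("Standard Customer", "case-002", False),
    CaseResult("Nigerian Accent (Kehinde)", "case-001", False),
    CaseResult("Nigerian Accent (Kehinde)", "case-002", False),
    CaseResult("German Accent (Marshal)", "case-001", None),
]
labels = classify(results)
```

Only the "accent-specific candidate" rows point at the accent; the others need the baseline fixed or the recording inspected first.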

6. Spot-Check Simulations

Do not stop at pass/fail metric columns. An accent can pass binary task metrics while the recording shows the agent mis-heard a name, number, or address. Treat very short calls, very long calls, latency spikes, and UNKNOWN or unscored metrics as spot-check triggers.
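These triggers can be turned into a simple filter over exported simulation data. The sketch below is one possible heuristic: the `Simulation` fields, the 0.5x and 2x duration bounds, and the 3-second latency limit are all assumptions chosen for illustration, so tune them to your own calls.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Simulation:
    sim_id: str
    duration_s: float
    max_latency_s: float
    metric_result: Optional[str]  # "PASS", "FAIL", "UNKNOWN", or None if missing


def spot_check_reasons(sim, typical_duration_s, latency_limit_s=3.0):
    """Return the reasons a simulation deserves a listening pass."""
    reasons = []
    if sim.duration_s < 0.5 * typical_duration_s:
        reasons.append("very short call")
    if sim.duration_s > 2.0 * typical_duration_s:
        reasons.append("very long call")
    if sim.max_latency_s > latency_limit_s:
        reasons.append("latency spike")
    if sim.metric_result in (None, "UNKNOWN"):
        reasons.append("unknown or unscored metric")
    return reasons


sims = [
    Simulation("sim-1", 60.0, 1.2, "PASS"),
    Simulation("sim-2", 62.0, 4.5, "PASS"),
    Simulation("sim-3", 20.0, 1.0, "UNKNOWN"),
    Simulation("sim-4", 150.0, 1.1, "FAIL"),
    Simulation("sim-5", 58.0, 0.9, None),
]
typical = median(s.duration_s for s in sims)
flagged = {s.sim_id: spot_check_reasons(s, typical) for s in sims}
flagged = {sim_id: reasons for sim_id, reasons in flagged.items() if reasons}
```

Note that `sim-2` passes its metric but is still flagged for a latency spike, which matches the advice not to stop at pass/fail columns.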
Open representative completed simulations from each accent, especially the lowest-scoring and most surprising rows from the grouped report. Listen to the recording and read the transcript to confirm how your agent handled the accent.
| What To Review | What To Look For |
| --- | --- |
| Speech recognition | Whether the transcript shows mis-recognized words, and whether your agent captured names, numbers, and addresses correctly. |
| Recovery | Whether your agent reads back or confirms critical details, and recovers from a misheard value instead of acting on the wrong one. |
| Confirmation loops | Whether one accent triggers extra clarification or repeat-back loops that the baseline did not need. |
| Task outcome | Whether mis-recognition propagated into a wrong task outcome, even when individual turns looked plausible. |
If the listening pass affects a release decision, send representative simulations to Human Review. Use a review project to collect ground-truth labels for questions such as whether your agent captured the required information, recovered after mis-recognition, and completed the task. Use Collaborative mode when you want one shared answer per simulation, or Individual mode when you want independent reviewer agreement.

7. Understand The Results

Set Compare by to Persona and use the grouped view so each row represents one accent. Compare every accent against Standard Customer, then inspect the accents whose speech recognition, task success, latency, or call shape changed the most. In your analysis, lead with the conclusions that explain what changed:
  • the largest accent regressions compared with Standard Customer
  • the affected speech-recognition, task-success, latency, or call-shape metrics
  • any UNKNOWN, missing, or unscored metric results that point to anomalous conversations
  • representative simulation links for the most important regressions and one healthy baseline
  • Human Review results or reviewer agreement, if you used manual labels
  • the recommended next step from your report analysis, such as prompt changes, STT/confirmation adjustments, accent-robust routing, trace setup, or expanded accent coverage
To have an AI agent produce this analysis from the report, use the analyze-accent-report skill:
Use the Coval `analyze-accent-report` skill.

I completed the Testing Across Accents workflow and created this multi-run report:
<paste Coval report URL, report export, or run IDs>

Analyze the report by accent against Standard Customer. The report is grouped by Persona in Coval; interpret those personas as the accents being tested. Use metric deltas, UNKNOWN or unscored metrics, call-shape changes such as turn count and audio duration, representative simulations, recordings, transcripts, traces if present, and Human Review labels if present.

Tell me:
- which accents regressed most or became anomalous, and on which metrics
- what likely failed in my agent, especially whether speech recognition mis-transcribed accented speech and whether that changed the task outcome
- the recommended next fix, such as prompt changes, STT/confirmation adjustments, accent-robust routing, trace setup, or expanded accent coverage
- which accents, test cases, and metrics I should rerun after the fix to confirm improvement

Keep agent-side fixes separate from Coval metric or test setup changes. The accent voices are lower-concurrency, so per-accent samples may be small — flag small-sample conclusions. Do not guess from the aggregate table alone; inspect representative simulations when they are available.