Input Types - Coval Documentation

The Simulation Input is the heart of a test case: it becomes the simulated user’s objective for the run. The input type determines how tightly the simulated user follows it — from improvising toward a high-level goal to delivering exact lines:

Type	What it is	Simulated user behavior
Scenario	High-level intent	Improvises freely toward the goal
Transcript	A reference conversation	Adapts as needed to match the flow
Script	Exact turns	Follows them precisely, word for word

Beyond these text inputs, you can also attach an image (WebSocket voice agents only) or a pre-recorded audio file to a test case.

Scenario

Describe the task or situation the simulated user is pursuing. This is the goal, not a script: the simulated user generates its own wording to pursue it, in the tone set by its persona. To suggest specific phrases for the persona to say, put them in quotation marks. Examples:

Simple task: “Call to get a refund”
Complex scenario: “First, ask for PTO from the 21st to the 22nd of March. After receiving a confirmation, ask to change to the 20th to 22nd. During the verification, share your email address as ‘emily [at] gmail [dot] com’. Then, proceed to correct yourself with ‘oh no - it’s actually emily [dot] marc [at] gmail [dot] com’.”

The more detailed your scenario, the more precisely the simulated user follows it. A precise input produces a narrow, focused conversation; a broader input produces more varied, improvised queries.

Transcript

Recreate a specific conversation using the OpenAI transcript format: a JSON array of messages, each with a role and a content field. The simulated user follows the user side of the transcript as closely as possible, adapting as needed to match the flow. Format example:

[
  {
    "role": "assistant",
    "content": "Welcome to X Restaurant. How may I assist you today?"
  },
  { "role": "user", "content": "I would like to order some pizza." }
]
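
Before pasting a transcript into a test case, it can help to check its shape locally. The following is a minimal sketch, not part of Coval: the helper name user_turns and the assumption that only the assistant and user roles appear are illustrative. It parses the JSON and returns the user-side lines that the simulated user would try to follow.

```python
import json

# Assumption: only these two roles are expected in a transcript input.
ALLOWED_ROLES = {"assistant", "user"}


def user_turns(transcript_json: str) -> list[str]:
    """Validate a transcript and return the user-side lines in order."""
    messages = json.loads(transcript_json)
    if not isinstance(messages, list):
        raise ValueError("transcript must be a JSON array of messages")
    turns = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"message {i}: expected an object")
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"message {i}: content must be a non-empty string")
        if role == "user":
            turns.append(content)
    return turns


example = """
[
  {"role": "assistant",
   "content": "Welcome to X Restaurant. How may I assist you today?"},
  {"role": "user", "content": "I would like to order some pizza."}
]
"""
print(user_turns(example))
```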

Script

Define an ordered list of exact turns for the simulated user to deliver, turn by turn. The persona follows the script exactly rather than generating responses with an LLM — while still using the configured persona voice and background sounds. Each turn is one of three types:

Text: the persona speaks the line verbatim.
DTMF: the persona presses keypad digits (0–9, *, #), for navigating IVR menus or entering numbers. For multi-digit values, the persona presses each digit in sequence.
Skip: the persona stays silent for that turn, e.g. while an IVR plays a long announcement before a keypress is appropriate.

Example script turns:

“Hi, I’d like to check my account balance.” (text)
Press 1 (DTMF)
“Yes, my account number is 12345.” (text)
“Thank you, goodbye.” (text)
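
The three turn types can be modeled as a small data structure, which is handy for drafting and checking scripts outside the editor. This is a sketch under assumptions: the Turn class, its field names, and the validate and keypresses helpers are hypothetical and do not reflect how Coval stores scripts.

```python
from dataclasses import dataclass
from typing import Literal

# Keys named in the DTMF description above: 0-9, * and #.
DTMF_KEYS = set("0123456789*#")


@dataclass(frozen=True)
class Turn:
    """Hypothetical model of one script turn."""
    kind: Literal["text", "dtmf", "skip"]
    value: str = ""


def validate(turn: Turn) -> None:
    """Raise ValueError if the turn does not fit its type."""
    if turn.kind == "text" and not turn.value.strip():
        raise ValueError("text turns need a line to speak")
    if turn.kind == "dtmf":
        if not turn.value or not set(turn.value) <= DTMF_KEYS:
            raise ValueError(f"invalid DTMF value: {turn.value!r}")
    if turn.kind == "skip" and turn.value:
        raise ValueError("skip turns carry no value")


def keypresses(turn: Turn) -> list[str]:
    """Expand a DTMF turn into individual key presses, in order."""
    validate(turn)
    return list(turn.value) if turn.kind == "dtmf" else []


# The example script above, expressed with this model.
script = [
    Turn("text", "Hi, I'd like to check my account balance."),
    Turn("dtmf", "1"),
    Turn("text", "Yes, my account number is 12345."),
    Turn("text", "Thank you, goodbye."),
]
for turn in script:
    validate(turn)
```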

How it works:

In the test set editor, select Script as the input type
Add ordered turns in the script editor: text (spoken lines), DTMF keypresses, or skips
During simulation, the persona delivers each turn in order instead of generating LLM responses
A divergence detector monitors agent responses — if the agent diverges significantly from the expected flow, the simulation can end early with a SCRIPT_DIVERGED reason
After the last scripted turn is delivered, the agent gets one final response before the simulation ends with a SCRIPT_COMPLETED reason

Script test cases give you deterministic persona speech while still exercising the full voice pipeline (TTS, turn-taking, background noise). Use them when you need control over exactly what the persona says but still want realistic audio delivery.
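
The control flow described under "How it works" can be sketched as a simple loop. This is an illustration only: run_script, agent_reply, and diverged are hypothetical names, the real divergence detector's logic is not documented here, and the sketch assumes the persona speaks first and that divergence is only checked while scripted turns remain.

```python
from typing import Callable


def run_script(
    lines: list[str],
    agent_reply: Callable[[str], str],
    diverged: Callable[[int, str], bool],
) -> tuple[str, list[tuple[str, str]]]:
    """Deliver scripted lines in order and return (end_reason, conversation log)."""
    log: list[tuple[str, str]] = []
    for index, line in enumerate(lines):
        log.append(("persona", line))
        # The agent always gets to respond, including after the last turn.
        reply = agent_reply(line)
        log.append(("agent", reply))
        has_more_turns = index < len(lines) - 1
        # Assumption: divergence only matters while turns are left to deliver.
        if has_more_turns and diverged(index, reply):
            return "SCRIPT_DIVERGED", log
    return "SCRIPT_COMPLETED", log


# Toy agent and detector, for illustration only.
demo_lines = ["Hi, I'd like to check my account balance.", "Thank you, goodbye."]
on_track = run_script(demo_lines, lambda s: "Sure, one moment.", lambda i, r: False)
off_track = run_script(demo_lines, lambda s: "Pizza order placed.", lambda i, r: True)
print(on_track[0], off_track[0])
```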

Image attachment

Image attachments work only with WebSocket voice agents. Other agent and simulator types ignore attached images.

Attach a single image to a test case so the simulated user can share it during a WebSocket voice conversation. This is useful for flows like sending a receipt, damage photo, insurance card, or product image after the agent asks for visual proof or context. Unlike the input types above, an image doesn’t replace the test case’s input; it’s an extra artifact the persona sends when relevant, on top of the scenario, transcript, or script.

Supported formats: .png, .jpg, .jpeg (max 2 MB, one image per test case).

How to attach an image:

In the test set editor, open a test case and click Add Media.
Upload a PNG or JPEG image and give it a short Name such as receipt_photo or broken_screen.
Optionally add a Description telling the persona when the image should be sent.
Attach the test set to a WebSocket voice agent with a media send template configured.
Launch the run using that attached WebSocket voice agent. During the conversation, Coval sends the image when the agent asks for relevant visual information.
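
The format and size limits can be checked locally before uploading. A minimal sketch, assuming "2 MB" means 2 × 1024 × 1024 bytes (the exact byte limit is not stated here); check_image is a hypothetical helper, not a Coval API.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".png", ".jpg", ".jpeg"}
# Assumption: "2 MB" is read as 2 MiB; the platform may count differently.
MAX_BYTES = 2 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


print(check_image("receipt_photo.png", 350_000))
```

For a file on disk, the size can be read with Path("receipt_photo.png").stat().st_size and passed in as size_bytes.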

Best practices:

Use short, stable names like receipt_photo or drivers_license_front.
Use the description to explain when to send the image, not just what it contains.
Keep the image tightly scoped to the task so the agent receives only the evidence it needs.

See WebSocket for payload configuration.

Audio upload

Upload a pre-recorded audio file containing the simulated user’s side of the conversation. It plays back exactly as recorded instead of generating persona speech with an LLM. Use it when you need repeatable, deterministic persona behavior, such as replaying real caller audio.

The audio is automatically transcribed, so persona turns still appear in the transcript. After playback completes, the simulation waits a 30-second grace period for the agent to finish responding, then ends the call.

You can optionally attach a ground truth transcript to each test case to enable the STT Word Error Rate (Audio Upload) metric, which measures your agent’s speech recognition accuracy against the known-correct transcript.
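
For intuition about what a word error rate measures, the standard definition is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The sketch below implements that textbook definition; how Coval normalizes text (casing, punctuation, numerals) before scoring is not stated here, so lowercasing and whitespace splitting are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level edit distance divided by reference length."""
    # Assumption: normalize by lowercasing and splitting on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


ground_truth = "I would like to order some pizza"
recognized = "I would like order some pasta"
print(round(word_error_rate(ground_truth, recognized), 3))
```

In this example the recognized text drops one word and substitutes another, so two errors are counted against seven reference words.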

