# Input Types

> Scenario, transcript, or script — how tightly the simulated user follows your input.

The **Simulation Input** is the heart of a test case: it becomes the simulated user's objective for the run. The input type determines how tightly the simulated user follows it — from improvising toward a high-level goal to delivering exact lines:

| Type           | What it is               | Simulated user behavior               |
| -------------- | ------------------------ | ------------------------------------- |
| **Scenario**   | High-level intent        | Improvises freely toward the goal     |
| **Transcript** | A reference conversation | Adapts as needed to match the flow    |
| **Script**     | Exact turns              | Follows them precisely, word for word |

Beyond these text inputs, you can also attach an [image](#image-attachment) (WebSocket voice agents only) or a [pre-recorded audio file](#audio-upload) to a test case.

## Scenario

Describe the task or situation the simulated user is pursuing. This is the **goal**, not a script: the simulated user generates its own wording to pursue it, in the tone set by its [persona](/concepts/test-sets/overview#test-case-vs-persona). To suggest specific phrases for the persona to say, put them in quotation marks.

Examples:

* Simple task: "Call to get a refund"
* Complex scenario: "First, ask for PTO from the 21st to the 22nd of March. After receiving a confirmation, ask to change to the 20th to 22nd. During the verification, share your email address as 'emily \[at] gmail \[dot] com'. Then, proceed to correct yourself with 'oh no - it's actually emily \[dot] marc \[at] gmail \[dot] com'."

The more detailed your scenario, the more precisely the simulated user follows it. A precise input produces a narrow, focused conversation; a broader input produces more varied, improvised queries.

## Transcript

Recreate a specific conversation using the OpenAI transcript format: a JSON array of messages, each with a `role` and a `content` field. The simulated user follows the user side of the transcript as closely as possible, adapting as needed to match the flow.

Format example:

```json
[
  {
    "role": "assistant",
    "content": "Welcome to X Restaurant. How may I assist you today?"
  },
  { "role": "user", "content": "I would like to order some pizza." }
]
```
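
Before pasting a transcript into a test case, it can help to check its shape locally. The following is a minimal sketch, not part of Coval's tooling: the function names `load_transcript` and `user_turns` are hypothetical, and the restriction to the `user` and `assistant` roles is an assumption based on the example above.

```python
import json

# Assumption: only the two roles shown in the example above are expected.
ALLOWED_ROLES = {"user", "assistant"}

# The example transcript from this page, as a JSON string.
transcript_json = """
[
  {"role": "assistant", "content": "Welcome to X Restaurant. How may I assist you today?"},
  {"role": "user", "content": "I would like to order some pizza."}
]
"""


def load_transcript(raw: str) -> list[dict]:
    """Parse a transcript and check that every message has a role and content."""
    messages = json.loads(raw)
    if not isinstance(messages, list):
        raise ValueError("transcript must be a JSON array of messages")
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            raise ValueError(f"message {i}: expected a JSON object")
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"message {i}: content must be a string")
    return messages


def user_turns(messages: list[dict]) -> list[str]:
    """Return the user-side lines, which are the ones the simulated user follows."""
    return [m["content"] for m in messages if m["role"] == "user"]


messages = load_transcript(transcript_json)
print(user_turns(messages))
```

Listing the user-side lines this way is a quick check that the transcript contains the turns you expect the simulated user to reproduce.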

## Script

Define an ordered list of exact turns for the simulated user to deliver. The persona follows the script exactly instead of generating responses with an LLM, while still using the configured persona voice and background sounds.

Each turn is one of three types:

* **Text**: the persona speaks the line verbatim.
* **DTMF**: the persona presses keypad digits (`0`–`9`, `*`, `#`), for navigating IVR menus or entering numbers. For a multi-digit value, the persona presses each digit in sequence.
* **Skip**: the persona stays silent for that turn, e.g. while an IVR plays a long announcement before a keypress is appropriate.

Example script turns:

1. "Hi, I'd like to check my account balance." *(text)*
2. Press `1` *(DTMF)*
3. "Yes, my account number is 12345." *(text)*
4. "Thank you, goodbye." *(text)*
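
One way to reason about the three turn types is to model them as data. The sketch below is illustrative only: `Turn`, `validate`, and `keypresses` are hypothetical names, not Coval's schema. It encodes the rules stated above, namely that DTMF values use only `0`–`9`, `*`, and `#`, and that a multi-digit value is pressed one key at a time.

```python
from dataclasses import dataclass
from typing import Literal

# Keys a DTMF turn may contain, as listed above.
DTMF_KEYS = set("0123456789*#")


@dataclass(frozen=True)
class Turn:
    """Hypothetical in-memory model of one script turn."""
    kind: Literal["text", "dtmf", "skip"]
    value: str = ""


def validate(turn: Turn) -> None:
    """Raise ValueError if a turn does not fit one of the three types."""
    if turn.kind == "text":
        if not turn.value.strip():
            raise ValueError("a text turn needs a line to speak")
    elif turn.kind == "dtmf":
        if not turn.value or not set(turn.value) <= DTMF_KEYS:
            raise ValueError(f"invalid DTMF value: {turn.value!r}")
    elif turn.kind == "skip":
        if turn.value:
            raise ValueError("a skip turn carries no value")
    else:
        raise ValueError(f"unknown turn kind: {turn.kind!r}")


def keypresses(turn: Turn) -> list[str]:
    """Split a DTMF value into the individual keys pressed in sequence."""
    return list(turn.value)


# The example script from this page.
script = [
    Turn("text", "Hi, I'd like to check my account balance."),
    Turn("dtmf", "1"),
    Turn("text", "Yes, my account number is 12345."),
    Turn("text", "Thank you, goodbye."),
]
for turn in script:
    validate(turn)

print(keypresses(Turn("dtmf", "12345#")))
```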

**How it works:**

1. In the test set editor, select **Script** as the input type.
2. Add ordered turns in the script editor: text, DTMF keypresses, or skips.
3. During simulation, the persona delivers each turn in order instead of generating LLM responses.
4. A divergence detector monitors agent responses. If the agent diverges significantly from the expected flow, the simulation can end early with a `SCRIPT_DIVERGED` reason.
5. After the last scripted turn is delivered, the agent gets one final response before the simulation ends with a `SCRIPT_COMPLETED` reason.

<Tip>
  Script test cases give you deterministic persona speech while still exercising the full voice pipeline (TTS, turn-taking, background noise). Use them when you need control over exactly what the persona says but still want realistic audio delivery.
</Tip>
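
The control flow of a script run can be sketched as a simple loop. This is a conceptual model under stated assumptions, not Coval's implementation: `run_script`, `agent`, and `diverged` are hypothetical, the real divergence detector's logic is not documented here, and the sketch assumes the final agent response is not checked for divergence.

```python
from typing import Callable


def run_script(
    turns: list[str],
    agent: Callable[[str], str],
    diverged: Callable[[int, str], bool],
) -> tuple[str, list[tuple[str, str]]]:
    """Deliver scripted turns in order and return (end reason, conversation log)."""
    log: list[tuple[str, str]] = []
    for index, turn in enumerate(turns):
        log.append(("persona", turn))   # scripted line, no LLM generation
        reply = agent(turn)
        log.append(("agent", reply))
        is_last = index == len(turns) - 1
        # Assumption: the agent's final response is allowed through unchecked.
        if not is_last and diverged(index, reply):
            return "SCRIPT_DIVERGED", log
    # After the last scripted turn the agent has had one final response.
    return "SCRIPT_COMPLETED", log


# Hypothetical stand-ins for an agent and a divergence detector.
def on_track_agent(line: str) -> str:
    return f"Got it: {line}"


def off_track_agent(line: str) -> str:
    return "Transferring you to sales."


def detector(index: int, reply: str) -> bool:
    return "transferring" in reply.lower()


turns = ["Hi, I'd like to check my account balance.", "Thank you, goodbye."]
print(run_script(turns, on_track_agent, detector)[0])
print(run_script(turns, off_track_agent, detector)[0])
```

The point of the sketch is the two exit paths: an early return when the agent leaves the expected flow, and a normal return once every scripted turn has been delivered and answered.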

## Image attachment

<Note>
  Image attachments work **only with WebSocket voice agents**. Other agent and simulator types ignore attached images.
</Note>

Attach a single image to a test case so the simulated user can share it during a WebSocket voice conversation — useful for flows like sending a receipt, damage photo, insurance card, or product image after the agent asks for visual proof or context. Unlike the input types above, an image doesn't replace the test case's input; it's an extra artifact the persona sends when relevant, on top of the scenario, transcript, or script.

Supported formats: `.png`, `.jpg`, `.jpeg` (max 2 MB, one image per test case).
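
A local pre-upload check can catch unsupported files early. The sketch below is a hypothetical helper, not a Coval API. It assumes extensions are matched case-insensitively and reads "2 MB" conservatively as 2,000,000 bytes; the exact byte limit enforced by the platform is not stated here.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".png", ".jpg", ".jpeg"}
# Assumption: "2 MB" interpreted conservatively as 2,000,000 bytes.
MAX_BYTES = 2_000_000


def image_problems(filename: str, size_bytes: int) -> list[str]:
    """Return a list of reasons the file would not qualify (empty if none)."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


print(image_problems("receipt_photo.png", 350_000))
print(image_problems("broken_screen.gif", 3_500_000))
```

For a file on disk, `Path(path).stat().st_size` gives the size to pass as `size_bytes`.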

**How to attach an image:**

1. In the test set editor, open a test case and click **Add Media**.
2. Upload a PNG or JPEG image and give it a short **Name** such as `receipt_photo` or `broken_screen`.
3. Optionally add a **Description** telling the persona when the image should be sent.
4. Attach the test set to a WebSocket voice agent with a media send template configured.
5. Launch the run with that WebSocket voice agent. During the conversation, Coval sends the image when the agent asks for relevant visual information.

**Best practices:**

* Use short, stable names like `receipt_photo` or `drivers_license_front`.
* Use the description to explain *when* to send the image, not just what it contains.
* Keep the image tightly scoped to the task so the agent receives only the evidence it needs.

See [WebSocket](/concepts/agents/connections/websocket#media-send-template) for payload configuration.

## Audio upload

Upload a pre-recorded audio file containing the simulated user's side of the conversation. The file plays back exactly as recorded, replacing LLM-generated persona speech. Use it when you need repeatable, deterministic persona behavior, such as replaying real caller audio. The audio is automatically transcribed, so persona turns still appear in the transcript. After playback completes, the simulation waits out a 30-second grace period so the agent can finish responding, then ends the call.

You can optionally attach a ground-truth transcript to each test case to enable the [STT Word Error Rate (Audio Upload)](/concepts/metrics/types/trace#stt-word-error-rate-audio-upload) metric, which measures your agent's speech recognition accuracy against the known-correct transcript.
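
For intuition, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the reference into the recognized text, divided by the number of reference words. The sketch below implements that standard definition. It is not Coval's metric code, and the lowercase-and-split normalization is an assumption; the platform's own normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    # Assumption: simple normalization (lowercase, split on whitespace).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


ground_truth = "i would like to order some pizza"
recognized = "i would like to order pizza"
print(word_error_rate(ground_truth, recognized))
```

Here the recognized text drops one of the seven reference words, so the result is one deletion out of seven words.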