Overview
WebSocket voice agents stream audio over a single persistent WebSocket connection. Coval can exchange raw binary PCM frames or JSON envelopes that wrap base64-encoded PCM / MP3 audio, plus configured non-audio events (cart updates, session signals) the agent emits. Use this connection type for voice agents that:
- Stream audio over WebSocket rather than SIP, WebRTC, or HTTP.
- Receive Coval’s Linear PCM audio at a fixed sample rate and return PCM or MP3 audio.
- Optionally send structured side-events, such as cart updates or session status messages, alongside audio.
Connection modes
| Mode | When to use |
|---|---|
| Direct | The agent exposes a stable wss:// URL Coval can dial directly. |
| HTTP-first | The agent requires an HTTP setup call to provision a per-session WebSocket URL before the audio stream begins. |
In HTTP-first mode, Coval calls the setup endpoint, reads the WebSocket URL from the response at websocket_url_response_path, and opens the audio WebSocket against that URL.
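As a rough illustration of the HTTP-first lookup, the sketch below walks a dot-notation path through a parsed setup response. The helper name resolve_path and the sample response are hypothetical; Coval's own resolver may treat edge cases (such as list indices) differently.

```python
from typing import Any


def resolve_path(payload: Any, path: str) -> Any:
    """Follow a dot-notation path (e.g. "data.websocket_url") through nested dicts."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"segment {key!r} not found while resolving {path!r}")
        current = current[key]
    return current


# Hypothetical setup response returned by the agent's HTTP endpoint.
setup_response = {"data": {"websocket_url": "wss://example.com/ws/session-123"}}

# websocket_url_response_path would be "data.websocket_url" for this response.
ws_url = resolve_path(setup_response, "data.websocket_url")
```

A missing segment raises KeyError here, which makes a wrong websocket_url_response_path easy to spot while testing the setup call by hand.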
Authentication
WebSocket voice agents authenticate during the WebSocket upgrade.
- Authorization header: set authorization_header to the auth value Coval should send during the WebSocket upgrade. Values like Bearer <ACCESS_TOKEN> and Basic <BASE64_CREDENTIALS> are sent as the Authorization header value. Values like X-API-Key <KEY> are sent as the X-API-Key header.
- Query-string token: when the agent only supports browser-style auth, encode the token directly in the endpoint, for example wss://example.com/ws?token=....
- Custom headers: custom_headers accepts additional upgrade headers. In the UI, add header name/value rows. Through the API, send metadata.custom_headers as a JSON object or as a JSON-encoded object string, for example {"X-Foo":"bar"} or "{\"X-Foo\":\"bar\"}".
Put the primary auth value in authorization_header; use custom_headers for additional named headers. Tokens included directly in the endpoint query string may be visible anywhere URLs are logged, so prefer authorization_header when the agent supports it.
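The header mapping described above can be pictured with the following sketch. It is not Coval's implementation: the function name upgrade_headers is hypothetical, and the handling of a bare value with no scheme is an assumption made only so the example is complete.

```python
import json
from typing import Dict, Optional, Union


def upgrade_headers(
    authorization_header: str,
    custom_headers: Optional[Union[str, Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Build the headers for the WebSocket upgrade request (illustrative only)."""
    headers: Dict[str, str] = {}

    # custom_headers may arrive as a JSON object or as a JSON-encoded object string.
    if isinstance(custom_headers, str):
        custom_headers = json.loads(custom_headers)
    if custom_headers:
        if not isinstance(custom_headers, dict):
            raise ValueError("custom_headers must decode to a JSON object")
        headers.update(custom_headers)

    value = authorization_header.strip()
    if value:
        scheme, _, rest = value.partition(" ")
        if scheme.lower() in ("bearer", "basic"):
            # Sent whole as the Authorization header value.
            headers["Authorization"] = value
        elif rest:
            # "X-API-Key <KEY>" style: the first token names the header.
            headers[scheme] = rest.strip()
        else:
            # Assumption: a bare value is passed through as Authorization.
            headers["Authorization"] = value
    return headers
```

For example, upgrade_headers("X-API-Key k1", '{"X-Foo":"bar"}') yields a dictionary with both X-API-Key and X-Foo entries.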
Audio transport
Audio can be exchanged as raw PCM bytes or as JSON envelopes containing a base64-encoded audio payload. The default JSON shape is audio_chunk / data; the JSON audio preset uses audio_message / audio_bytes. Both JSON shapes are configurable per agent, and setting send_audio_template to exactly {{audio_data}} makes outbound audio raw bytes instead.
Coval’s simulator only sends Linear PCM. The JSON audio preset uses:
- Codec: PCM (linear)
- Sample rate: 16 000 Hz
- Bit depth: 16-bit
- Endianness: little-endian
- Channels: 1 (mono)
- Recommended frame duration for peer implementations: 20-100 ms
Use audio_message_type_value to identify the agent frames that contain inbound audio, and use send_audio_template to shape Coval-originated audio frames. For the JSON audio preset, Coval sends audio_message frames with sender: "USER" and the agent should send its own audio_message frames with sender: "AI".
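To make the format concrete, the sketch below computes frame sizes for 16 kHz, 16-bit, mono PCM and renders one outbound frame from the default template. The function names are hypothetical, and plain string substitution is an assumption about how the template is filled.

```python
import base64
import json
from typing import Union

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono

DEFAULT_TEMPLATE = '{"type":"audio_chunk","data":"{{audio_data}}"}'


def frame_size_bytes(duration_ms: int) -> int:
    """Bytes of PCM needed for one frame of the given duration."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS


def render_audio_frame(template: str, pcm: bytes) -> Union[bytes, str]:
    """Return raw bytes for the bare placeholder, otherwise a JSON text frame."""
    if template == "{{audio_data}}":
        return pcm
    encoded = base64.b64encode(pcm).decode("ascii")
    return template.replace("{{audio_data}}", encoded)


silence = bytes(frame_size_bytes(20))  # one 20 ms frame of silence
frame = render_audio_frame(DEFAULT_TEMPLATE, silence)
decoded = json.loads(frame)
```

A 20 ms frame at these settings is 640 bytes and a 100 ms frame is 3200 bytes, which bounds the recommended frame sizes listed above.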
Audio format fields
| Field | Default | Purpose |
|---|---|---|
| endpoint | – (required) | wss:// URL Coval connects to in direct mode. Plain ws://, http://, and https:// endpoints are rejected for direct WebSocket connections. |
| connection_mode | direct | direct or http_first. |
| initialization_json | empty | Optional JSON object Coval sends after the WebSocket upgrade and before any ready-message wait. |
| send_audio_template | {"type":"audio_chunk","data":"{{audio_data}}"} | Outbound JSON template. Must contain {{audio_data}}. Setting it to literally {{audio_data}} sends raw PCM bytes (no JSON wrapping). |
| message_type_path | type | Dot-notation path to the field that names the message kind. |
| audio_message_type_value | audio_chunk | Value that identifies an inbound audio frame. Use * to treat every JSON message as audio. |
| audio_data_path | data | Dot-notation path to the base64 audio payload inside an inbound frame. |
| audio_encoding | pcm | Inbound JSON audio payload encoding: pcm or mp3. MP3 frames are decoded to 16 kHz mono PCM before evaluation. |
| receive_audio_channels | 2 | 1 for mono inbound JSON PCM, 2 to keep the legacy stereo-to-mono averaging behavior. |
| send_sample_rate_hertz | 16000 | Outbound sample rate Coval sends to the agent. Allowed: 8 000, 16 000, 24 000, 48 000. |
| receive_sample_rate_hertz | 48000 | Sample rate the agent sends. Allowed: 8 000, 16 000, 24 000, 48 000. |
| pipeline_sample_rate_hertz | 16000 | Coval processing rate; must stay 16 000. |
| pace_inbound_binary_audio | inferred | Pace inbound binary PCM in real time so resampling and metrics see realistic timing. Defaults on when outbound audio is configured for raw PCM bytes and off for JSON templates. |
Nested fields are addressed with dot notation, for example payload.audio.data. Match send_sample_rate_hertz / receive_sample_rate_hertz to the agent’s actual stream format; mismatched sample rates can cause speed, pitch, or quality issues.
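The inbound fields can be read together as a small decoding routine. The sketch below is one possible interpretation, not Coval's code: the function names are hypothetical, and the stereo downmix assumes interleaved left/right 16-bit little-endian samples.

```python
import base64
import json
import struct
from typing import Any, Optional


def get_path(message: Any, path: str) -> Any:
    """Resolve a dot-notation path, returning None when a segment is missing."""
    current = message
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_pcm(
    raw: str,
    message_type_path: str = "type",
    audio_message_type_value: str = "audio_chunk",
    audio_data_path: str = "data",
) -> Optional[bytes]:
    """Return decoded audio bytes if the JSON frame is an audio frame, else None."""
    message = json.loads(raw)
    kind = get_path(message, message_type_path)
    # "*" treats every JSON message as audio.
    if audio_message_type_value != "*" and kind != audio_message_type_value:
        return None
    payload = get_path(message, audio_data_path)
    if not isinstance(payload, str):
        return None
    return base64.b64decode(payload)


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved L/R int16 samples into a single mono channel."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    mono = [(left + right) // 2 for left, right in zip(samples[0::2], samples[1::2])]
    return struct.pack(f"<{len(mono)}h", *mono)
```

Note that averaging pairs halves the number of samples, which is why mono audio processed with receive_audio_channels set to 2 comes out with the wrong timing.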
HTTP-first setup fields
These fields apply when connection_mode is http_first:
| Field | Default | Purpose |
|---|---|---|
| http_url | – (required) | https:// setup endpoint Coval calls before opening the WebSocket. |
| http_method | POST | Setup request method. Allowed: GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS. |
| http_request_body | {} | JSON object body for the setup request. |
| http_headers | {} | JSON object of headers for the setup request. |
| websocket_url_response_path | – (required) | Dot-notation path to the WebSocket URL in the setup response, for example data.websocket_url. |
| authorization_header | empty | Auth value for the WebSocket upgrade after setup. This is separate from http_headers. |
| custom_headers | {} | Additional headers for the WebSocket upgrade after setup. |
Handshake
| Field | Default | Purpose |
|---|---|---|
| handshake_ready_message_type | session_ready in direct mode; empty in HTTP-first mode | Set to an empty string to skip the ready-message wait. |
| handshake_requires_session_id | true in direct mode; false in HTTP-first mode | When true, the ready message must include session_id. |
| handshake_timeout_seconds | 30 | Seconds Coval waits for the ready message. |
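With the default message_type_path of type, a direct-mode ready message carries type set to session_ready and, when handshake_requires_session_id is true, a session_id field.
If you change message_type_path, Coval uses that same path to find the ready-message type.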
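A ready-message check along these lines can help when testing an agent locally. This is a sketch under assumptions: the function name is hypothetical, and session_id is assumed to sit at the top level of the message.

```python
import json


def is_ready_message(
    raw: str,
    ready_type: str = "session_ready",
    message_type_path: str = "type",
    requires_session_id: bool = True,
) -> bool:
    """Return True if raw is a JSON ready message matching the handshake settings."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(message, dict):
        return False

    # Walk the same dot-notation path used for ordinary message types.
    current = message
    for key in message_type_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    if current != ready_type:
        return False

    # Assumption: session_id is a non-empty top-level field.
    if requires_session_id and not message.get("session_id"):
        return False
    return True


# Direct-mode defaults: type is session_ready and session_id is required.
example = '{"type": "session_ready", "session_id": "abc-123"}'
ok = is_ready_message(example)
```

With handshake_ready_message_type set to an empty string there is no wait at all, so a check like this would simply not run.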
Non-audio event capture
Many voice agents emit side-events alongside the audio stream: cart updates, transcript fragments, session telemetry. By default, Coval ignores non-audio JSON messages. To tell Coval which message types to accept, set non_audio_event_message_types. Each accepted message is captured with:
- event_type: the value at message_type_path (for example system_notify).
- event_name: the optional event field from the payload (for example ocb:cart-updated).
- payload: the full parsed JSON message.
For example, if message_type_path is action and non_audio_event_message_types includes system_notify, an inbound message whose action is system_notify is accepted as a non-audio event.
Accepted events are recorded as websocket_event entries. Transcript-based metrics, including LLM judge metrics, see JSON that includes event_type, event_name, and the full payload, so they can evaluate structured side-channel data such as cart contents, selected menu items, modifiers, quantities, and prices alongside the spoken conversation.
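The acceptance rule can be sketched as a filter over inbound JSON messages. The function name capture_event and the cart payload are hypothetical; only the three output fields (event_type, event_name, payload) come from the description above.

```python
import json
from typing import Optional, Set


def capture_event(raw: str, message_type_path: str, accepted_types: Set[str]) -> Optional[dict]:
    """Return an event record for an accepted message, or None if it is ignored."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        return None

    # Resolve the message type with the configured dot-notation path.
    current = message
    for key in message_type_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]

    # Only configured message types are accepted; everything else is ignored.
    if not isinstance(current, str) or current not in accepted_types:
        return None
    return {
        "event_type": current,
        "event_name": message.get("event"),  # optional field
        "payload": message,                  # full parsed JSON message
    }


# Hypothetical inbound message from an ordering agent.
inbound = json.dumps({
    "action": "system_notify",
    "event": "ocb:cart-updated",
    "cart": {"items": [{"name": "Latte", "quantity": 2}]},
})
event = capture_event(inbound, "action", {"system_notify"})
```

Here event carries event_type system_notify and event_name ocb:cart-updated, while a message with any other action value returns None.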
Media (image) frames
Voice WebSocket simulations can attach images from a test case mid-conversation. send_media_template controls the outbound shape:
- {{media_data}} is required. {{media_name}} and {{mime_type}} are optional placeholders.
- If the template is exactly {{media_data}}, Coval sends raw bytes.
- Otherwise, Coval base64-encodes the image and substitutes it into your JSON template.
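The template rules above can be sketched as follows. The function name and the JSON template are hypothetical, and escaping the name and MIME type for JSON is an assumption about sensible behavior rather than documented behavior.

```python
import base64
import json
from typing import Union

# Hypothetical send_media_template; field names are illustrative only.
MEDIA_TEMPLATE = (
    '{"type":"image","name":"{{media_name}}",'
    '"mime":"{{mime_type}}","data":"{{media_data}}"}'
)


def render_media_frame(
    template: str, image: bytes, media_name: str = "", mime_type: str = ""
) -> Union[bytes, str]:
    """Return raw bytes for the bare placeholder, otherwise a filled JSON text frame."""
    if "{{media_data}}" not in template:
        raise ValueError("send_media_template must contain {{media_data}}")
    if template == "{{media_data}}":
        return image

    encoded = base64.b64encode(image).decode("ascii")
    # json.dumps(...)[1:-1] escapes quotes and backslashes for use inside a JSON string.
    safe_name = json.dumps(media_name)[1:-1]
    safe_mime = json.dumps(mime_type)[1:-1]
    return (
        template.replace("{{media_data}}", encoded)
        .replace("{{media_name}}", safe_name)
        .replace("{{mime_type}}", safe_mime)
    )


frame = render_media_frame(MEDIA_TEMPLATE, b"\x89PNG", "menu.png", "image/png")
parsed = json.loads(frame)
```

Including {{media_name}} and {{mime_type}} in the template is only needed when the agent expects file metadata alongside the image bytes.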
Examples
Initialization payload:
JSON audio preset
The agent UI ships a JSON audio preset that fills the metadata for JSON audio WebSocket agents. It applies the audio_message / audio_bytes JSON shape and the Linear PCM format listed under Audio transport.
Set authorization_header to Bearer <ACCESS_TOKEN> after picking the preset if the agent requires auth (most production endpoints do).
Setup
- Prepare the agent endpoint. Confirm that the wss:// endpoint is reachable and that the audio format matches the configuration above, and decide whether the agent requires Bearer auth.
- Create the agent in Coval. Open the Agents page in your Coval org, choose WebSocket as the connection type, and either fill the fields manually or apply the JSON audio preset.
- Smoke test. Build a small test set with a single voice persona and run a simulation. The transcript should show alternating turns, the result page should expose usable audio, and any configured side-events should be available to transcript-based metrics.
How simulations work
- Coval performs any HTTP-first setup, then opens the WebSocket with any configured Bearer token or custom headers.
- If handshake_ready_message_type is set, Coval waits for the ready message before sending audio.
- Coval streams persona audio outward using send_audio_template at the configured sample rate: raw PCM bytes for {{audio_data}}, or JSON text frames for any JSON template.
- Inbound binary frames or matching JSON audio frames are decoded and resampled if needed.
- Inbound non-audio JSON messages whose type is in non_audio_event_message_types are accepted; unconfigured non-audio messages are ignored.
- When the persona finishes, Coval closes the WebSocket cleanly.
Troubleshooting
Empty transcript with audio frames flowing. Check that audio_message_type_value matches the value the agent sends at message_type_path, that audio_data_path points at the base64 payload, and that audio_encoding matches the wire format.
Inbound audio sounds half-speed or distorted. Confirm receive_audio_channels. JSON PCM that arrives mono should be configured with receive_audio_channels: 1; the historical default 2 averages two channels and halves the apparent rate when the source is mono.
Cart events / status messages look ignored. Add the action value to non_audio_event_message_types. Without it, Coval ignores non-audio JSON messages.
Auth failures during handshake. Verify the authorization_header value, or move the token to a ?token=... query string when the agent only supports browser-style auth.
Connection refused locally. Tunnel the agent’s ws:// server through ngrok or Cloudflare Tunnel and use the resulting wss:// URL as the agent endpoint.
If the tunnel provides an https:// URL, use the corresponding wss:// URL in Coval. Update the agent configuration when the tunnel URL changes, or use a reserved tunnel domain for a stable endpoint.
Unreadable audio or media payloads. For JSON audio/media templates, Coval substitutes base64 data into {{audio_data}} / {{media_data}}; for raw templates, the agent must expect raw PCM or media bytes. Verify the JSON is valid, the configured message fields match the agent payload, audio_encoding is correct, and send_media_template includes {{media_name}} / {{mime_type}} when the agent needs file metadata.
Timeouts or no response. Confirm the agent keeps the WebSocket open for the whole conversation, processes incoming audio frames without blocking, sends audio responses in the configured shape, and logs initialization / ready messages while testing.
Best practices
- Pick the JSON audio preset (or a similar named preset) instead of hand-filling fields when one exists. It keeps the metadata canonical for the agent shape.
- Mirror the agent’s sample rate exactly in send_sample_rate_hertz / receive_sample_rate_hertz. Resampling is supported but degrades audio.
- Capture the side-events you care about by adding their action values to non_audio_event_message_types. Don’t silently rely on the agent emitting them.
- Keep the agent’s WebSocket handler long-lived and avoid closing the connection while the simulation is active.
- Log initialization payloads, ready messages, and payload parsing errors during initial setup.
- Rotate Bearer tokens on a schedule; Coval re-reads the value at every connection setup.

