# Coval

> Coval is the reliability infrastructure for conversational AI agents. Evaluate voice and text AI agents by running simulated conversations and measuring performance with metrics. Supports inbound voice, outbound voice, chat, SMS, WebSocket, Pipecat, and LiveKit agent types.

## Getting Started

- [Welcome](https://docs.coval.ai/getting-started/welcome): What Coval does and who it's built for
- [Quick Start](https://docs.coval.ai/getting-started/quick-start): Set up your first evaluation in 5 minutes
- [GitHub Actions](https://docs.coval.ai/getting-started/github-actions-tutorial): CI/CD integration for automated evaluations on every PR

## Concepts

- [Agents](https://docs.coval.ai/concepts/agents/overview): Connect voice and chat AI agents to Coval for evaluation
- [Mutations](https://docs.coval.ai/concepts/agents/mutations): Test agent configuration variants side-by-side (A/B testing)
- [Attributes](https://docs.coval.ai/concepts/attributes/overview): Tag and filter resources with custom attributes
- [Personas](https://docs.coval.ai/concepts/personas/overview): Configure simulated callers with voice, accent, behavior, and background noise
- [Test Sets](https://docs.coval.ai/concepts/test-sets/overview): Define scenarios, transcripts, scripts, and expected behaviors for evaluation
- [Metrics](https://docs.coval.ai/concepts/metrics/overview): Measure agent performance with LLM judges, audio analysis, regex, and tool call checks
- [Built-in Metrics](https://docs.coval.ai/concepts/metrics/built-in-metrics): Pre-built metrics for latency, interruptions, sentiment, call resolution, and audio quality
- [Metric Prompting](https://docs.coval.ai/concepts/metrics/prompting): Writing effective LLM judge prompts and expected behaviors
- [Metric Chaining](https://docs.coval.ai/concepts/metrics/MetricChaining): Chain metrics for complex multi-step evaluation logic
- [Human Review](https://docs.coval.ai/concepts/metrics/human-review/human-review): Human-in-the-loop review
for metric calibration
- [Templates](https://docs.coval.ai/concepts/templates/overview): Reusable evaluation configs bundling agent, test set, persona, and metrics
- [Simulations](https://docs.coval.ai/concepts/simulations/overview): Launch and analyze simulated conversations between your agent and personas
- [Multi-Run Analysis](https://docs.coval.ai/concepts/simulations/multi-run-analysis): Compare results across multiple evaluation runs
- [OpenTelemetry Traces](https://docs.coval.ai/concepts/simulations/traces/opentelemetry): Correlate evaluation results with production traces

## Agent Connections

- [Inbound Voice](https://docs.coval.ai/concepts/agents/connections/inbound-voice): Test agents that receive incoming phone calls
- [Outbound Voice](https://docs.coval.ai/concepts/agents/connections/outbound-voice): Test agents that make outbound calls
- [Chat (OpenAI Endpoint)](https://docs.coval.ai/concepts/agents/connections/openai-endpoint): Connect OpenAI-compatible chat APIs
- [Chat WebSocket](https://docs.coval.ai/concepts/agents/connections/chat-websocket): Text chat over persistent WebSocket connections
- [Pipecat](https://docs.coval.ai/concepts/agents/connections/pipecat): Integrate with Pipecat Cloud agents
- [LiveKit](https://docs.coval.ai/concepts/agents/connections/livekit): Real-time communication platform integration
- [WebSocket](https://docs.coval.ai/concepts/agents/connections/websocket): Generic WebSocket agent connection

## Observability

- [Dashboards](https://docs.coval.ai/concepts/dashboard/overview): Visualize performance trends across evaluation runs
- [Monitoring](https://docs.coval.ai/concepts/monitoring/overview): Evaluate production conversations with live monitoring
- [Trace Search](https://docs.coval.ai/concepts/monitoring/trace-search): Search across all traced calls with natural language queries, structured filters, and Transition Hotspots
- [Improving Metrics with Human
Review](https://docs.coval.ai/guides/improving-metrics-with-human-review): Calibrate metrics using human feedback loops

## Guides

- [Inbound Voice Simulations](https://docs.coval.ai/guides/simulations/inbound-voice): Step-by-step guide for inbound voice evaluations
- [Chat Simulations](https://docs.coval.ai/guides/simulations/chat): Step-by-step guide for chat agent evaluations
- [SMS Simulations](https://docs.coval.ai/guides/simulations/sms): Step-by-step guide for SMS agent evaluations
- [Outbound Voice](https://docs.coval.ai/guides/outbound-voice): Guide for outbound voice agent testing
- [API Keys](https://docs.coval.ai/guides/api-keys): Managing API keys for programmatic access
- [Scheduled Runs](https://docs.coval.ai/guides/scheduled-runs): Set up recurring automated evaluations
- [Observability](https://docs.coval.ai/guides/observability): OpenTelemetry traces and production monitoring setup
- [Human Review API](https://docs.coval.ai/guides/human-review-api): Human-in-the-loop review workflows via API

## CLI

- [Overview](https://docs.coval.ai/cli/overview): Command-line interface for evaluation, scripting, and CI/CD
- [Installation](https://docs.coval.ai/cli/installation): Install via Homebrew, Cargo, or binary download
- [Agents](https://docs.coval.ai/cli/agents): Create, list, update, delete agents
- [Runs](https://docs.coval.ai/cli/runs): Launch evaluations, watch progress, view results
- [Simulations](https://docs.coval.ai/cli/simulations): Inspect individual simulation results and download audio
- [Test Sets](https://docs.coval.ai/cli/test-sets): Manage test set collections
- [Test Cases](https://docs.coval.ai/cli/test-cases): Define evaluation inputs, expected outputs, and bulk import
- [Personas](https://docs.coval.ai/cli/personas): Configure simulated callers with voice and behavior
- [Metrics](https://docs.coval.ai/cli/metrics): Define scoring criteria (LLM judge, audio, regex, tool call)
- [Mutations](https://docs.coval.ai/cli/mutations): Manage
agent configuration variants
- [Run Templates](https://docs.coval.ai/cli/run-templates): Save reusable evaluation configurations
- [Scheduled Runs](https://docs.coval.ai/cli/scheduled-runs): Schedule recurring evaluations with cron expressions
- [Dashboards](https://docs.coval.ai/cli/dashboards): Create dashboards and widgets from the CLI
- [API Keys](https://docs.coval.ai/cli/api-keys): Manage API keys
- [Human Review](https://docs.coval.ai/cli/human-review): Human review projects and annotations

## AI Agents

- [Evaluations for Agents](https://docs.coval.ai/agents/overview): Give AI coding agents the tools and knowledge to evaluate AI quality via Skills, MCP, CLI, or API
- [Guided Onboarding](https://docs.coval.ai/agents/onboarding): Run /onboard to set up a complete evaluation interactively, from connecting your agent to viewing results
- [Agent Skills](https://docs.coval.ai/agents/skills): Install evaluation expertise into your AI coding agent with one command (npx skills add coval-ai/coval-external-skills)

## MCP Server

- [Overview](https://docs.coval.ai/mcp/overview): Model Context Protocol server for LLM tool access
- [Installation](https://docs.coval.ai/mcp/installation): Set up MCP server for Claude, Cursor, and other clients
- [Tools](https://docs.coval.ai/mcp/tools): Available MCP tools reference
- [Beginner's Guide](https://docs.coval.ai/mcp/beginners-guide): Getting started with MCP and Coval

## API Reference

- [Introduction](https://docs.coval.ai/api-reference/v1/introduction): Authentication, base URL, pagination, filtering, and error codes
- [OpenAPI Specs](https://api.coval.dev/v1/openapi): Machine-readable API specifications (15 resource specs)

## Optional

- [Use Cases](https://docs.coval.ai/use-cases/overview): Example evaluation scenarios by industry
- [Leveraging Test Users](https://docs.coval.ai/use-cases/leveraging-test-users): Using test user data for better evaluations
- [Hackathons](https://docs.coval.ai/collaborate/hackathons/overview):
Community events and collaboration

---

# Full Documentation

---

## Welcome to Coval

Source: https://docs.coval.ai/welcome

Coval is the reliability infrastructure for conversational AI agents.

![](/images/overview-coval.jpg)

## Why Coval?

Welcome to Coval - your enterprise-grade conversational agent testing and observability platform.

We help **developers**, **QA teams**, and **enterprise operations teams** confidently test, evaluate, and monitor voice and chat agents, before and after they go live.

Whether you’re building a new agent, debugging a regression, or comparing vendors, Coval gives you a full-stack platform to simulate real conversations, track production quality, and ship agents you trust.

**Built for**:

- **Developers & QA** – Automate testing across hundreds of scenarios with realistic voice inputs, IVRs, and edge cases
- **Enterprise Agent Ops** – Monitor multiple vendors, standardize metrics, and scale performance evaluation across use cases
- **Teams shipping fast** – Integrate Coval into your CI/CD or agent workflows to catch issues early, simulate regressions, and ship with confidence

## **How Coval Provides Value**

Coval accelerates AI agent development with automated testing for chat, voice, and other objective-oriented systems.

Instead of calling your agent manually, simulate your end customers in realistic settings with background noise and accents, and test a variety of scenarios automatically. Coval provides the tools to automatically generate and execute a wide range of test scenarios, ensuring no aspect of your agent's behavior is overlooked.

Run GitHub Actions on any change and simulate outcomes, or run automated evaluation on a schedule. We also integrate with partners across the voice stack to embed Coval into your workflows.

Assure your customers of high-quality results at scale by running evals on production calls.
Smart alerting notifies you of critical issues and anomalies so you can debug early and scale your agent fast.

## Start building

Follow the "Key Concepts" & "Guides" sections to set up your evals & monitoring. Alternatively, use our API to easily manage and integrate your Coval agents.

- [API Reference](/api-reference/v1/introduction): Manage evaluation, monitoring, metrics and simulators with the Coval API.

## Keep in touch with us

Join the Coval community to ask questions, discuss best practices, and share tips.

- [Slack Community](https://forms.gle/frTn8eCHkcTfw67p8)
- [LinkedIn](https://www.linkedin.com/company/covaldev/posts/?feedView=all)
- [X (Twitter)](https://x.com/covaldev)

---

## Agents

Source: https://docs.coval.ai/concepts/agents/overview

Connect your voice & chat agents to Coval to launch simulated conversations

Each agent configuration acts as a reusable connection profile that can be referenced across multiple simulations, evaluations, and conversations without requiring reconfiguration.

## How to Configure Agents

### Adding an Agent

[Image: Connect Agent Demo]

1. Navigate to the Agents section in your dashboard
2. Click "Add New Agent"
3. Configure the connection parameters:
   - **Endpoint URL**: API endpoint for your agent service
   - **Phone Number**: For voice-based agents requiring telephony access
   - **Authentication**: API keys or authentication tokens as required
4. Set operational parameters:
   - **Language Preferences**: Primary and fallback language configurations
   - **Agent Behavior Prompts**: System prompts or behavioral guidelines
   - **Simulator Types**: Compatible simulation environments

### Attributes

In your agents, you can set specific attributes associated with that agent. For example, if you have multiple agents representing different restaurant reservation services, you could define attributes such as "opening_hours" and "menu_items".
You can embed these agent attributes into [test case scenarios](/concepts/test-sets/overview#utilizing-agent-attributes) or [metric prompts](/concepts/metrics/prompting#using-agent-attributes-and-test-case-attributes) by inserting `{{agent.attribute_name}}`. In the example above, you could create a metric that asks:

```markdown
Did the agent give the correct opening hours? Opening hours are `{{agent.opening_hours}}`
```

or use the attribute in a test case:

```markdown
Order two items from this list: {{agent.menu_items}}
```

## Workflow

Each agent can have a **workflow**: a visual graph that maps out the conversation steps and decision points your agent is designed to handle. Workflows help you document intended behavior, identify gaps in test coverage, and communicate how your agent should move through a conversation.

See [Workflow](/concepts/agents/workflow) for details on creating, generating, and editing workflows.

## Knowledge Base

Each agent can have a **knowledge base**: a set of reference documents (FAQs, policies, product docs) that LLM metrics can use as a source of truth when evaluating your agent's responses. Attaching accurate reference content lets Coval catch hallucinations, contradictions, and gaps that transcript-only evaluation can miss.

See [Knowledge Base](/concepts/agents/knowledge-base) for details on supported source types, how to add entries, and how to enable KB context on metrics.

## Test Your Connection

Once your agent is configured and saved, use the **Test connection** button in the agent config page to verify connectivity before running simulations.

**What gets tested:**

- **WebSocket agents**: Coval performs the WebSocket handshake and reports whether the connection was established successfully.
- **HTTP / OpenAI-compatible agents**: Coval sends the initialization request and reports the response.

**Reading the result:**

- **Success**: A green confirmation appears and auto-dismisses after 60 seconds.
- **Failure**: A red alert stays visible with the error details. Expand the result to see the raw JSON response for debugging.

The button is disabled when you have unsaved changes. Save first, then test.

## Connect Your Agent

- [Inbound Voice](/concepts/agents/connections/inbound-voice): Receive incoming phone calls for customer service scenarios
- [Outbound Voice](/concepts/agents/connections/outbound-voice): Make calls to users for sales and scheduling
- [OpenAI Endpoint](/concepts/agents/connections/openai-endpoint): Connect OpenAI-compatible chat APIs
- [Chat WebSocket](/concepts/agents/connections/chat-websocket): Text chat over persistent WebSocket connections
- [OpenAI Realtime](/concepts/agents/connections/openai-realtime): Voice-to-voice agents on the OpenAI Realtime API
- [Gemini Live](/concepts/agents/connections/gemini-live): Voice-to-voice agents on Google Gemini Live
- [Pipecat Cloud](/concepts/agents/connections/pipecat): Integrate with Pipecat Cloud agents
- [LiveKit](/concepts/agents/connections/livekit): Advanced real-time communication platform

---

## Workflow

Source: https://docs.coval.ai/concepts/agents/workflow

Visualize and define your agent's conversation flow as a structured graph of nodes and edges

A **workflow** is a visual map of your agent's conversation logic: a directed graph that shows the steps your agent can take, the decisions it makes, and how one state transitions to the next.

Workflows live inside an agent configuration. They don't change how your agent behaves at runtime. Instead, they give Coval (and you) a structured model of the intended conversation flow that can power simulation guidance, test coverage analysis, and documentation.

## Why Use a Workflow

Without a workflow, simulations run without a reference model of the intended conversation structure.
With one, Coval can:

- Guide simulations to follow expected paths and decision points
- Identify gaps in test coverage (steps that aren't exercised)
- Provide a visual reference for how your agent should handle conversations
- Catch deviations from expected behavior during evaluation

**Common use cases:**

- Document the intended flow for complex multi-step interactions
- Ensure test cases cover all critical decision branches
- Onboard new team members with a visual conversation map
- Validate that simulations are exercising the full conversation tree

## What a Workflow Looks Like

A workflow is made up of:

- **Nodes**: individual steps or states in the conversation (e.g., _Greet Caller_, _Verify Identity_, _Handle Cancellation Request_, _Escalate to Human_)
- **Edges**: directed connections between nodes that represent valid transitions, optionally labeled with conditions (e.g., _"Confirmed"_, _"Declined"_, _"Timeout"_)

Together, nodes and edges form a DAG (directed acyclic graph) representing your agent's expected conversation paths. Loops are flagged with a warning, since each step should lead forward.

## Adding a Workflow to an Agent

Navigate to your agent's configuration page and open the **Workflow** tab.

### Generate from Agent Description

If your agent has a system prompt configured, click **Generate** to automatically create a workflow from it. Coval uses your agent's name and system prompt to infer the conversation steps and decision points.

If your agent doesn't have a system prompt yet, you'll be prompted to describe the workflow manually before generating.

You can also provide **additional requirements** in the generation dialog, for example:

> "Include an escalation path when the caller asks for a supervisor"

### Edit an Existing Workflow

Once a workflow exists, you can refine it using two tabs:

- **Edit tab**: describe a change in plain language and click **Apply**.
Use the quick-action buttons to add decision points, process steps, error paths, or edge labels.
- **JSON tab**: edit the workflow graph directly as JSON for precise control. Coval validates the structure on save.

Use **Undo** to revert to the previous version at any point.

### Expand the Canvas

Click the expand icon to open the workflow in a full-screen modal for a better view of complex graphs.

## Workflow Data Format

A workflow is stored as JSON with two top-level arrays:

```json
{
  "nodes": [
    { "id": "greet", "data": { "label": "Greet Caller", "description": "Agent introduces itself" } },
    { "id": "verify", "data": { "label": "Verify Identity", "description": "Confirm caller account" } },
    { "id": "resolve", "data": { "label": "Resolve Issue", "description": "Address the caller's request" } },
    { "id": "close", "data": { "label": "Close Call", "description": "Summarize and end the conversation" } }
  ],
  "edges": [
    { "id": "e1", "source": "greet", "target": "verify" },
    { "id": "e2", "source": "verify", "target": "resolve", "label": "Verified" },
    { "id": "e3", "source": "verify", "target": "close", "label": "Unverified" },
    { "id": "e4", "source": "resolve", "target": "close" }
  ]
}
```

| Field | Description |
|---|---|
| `nodes[].id` | Unique identifier for the step |
| `nodes[].data.label` | Display name shown on the canvas |
| `nodes[].data.description` | Optional description of what happens at this step |
| `edges[].source` | ID of the originating node |
| `edges[].target` | ID of the destination node |
| `edges[].label` | Optional condition label for the transition |

## Workflow Validation

Coval validates workflows on save and flags common issues:

- **Cycles / loops**: a node that transitions back to an earlier step. Each step should lead forward in the conversation.
- **Missing nodes**: an edge references a node ID that doesn't exist.
- **Disconnected nodes**: nodes with no incoming or outgoing edges.
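
The three checks listed here can be approximated locally before pasting JSON into the editor. The following is a minimal sketch, not Coval's validator: the function name `validate_workflow` and the warning strings are hypothetical, and "disconnected" is read as a node that appears in no edge at all.

```python
from collections import defaultdict


def validate_workflow(workflow: dict) -> list[str]:
    """Return warnings for cycles, missing nodes, and disconnected nodes.

    Hypothetical helper for illustration; not part of any Coval SDK.
    """
    node_ids = [n["id"] for n in workflow.get("nodes", [])]
    known = set(node_ids)
    edges = workflow.get("edges", [])
    warnings = []

    # Missing nodes: an edge endpoint that is not a declared node ID.
    for edge in edges:
        for end in ("source", "target"):
            if edge[end] not in known:
                warnings.append(
                    f"missing node: edge {edge.get('id', '?')} references '{edge[end]}'"
                )

    # Disconnected nodes (assumed meaning): the node appears in no edge.
    touched = {e["source"] for e in edges} | {e["target"] for e in edges}
    for nid in node_ids:
        if nid not in touched:
            warnings.append(f"disconnected node: '{nid}'")

    # Cycles: depth-first search; reaching a node that is still being
    # visited means some edge leads back to an earlier step.
    graph = defaultdict(list)
    for edge in edges:
        if edge["source"] in known and edge["target"] in known:
            graph[edge["source"]].append(edge["target"])
    state = {}  # node id -> "visiting" or "done"

    def has_cycle(nid: str) -> bool:
        state[nid] = "visiting"
        for nxt in graph[nid]:
            if state.get(nxt) == "visiting":
                return True
            if nxt not in state and has_cycle(nxt):
                return True
        state[nid] = "done"
        return False

    if any(has_cycle(n) for n in node_ids if n not in state):
        warnings.append("cycle: at least one edge leads back to an earlier step")
    return warnings


# A deliberately broken workflow that triggers all three warnings.
workflow = {
    "nodes": [{"id": "greet"}, {"id": "verify"}, {"id": "orphan"}],
    "edges": [
        {"id": "e1", "source": "greet", "target": "verify"},
        {"id": "e2", "source": "verify", "target": "greet"},
        {"id": "e3", "source": "verify", "target": "close"},
    ],
}
for warning in validate_workflow(workflow):
    print(warning)
```

Returning a list of warnings rather than raising mirrors the documented behavior that a workflow with warnings can still be saved.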
Warnings appear as banners on the canvas and in the JSON editor. You can still save a workflow with warnings, but review them before using the workflow to guide simulations.

---

## Knowledge Base

Source: https://docs.coval.ai/concepts/agents/knowledge-base

Attach reference documents and content to your agent so metrics can evaluate against authoritative sources

A **knowledge base** is a collection of reference documents attached to an agent. When enabled on a metric, the knowledge base is provided as context to the LLM evaluator, allowing it to check whether your agent's responses are accurate against a specific source of truth rather than relying on the model's general knowledge.

## Why Use a Knowledge Base

Without a knowledge base, an LLM metric can only reason about what it observes in the conversation transcript. With one, it can cross-reference agent responses against your actual documentation, catching factually incorrect answers, hallucinations, and gaps in coverage.

**Common use cases:**

- Verify that agents answer questions using approved FAQ or policy content
- Detect when an agent contradicts product documentation
- Track accuracy across different knowledge sources (e.g., a billing policy vs. a returns policy)
- Ensure compliance with healthcare, legal, or regulatory information

## Adding Knowledge Base Entries

Knowledge base entries are configured per agent, on the agent's **Knowledge Base** tab.

1. Navigate to your agent's configuration page
2. Open the **Knowledge Base** tab
3. Click **Add Knowledge Base Entry**
4. Choose a source type (see below)
5. Give the entry a descriptive name (e.g., `Hotel FAQ`, `Returns Policy`)
6. Optionally add tags for organization
7. Click **Upload** or **Save**

All entries are associated with the agent and become available as context for metrics.
### Supported Source Types

| Type | Description |
|---|---|
| `plain_text` | Paste or type text content directly |
| `json` | Structured JSON data |
| `file` | Upload a `.txt`, `.pdf`, or `.docx` file. Coval extracts and indexes the content automatically |
| `web_url` | Reference a URL (content is stored as a reference; not fetched/indexed automatically) |
| `zendesk` | Zendesk Help Center article reference |
| `shelf` | Shelf knowledge base reference |

> **Tip:** For best results with LLM evaluation, use `plain_text` or `file` entries. These are fully chunked and embedded, making them the most reliable source for semantic retrieval during evaluation.

## How It Works

When you upload a `plain_text`, `json`, or `file` entry, Coval automatically:

1. Extracts the text content (parsing PDFs and DOCX files as needed)
2. Splits the content into overlapping chunks
3. Embeds each chunk using a vector embedding model
4. Stores the chunks indexed to the agent

At evaluation time, when a metric has **Knowledge Base** enabled, the relevant chunks are retrieved and included in the LLM evaluator's context alongside the conversation transcript.

## Using Knowledge Base in Metrics

Once you've added entries to your agent's knowledge base, you can enable KB context on any LLM Judge metric (Binary, Numerical, or Categorical) or on Composite Evaluation metrics.

**To enable for a metric:**

1. Open the metric configuration
2. Scroll to the **Knowledge Base** toggle
3. Enable it

> **Warning:** If you don't enable the Knowledge Base toggle on a metric, the metric will evaluate without KB context and may produce inaccurate results even when the entries are configured on the agent.

See [Knowledge Base Metrics](/concepts/metrics/prompting#knowledge-base-metrics) for prompt writing guidance and examples.
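
To make the "overlapping chunks" step under How It Works concrete, here is a minimal sketch of character-window chunking. It is an illustration only: the function name `chunk_text`, the window size, and the overlap are assumptions, and Coval's actual chunking strategy and parameters are not documented here.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters.

    Illustrative only; sizes are arbitrary and not Coval's real settings.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


policy = (
    "Returns are accepted within 30 days. "
    "Refunds are issued to the original payment method."
)
for index, chunk in enumerate(chunk_text(policy, chunk_size=40, overlap=10)):
    print(index, repr(chunk))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk when it is shorter than the overlap, which is the usual motivation for overlapping chunks in retrieval pipelines.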
## Generating Test Cases from Knowledge Base Documents

If you upload a Conversation Design Document (CDD), a structured file describing your agent's expected conversation flows, Coval can automatically extract test scenarios from it and generate a test set.

After uploading a CDD entry, click **Generate Test Cases** to run the extraction. Coval uses multiple prompting strategies to pull out scenarios, conditions, and expected responses, then creates a linked test set you can use in simulations.

---

## Agent Mutations

Source: https://docs.coval.ai/concepts/agents/mutations

Test agent configuration variants side-by-side in a single evaluation

Agent mutations let you create configuration variants of an agent and run them alongside the base agent in a single evaluation. Instead of manually creating separate agents for each configuration you want to test, mutations override specific fields and produce side-by-side results so you can compare performance directly.

## How Mutations Work

Every agent has a base configuration: model, system prompt, temperature, voice, API key, and other settings depending on the agent type. A mutation overrides one or more of these fields while leaving the rest unchanged.

When you launch an evaluation with mutations selected, Coval runs the base agent **plus** each selected mutation as separate simulations. The results appear in the same run, grouped by variant, so you can compare metrics across configurations.

> **Info:** The base agent always runs. Mutations are additional variants that run alongside it.

### Deep Merge

Mutation overrides are deep-merged with the base agent configuration. For nested objects (like custom headers), this means individual keys are overridden while sibling keys are preserved.
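
The merge rule described here can be sketched as a short recursive function. This is an illustration of deep-merge semantics, not Coval's implementation: `deep_merge` is a hypothetical name, and replacing lists wholesale is an assumption, since the documentation only specifies behavior for nested objects and scalars.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` is merged into `base` key by key.

    Hypothetical helper; lists and scalars are replaced, dicts are merged.
    """
    merged = copy.deepcopy(base)  # never mutate the base configuration
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge recursively, keeping sibling keys.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalar, list, or type mismatch: the override wins outright.
            merged[key] = copy.deepcopy(value)
    return merged


base = {"custom_sip_headers": {"X-Session-Id": "abc", "X-Routing": "prod"}}
override = {"custom_sip_headers": {"X-Session-Id": "override-123"}}
print(deep_merge(base, override))
```

Copying the base before merging keeps the base agent's configuration intact, which matters because the base agent always runs alongside its mutations.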
```json
// Base agent metadata
{ "custom_sip_headers": { "X-Session-Id": "abc", "X-Routing": "prod" } }

// Mutation override
{ "custom_sip_headers": { "X-Session-Id": "override-123" } }

// Merged result at runtime
{ "custom_sip_headers": { "X-Session-Id": "override-123", "X-Routing": "prod" } }
```

For scalar values (strings, numbers), the mutation value replaces the base value entirely.

## Supported Agent Types

Mutations are available for agent types that have overridable configuration fields:

| Agent Type | Overridable Fields |
|---|---|
| **WebSocket** | Endpoint, initialization JSON, authorization header, custom headers |
| **Inbound Voice (SIP)** | Custom SIP headers |

## Creating Mutations

1. Navigate to your agent's detail page
2. Scroll to the **Mutations** section
3. Click **Create**

### Mutation Form

Each mutation requires:

- **Name** -- A descriptive label (e.g., "High Temperature", "French Language", "GPT-4 Turbo")
- **Description** -- Optional context about what this variant tests

Then select which fields to override. Only fields defined for your agent type are available. For each selected field, enter the override value using the appropriate editor:

- **Text input** for simple string or number fields
- **Dropdown** for fields with predefined options (e.g., model selection, voice)
- **JSON editor** for structured data fields (e.g., custom headers)

> **Tip:** Only override the fields you want to test. Fewer overrides make results easier to interpret.

## Using Mutations in Evaluations

When launching an evaluation or creating a template:

1. Select a single agent
2. The **Agent Mutations** selector appears if the agent has active mutations
3. Check the mutations you want to include

The base agent is always included. Each selected mutation creates additional simulations in the same run.
### Viewing Results

After the evaluation completes:

- The **Mutation** column in the results table shows which variant produced each result
- Use the **filter dropdown** to isolate results by mutation
- On the **Dashboard**, use the "Agent Mutation" dimension to split charts by variant

This lets you directly compare metrics like latency, accuracy, or custom scores across your base agent and each mutation.

## Common Use Cases

**Model comparison** -- Create mutations for different models (GPT-4o vs GPT-4 Turbo vs Claude) to benchmark quality and latency.

**Prompt tuning** -- Test system prompt variations to find the wording that produces the best metric scores.

**Temperature sweeps** -- Run the same agent at different temperature values to find the optimal balance of creativity and consistency.

**Voice testing** -- Compare different TTS voices or providers to find the best fit for your use case.

**Header overrides** -- For SIP-based voice agents, test different routing headers or session metadata per variant.

## Managing Mutations

### Editing

Click the edit icon on any mutation card to modify its name, description, or overrides.

### Deleting

Click the delete icon to archive a mutation. If the mutation is referenced by active templates, you will see a warning listing those templates. You can cancel or force-delete, which causes those templates to skip the deleted mutation on future runs.

> **Warning:** Archived mutations cannot be restored. Create a new mutation if you need the same configuration again.

## API and CLI

Mutations can also be managed programmatically:

- [REST API](/api-reference/v1/mutations/mutations/list-mutations): Create, list, update, and delete mutations via the v1 API
- [CLI](/cli/mutations): Manage mutations from the command line

---

## Attributes

Source: https://docs.coval.ai/concepts/attributes/overview

Use dynamic attributes from agents, test cases, and simulations in your metric prompts and test scenarios.
You can embed dynamic values from agents, test cases, and simulations into your metric prompts and test scenarios using template variables. The system automatically replaces these placeholders with actual values during evaluation.

## Supported Sources

The template system supports three main sources:

- **`{{agent.*}}`** - References agent attributes
- **`{{test_case.*}}`** - References test case attributes
- **`{{test_case.expected_behaviors}}`** - References test case criteria (used by Composite Evaluation)

## Basic Usage

The simplest form is accessing a top-level attribute:

```
{{agent.attribute_name}}
{{test_case.attribute_name}}
{{test_case.expected_behaviors}} //For criteria in your test case (used by Composite Evaluation)
```

**Example:** Imagine one agent has the attribute **location** with value "San Francisco", and another agent has value "London".

```
Scenario: You are a user calling for travel recommendations in {{agent.location}}
Criterion: The agent should only give travel recommendations in {{agent.location}}
```

## Nested Paths

You can access nested attributes using dot notation:

```
{{agent.users.sam.flight_number}} → agent.attributes["users"]["sam"]["flight_number"]
```

**Example:** If your agent has a nested structure:

```json
{
  "users": {
    "sam": {
      "flight_number": "UA123",
      "email": "sam@example.com"
    }
  }
}
```

You can access it in your prompt:

```
Did the agent identify the flight as: {{agent.users.sam.flight_number}}.
```

## Array Indexing

Access specific elements in arrays using bracket notation:

```
{{test_case.test[0]}} → test_case.attributes["test"][0]
```

**Example:** If your test case has an array:

```json
{
  "flight_options": ["United Airlines", "Delta", "American Airlines"]
}
```

You can reference specific flights:

```
The first flight option is {{test_case.flight_options[0]}}.
The assistant should mention {{test_case.flight_options[1]}} as an alternative.
```

## Array Access Without Indexing

Access entire arrays without indexing - they'll be returned as strings:

```
{{test_case.test}} → test_case.expected_output_json["test"] (entire array as string)
{{agent.items}} → agent.attributes["items"] (entire array as string)
```

**Example:**

```
The available options are: {{test_case.flight_options}}
```

## Dynamic Keys via Multi-Pass Resolution

The system supports dynamic key resolution through multiple passes, allowing you to use one template variable to determine another:

```
{{agent.users.{{test_case.username}}.email}}
```

**How it works:**

1. First pass: Resolves `{{test_case.username}}` → "user1"
2. Second pass: Resolves `{{agent.users.user1.email}}` → "user1@example.com"

**Example:** If your test case specifies a username:

```json
{
  "username": "user1"
}
```

And your agent has user-specific data:

```json
{
  "users": {
    "user1": {
      "email": "user1@example.com",
      "tier": "premium"
    }
  }
}
```

You can use dynamic resolution:

```
The user {{test_case.username}} has email {{agent.users.{{test_case.username}}.email}} and is a {{agent.users.{{test_case.username}}.tier}} member.
```

## Complete Example

Here's a comprehensive example combining multiple features:

**Agent attributes:**

```json
{
  "location": "San Francisco",
  "users": {
    "sam": {
      "tier": "premium",
      "perks": ["early_checkin", "room_upgrade"]
    }
  }
}
```

**Test case:**

```json
{
  "username": "sam",
  "requested_perks": ["early_checkin"]
}
```

**Metric prompt:**

```
Given the transcript, did the assistant properly handle the reservation request?
Hotel Location: {{agent.location}} Customer: {{test_case.username}} Customer Tier: {{agent.users.{{test_case.username}}.tier}} Available Perks: {{agent.users.{{test_case.username}}.perks}} Requested Perks: {{test_case.requested_perks}} Return YES if: • The assistant confirmed the reservation is for {{agent.location}} • The assistant recognized {{test_case.username}} as a {{agent.users.{{test_case.username}}.tier}} member • The assistant mentioned available perks: {{agent.users.{{test_case.username}}.perks}} • The assistant processed the requested perk: {{test_case.requested_perks[0]}} Return NO if: • The assistant provided incorrect location information • The assistant failed to recognize the customer's tier status • The assistant couldn't access the requested perk information ``` ## Use Cases ### In Metric Prompts Attributes are commonly used in metric prompts to create context-aware evaluations. See [Metric Prompting](/concepts/metrics/prompting) for examples of using attributes in evaluation metrics. ### In Test Scenarios You can embed agent attributes directly into test case scenarios and expected behaviors. This allows the same test set to work with different agents that have different attribute values. See [Test Sets](/concepts/test-sets/overview) for more information. ### In Criteria Use attributes in criteria definitions to create dynamic validation that adapts to the specific agent or test case being evaluated. These criteria are used by the **Composite Evaluation** metric. --- ## Inbound Voice Source: https://docs.coval.ai/concepts/agents/connections/inbound-voice Configure agents that receive incoming phone calls from users ## Overview Inbound Voice connections simulate users calling your agent through traditional telephony. 
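The `phone_number` field described under Configuration Requirements accepts either an E.164 number or a SIP address. A quick local check before saving the agent can catch formatting mistakes early. The sketch below is only an approximation: the two patterns and the `classify_agent_address` helper are assumptions made for illustration, not Coval's actual validation rules.

```python
import re

# Assumed patterns for illustration only; Coval's real validation may differ.
# E.164: a plus sign, a non-zero leading digit, at most 15 digits in total.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")
# Simplified SIP URI: scheme, user part, "@", host part, no whitespace.
SIP_URI_PATTERN = re.compile(r"^sips?:[^\s@]+@[^\s@]+$")


def classify_agent_address(value: str) -> str:
    """Return "e164", "sip", or "invalid" for a candidate phone_number value."""
    candidate = value.strip()
    if E164_PATTERN.match(candidate):
        return "e164"
    if SIP_URI_PATTERN.match(candidate):
        return "sip"
    return "invalid"
```

A result of `"sip"` is also the case in which the Wideband Audio toggle becomes relevant.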
## Configuration Requirements ### Phone Number - **Field**: `phone_number` - **Type**: String (required) - **Format**: E.164 international format with country code, or a SIP address - **Examples**: `+12345678901`, `sip:agent@example.com` - **Validation**: Must be a valid phone number or SIP URI ### Wideband Audio When your agent is configured with a **SIP address**, a **Wideband Audio (16kHz)** toggle appears in the agent configuration page. Enabling this toggle switches the audio encoding from the default narrowband codec (G.711 mu-law at 8kHz) to wideband L16 PCM at 16kHz, providing higher quality audio. #### When to use wideband audio Wideband audio is beneficial when your agent uses a SIP address (e.g., `sip:agent@example.com`) and supports a wideband codec. In this configuration the audio is 16kHz end-to-end, giving a meaningful quality improvement. This toggle only appears for SIP connections because PSTN phone numbers are limited to narrowband audio (8kHz G.711) on the carrier leg. Even if wideband encoding were used on the Coval side, the PSTN leg would still be narrowband, so there is no actual quality benefit. #### How to enable it 1. In the agent configuration page, select a **SIP address** as your connection type 2. A **Wideband Audio (16kHz)** toggle will appear 3. Enable the toggle to use wideband audio > **Note:** When the wideband toggle is off (the default), standard narrowband PCMU encoding is used. Existing agents are unaffected unless you explicitly enable wideband audio. Note that some PSTN carriers may transcode audio, which can reduce quality regardless of the codec setting. ### Setup Instructions 1. Select "Inbound Voice" as the connection type 2. Enter your agent's phone number in E.164 format or SIP address 3. (Optional) If using a SIP address, enable the **Wideband Audio (16kHz)** toggle for higher quality audio 4. Save and test the configuration ## Technical Details **Call Flow** 1. Coval initiates call to configured phone number 2. 
Your agent answers and begins conversation 3. Audio processed in real-time 4. Session continues until completion or timeout ## Troubleshooting **Common Issues:** - **Invalid Phone Number Format**: Ensure E.164 format with country code - **Call Connection Failures**: Verify number is active and accessible - **Poor Audio Quality**: Check telephony provider settings. For SIP agents, consider enabling the **Wideband Audio (16kHz)** toggle - **Timeout Issues**: Adjust agent response timing configuration --- ## Inbound Voice Simulations Source: https://docs.coval.ai/guides/simulations/inbound-voice Simulate inbound calls to test your voice agent ## Overview To simulate inbound calls, add your agent's phone number or SIP address in an evaluation template. **Requirements**: Add your agent's phone number or SIP address in a Coval Template. ## How It Works When you launch an evaluation with an inbound voice agent: 1. Coval's simulated user calls your agent's phone number 2. The conversation follows the test set scenarios you've defined 3. The simulated user behaves according to the persona you've configured 4. Metrics are automatically evaluated after the call completes ## Identifying Simulation Calls Coval includes a custom SIP header in the outgoing call: ``` X-Coval-Simulation-Id: ``` This header lets you correlate incoming calls with their corresponding simulation runs in Coval. > **Note:** This header is carried inside the SIP signaling layer. Whether your agent can read it depends on how the call is routed — see below. ### When the Header Is Available **SIP trunking** — If your agent receives calls via a SIP trunk (e.g., Twilio SIP Trunking, Telnyx SIP Trunking, or your own SBC), the custom header is preserved end-to-end and your telephony provider will surface it in the inbound call event. **Regular phone numbers (PSTN)** — If your agent receives calls on a standard phone number, the call routes through the public telephone network. 
PSTN carriers strip non-standard SIP headers, so `X-Coval-Simulation-Id` will **not** be available to your application. You can still identify simulation calls by the calling phone number or by matching timestamps in the Coval dashboard. ### Retrieving the Header (SIP Trunking) How you access custom SIP headers depends on your telephony provider. **Twilio** Twilio exposes custom SIP headers as parameters prefixed with `SipHeader_` on incoming call webhooks. The simulation ID will be available as `SipHeader_X-Coval-Simulation-Id` in the request parameters sent to your webhook. See [Twilio's SIP documentation](https://www.twilio.com/docs/voice/api/sip-interface) for more details. > **Note:** **Using Twilio Programmable Voice over a regular phone number (PSTN)?** Standard phone numbers route over the public telephone network, which strips SIP headers — `SipHeader_X-Coval-Simulation-Id` will not be available. Use the `pre_call_webhook_url` approach instead: Coval will notify your agent of the simulation ID before each call. See the [Twilio ConversationRelay guide](/guides/simulations/twilio-conversationrelay) for setup instructions. **Telnyx** Telnyx includes custom SIP headers in the `sip_headers` array on the `call.initiated` webhook event. Look for the header with the name `X-Coval-Simulation-Id` in that array. See [Telnyx's SIP documentation](https://developers.telnyx.com/docs/voice/sip-trunking) for more details. ## Firewall & IP Allowlist When Coval places a simulation call to your agent, the call arrives from specific IP addresses. If your telephony infrastructure has firewall rules, access control lists (ACLs), or security groups that restrict which IPs can send SIP traffic, you must allowlist the IPs below — otherwise simulation calls will be silently dropped before they reach your agent. > **Note:** This only applies if you receive calls on your own SIP infrastructure (SBC, IP-PBX, or SIP trunk endpoint) and restrict inbound traffic by IP address. 
If your agent receives calls on a regular phone number through a cloud provider like Twilio or Telnyx, calls arrive through your provider's normal flow and no allowlisting is needed. Two types of traffic need to be allowed: ### Signaling IPs (SIP) SIP signaling is how calls are initiated (the SIP INVITE that starts the call, plus responses). Allow these IPs on ports **UDP/TCP 5060** and **TLS 5061**. | Region | Primary | Secondary | |--------|---------|-----------| | US | 192.76.120.10 | 64.16.250.10 | | Europe | 185.246.41.140 | 185.246.41.141 | | Australia | 103.115.244.145 | 103.115.244.146 | | Canada | 192.76.120.31 | 64.16.250.13 | | Asia (Beta) | 103.115.244.158 | 103.115.244.159 | ### Media IP Subnets (RTP) RTP carries the actual audio once the call is connected. Allow these subnets on **UDP ports 16384–32768**. ``` 36.255.198.128/25 50.114.136.128/25 50.114.144.0/21 64.16.226.0/24 64.16.227.0/24 64.16.228.0/24 64.16.229.0/24 64.16.230.0/24 64.16.248.0/24 64.16.249.0/24 103.115.244.128/25 185.246.41.128/25 ``` > **Tip:** If simulation calls are failing with no audio or immediate hangups, a missing IP allowlist entry is a common cause. Verify that both the signaling IPs and media subnets are allowed in your firewall. ## Setup Steps 1. Navigate to your Template configuration 2. Add your agent's phone number or SIP address in the agent connection settings 3. Configure your test sets with scenarios you want to test 4. Select the personas for your simulated callers 5. Choose the metrics you want to evaluate 6. 
Launch your evaluation ## Best Practices - Use realistic phone numbers that your agent can receive calls on - Test with different personas to cover various customer types - Include both happy path and edge case scenarios in your test sets - Monitor latency and interruption metrics for voice quality --- ## Twilio ConversationRelay + OTel Traces Source: https://docs.coval.ai/guides/simulations/twilio-conversationrelay Add OpenTelemetry tracing to a Twilio Programmable Voice agent and correlate spans with Coval simulations. ## Overview [Twilio ConversationRelay](https://www.twilio.com/docs/voice/conversation-relay) lets you connect a Twilio Programmable Voice call to a WebSocket server that handles STT → LLM → TTS in real time. This guide covers how to: 1. Build an OTel span tree from ConversationRelay events and export it to Coval 2. Correlate traces with Coval simulation runs despite Twilio PSTN stripping SIP headers For a complete working implementation, see the [coval-examples Twilio agent](https://github.com/coval-ai/coval-examples/tree/main/voice-agents/twilio) on GitHub. ## Walkthrough [Video: Loom Video](https://www.loom.com/embed/47cdbd14d9e441c094a1e18deb01f81e) ## The PSTN limitation When Coval places a simulation call to your agent, it normally passes the simulation output ID as a custom SIP header: ``` X-Coval-Simulation-Id: ``` This works for agents using SIP trunking (Telnyx, custom SBCs) where the SIP signaling layer is preserved end-to-end. Twilio Programmable Voice, however, routes calls through the public telephone network (PSTN). PSTN carriers strip non-standard SIP headers, so `X-Coval-Simulation-Id` never reaches your application. ## Solution: pre_call_webhook_url Coval supports an alternative correlation mechanism for agents where SIP headers are unavailable. Configure `pre_call_webhook_url` on your agent and Coval will POST the simulation output ID to your agent **before** dialing, giving it a chance to stash the ID before the call connects. 
The webhook is called once per simulation, immediately before the outbound call is placed. It receives:

```json
{
  "simulation_output_id": "",
  "from_number": "+16504471573",
  "to_number": "+15105077509"
}
```

`from_number` is the caller ID Coval will dial from. `to_number` is your agent's phone number. Use these to correlate the webhook with the incoming call, which is especially useful when running multiple agent replicas behind a load balancer.

Your agent queues this ID, then pops it when the next call arrives.

## Configure your Twilio phone number

Before Coval can place simulation calls to your agent, your Twilio phone number must be configured to route inbound calls to your agent's webhook.

1. Open the [Twilio Console](https://console.twilio.com) and navigate to **Phone Numbers → Manage → Active Numbers**.
2. Click the phone number you want to use for simulations.
3. Scroll to the **Voice Configuration** section.
4. Set **Configure with** to **Webhook, TwiML Bin, Function, Studio Flow, Proxy Service**.
5. Under **A call comes in**, select **Webhook** and enter your agent's webhook URL (e.g. `https://your-agent.fly.dev/webhook`). Leave HTTP method as **HTTP POST**.
6. Save the configuration.

> **Tip:** The `/webhook` endpoint is the entry point for all inbound calls. When Twilio receives a call on your number, it POSTs to this URL and expects TwiML in response, typically a `<Connect><ConversationRelay>` instruction pointing at your WebSocket handler.
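The TwiML response mentioned in the tip above can be sketched as follows. This is a minimal illustration that builds the XML with the standard library; the helper name `build_conversationrelay_twiml` and the example `wss://your-agent.fly.dev/ws` URL are assumptions, and a real agent would return this string from its `/webhook` handler with an XML content type.

```python
import xml.etree.ElementTree as ET


def build_conversationrelay_twiml(websocket_url: str) -> str:
    """Build a TwiML document that connects the call to a ConversationRelay WebSocket."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The url attribute points Twilio at the agent's WebSocket handler.
    ET.SubElement(connect, "ConversationRelay", {"url": websocket_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body
```

Building the document with an XML library rather than string formatting means the WebSocket URL is escaped correctly if it ever contains characters such as `&`.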
## Coval agent configuration

In the Coval dashboard, open your agent's settings and set the following in the agent metadata:

```json
{
  "pre_call_webhook_url": "https://your-agent.fly.dev/register-simulation",
  "pre_call_webhook_headers": {"x-api-key": ""}
}
```

| Field | Description |
|-------|-------------|
| `pre_call_webhook_url` | The URL Coval will POST to before each simulation call |
| `pre_call_webhook_headers` | Optional headers to include; use this to authenticate Coval's request to your agent |

> **Tip:** Use `COVAL_API_KEY` (your Coval API key) as the value for `x-api-key` and validate it in your `/register-simulation` handler. This prevents other callers from pre-registering IDs.

## Agent implementation

### /register-simulation endpoint

Add an endpoint that accepts Coval's pre-call notification and queues the simulation ID:

```python
import os
import time
from collections import deque
from typing import Optional

from fastapi import FastAPI, Header, HTTPException, Request
from fastapi.responses import JSONResponse

app = FastAPI()
COVAL_API_KEY = os.environ.get("COVAL_API_KEY", "")

# Queue of (simulation_output_id, from_number, registered_at) tuples
_pending_sim_ids: deque[tuple[str, str, float]] = deque()
_SIM_ID_TTL_SECONDS = 300  # expire after 5 minutes


def _pop_pending_sim_id(caller_id: str = "") -> Optional[str]:
    """Return the simulation ID matching the caller, or the oldest non-expired one."""
    now = time.time()
    # Purge expired entries (oldest entries sit at the left of the queue)
    while _pending_sim_ids and now - _pending_sim_ids[0][2] > _SIM_ID_TTL_SECONDS:
        _pending_sim_ids.popleft()
    # Try to match by caller ID first
    for i, (sim_id, from_number, _) in enumerate(_pending_sim_ids):
        if from_number and caller_id and from_number == caller_id:
            del _pending_sim_ids[i]
            return sim_id
    # Fall back to FIFO if no caller ID match
    if _pending_sim_ids:
        sim_id, _, _ = _pending_sim_ids.popleft()
        return sim_id
    return None


@app.post("/register-simulation")
async def register_simulation(
    request: Request,
    x_api_key: str = Header(default=""),
):
    # Reject requests unless the shared key is configured and matches
    if not COVAL_API_KEY or x_api_key != COVAL_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    body = await request.json()
    simulation_output_id = body.get("simulation_output_id", "")
    if not simulation_output_id:
        raise HTTPException(status_code=400, detail="simulation_output_id is required")
    from_number = body.get("from_number", "")
    _pending_sim_ids.append((simulation_output_id, from_number, time.time()))
    return JSONResponse({"status": "ok", "queued": len(_pending_sim_ids)})
```

### Reading the simulation ID on call arrival

In your ConversationRelay WebSocket handler, pop the pending ID when the `"setup"` event arrives:

```python
import json

from fastapi import WebSocket

# Continues the module above: app, Optional, and _pop_pending_sim_id are defined there.


@app.websocket("/ws")
async def conversationrelay_websocket(websocket: WebSocket):
    await websocket.accept()
    simulation_id: Optional[str] = None
    async for raw_message in websocket.iter_text():
        event = json.loads(raw_message)
        event_type = event.get("type", "")
        if event_type in ("setup", "connected"):
            # Match by caller ID from the webhook registration
            caller_id = event.get("from", "")
            simulation_id = _pop_pending_sim_id(caller_id)
        elif event_type == "prompt":
            voice_prompt = event.get("voicePrompt", "")
            # ... call LLM, stream response back to Twilio ...
```

> **Tip:** For multi-replica deployments (Kubernetes, auto-scaling groups), store the pending simulation IDs in a shared datastore (e.g. Redis, your application database) instead of an in-memory queue. The `from_number` field lets any replica match the incoming call to the correct simulation, regardless of which replica received the webhook.
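The shared-datastore idea in the tip above can be sketched as a small store interface keyed by `from_number`. Everything here is hypothetical: the `InMemoryPendingStore` class is a dict-backed stand-in used only to show the shape of the `put` and `pop` operations, and a real multi-replica deployment would back the same two methods with Redis or a database table. Keying by caller number also assumes at most one pending simulation per `from_number` at a time.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryPendingStore:
    """Dict-backed stand-in for a shared store of pending simulation IDs.

    Hypothetical sketch: replace the dict with a shared datastore so that
    every replica sees the same entries.
    """

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.time):
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        # from_number -> (simulation_output_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, from_number: str, simulation_output_id: str) -> None:
        """Record the ID announced by the pre-call webhook."""
        expires_at = self._clock() + self._ttl_seconds
        self._entries[from_number] = (simulation_output_id, expires_at)

    def pop(self, from_number: str) -> Optional[str]:
        """Return and remove the ID for this caller, or None if absent or expired."""
        entry = self._entries.pop(from_number, None)
        if entry is None:
            return None
        simulation_output_id, expires_at = entry
        if self._clock() > expires_at:
            return None
        return simulation_output_id
```

The webhook handler would call `put` and the WebSocket handler would call `pop` with the caller ID from the `"setup"` event, so the two steps no longer need to land on the same replica.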
### Exporting traces after the call

When the WebSocket closes, build OTLP spans from your turn log and POST them to Coval:

```python
import httpx

# COVAL_API_KEY is the module-level value read from the environment above.
COVAL_TRACES_URL = "https://api.coval.dev/v1/traces"


def _send_spans(spans: list[dict], simulation_id: str) -> None:
    """POST the spans to Coval as a single OTLP/JSON resourceSpans payload."""
    payload = {
        "resourceSpans": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": "twilio-voice-agent"}}
                    ]
                },
                "scopeSpans": [
                    {
                        "scope": {"name": "twilio-voice-agent"},
                        "spans": spans,
                    }
                ],
            }
        ]
    }
    httpx.post(
        COVAL_TRACES_URL,
        json=payload,
        headers={
            "x-api-key": COVAL_API_KEY,
            "X-Simulation-Id": simulation_id,
        },
        timeout=30,
    )
```

Call `_send_spans` in the `finally` block of your WebSocket handler, after the call ends:

```python
# Fragment: this belongs to the try/finally that wraps the message loop in the
# WebSocket handler. turns, call_start_epoch_seconds, and _build_spans_from_turns
# are defined elsewhere in your handler.
finally:
    if simulation_id and turns:
        call_duration_seconds = time.time() - call_start_epoch_seconds
        spans = _build_spans_from_turns(turns, call_start_epoch_seconds, call_duration_seconds)
        _send_spans(spans, simulation_id)
```

## Async audio attachment (multi-replica deployments)

Twilio Programmable Voice recording URLs are not retrievable at call end: Twilio finalizes the recording asynchronously, typically ~60 seconds later. If your agent runs as a single long-lived process you can simply wait, but in a multi-replica Kubernetes or Fly.io deployment the agent container may be terminated before the URL becomes available.

For those deployments, split conversation submission across two API calls:

1. **At call end**, `POST /v1/conversations:submit` with the transcript only. You immediately get a `conversation_id` and text-only metrics start running. Flush your buffered OTel spans with this `conversation_id` so traces correlate with the conversation.
2. **When the recording URL is ready**, `PATCH /v1/conversations/{conversation_id}` with the `audio_url`. Audio-dependent metrics then run as a second wave.

Each wave delivers a separate webhook. Configure your consumer to expect both.
```bash
# Step 1: submit transcript at call end (returns conversation_id immediately)
# The x-api-key header uses double quotes so the shell expands $COVAL_API_KEY.
curl -X POST https://api.coval.dev/v1/conversations:submit \
  -H "x-api-key: $COVAL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{ "transcript": [...], "external_conversation_id": "CAxxxxx" }'

# Step 2: attach audio once the Twilio recording URL is ready
curl -X PATCH "https://api.coval.dev/v1/conversations/${CONVERSATION_ID}" \
  -H "x-api-key: $COVAL_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"audio_url": "https://api.twilio.com/.../Recordings/RExxxxx.wav"}'
```

Audio can only be attached once per conversation; a second `PATCH /v1/conversations/{conversation_id}` returns `409 ALREADY_EXISTS`.

## Trace limitations

ConversationRelay abstracts STT and TTS away from your application code entirely: you receive transcribed text in `"prompt"` events and send text tokens back, and Twilio handles the rest. This means several span values **cannot be measured** and must be approximated. These are architectural constraints of the ConversationRelay model, not implementation choices.

> **Warning:** The following trace values are **synthetic** when using Twilio ConversationRelay. Do not use them for latency analysis, benchmarking, or metric thresholds.

| Value | Why it must be synthetic |
|-------|--------------------------|
| `stt` → `metrics.ttfb` | Twilio performs speech recognition internally. Your application only receives the final transcribed text in a `"prompt"` WebSocket event, so there is no timestamp for when speech started or when transcription completed. |
| `stt` → `stt.confidence` | Twilio does not expose per-utterance ASR confidence scores through the ConversationRelay WebSocket API. Fixed at `0.95`. |
| `tts` → `metrics.ttfb` | Twilio converts your text tokens to audio internally. Your application has no visibility into when audio playback begins at the caller's end. Fixed at `0.1s`. |

The one value that **is** real: `llm` → `metrics.ttfb`.
Because your application makes the LLM API call directly, you can measure wall-clock time from when the `"prompt"` event arrives to when the first response token is sent back. This is the only latency signal from ConversationRelay traces worth trusting. **Practical implication:** Coval's built-in STT TTFB and TTS TTFB latency metrics will not reflect real performance for Twilio ConversationRelay agents. LLM TTFB metrics will. If you need real STT/TTS timing data, consider a framework where you control the STT and TTS API calls directly (e.g., Pipecat or LiveKit), which expose those timings to your instrumentation code. ## Span schema | Span | Key attributes | Notes | |------|----------------|-------| | `conversation` | `call.duration_seconds` | Root span | | `stt` | `transcript`, `metrics.ttfb` (**synthetic**), `stt.confidence` (**synthetic** 0.95) | One per user turn | | `stt.provider.twilio` | `stt.providerName`, `stt.confidence`, `metrics.ttfb` | Child of `stt` | | `llm` | `metrics.ttfb` (**real**), `llm.finish_reason` | One per assistant turn | | `tts` | `metrics.ttfb` (**synthetic** 0.1s) | One per assistant turn | | `llm_tool_call` | `function.name`, `tool_call_id`, `function.arguments` | When tools are invoked | | `tool_call_result` | `function.name`, `tool_call_id`, `tool.result` | Status = ERROR if tool returned an error | ## Viewing traces [Video: Loom Video](https://www.loom.com/embed/1926ce26bc9f485aaef61f57ff7a7f7a) After a simulation completes, an **OTel Traces** card appears in the metric grid on the result page. Click **View Traces** to open the trace viewer. If no traces appear, check: 1. `pre_call_webhook_url` is set on the Coval agent and points to the correct URL 2. Your `/register-simulation` endpoint is publicly accessible and returning `200 OK` 3. The `COVAL_API_KEY` in the `pre_call_webhook_headers` matches what your agent expects 4. 
`COVAL_API_KEY` is set in the agent environment (needed to export spans) ## Full example See the complete working implementation in [coval-examples/voice-agents/twilio](https://github.com/coval-ai/coval-examples/tree/main/voice-agents/twilio), which includes: - ConversationRelay WebSocket handler with interrupt support - Agentic LLM loop (tool calls → re-enter loop until `finish_reason = stop`) - Full span builder with real LLM TTFB measurement - Fly.io deployment configuration --- ## Outbound Voice Source: https://docs.coval.ai/concepts/agents/connections/outbound-voice Configure agents that initiate phone calls to users ## Overview Outbound Voice connections simulate your agent making calls to users. ## Configuration Requirements ### Trigger Call Endpoint - **Field**: `trigger_call_endpoint` - **Type**: String (required) - **Format**: Valid HTTPS URL - **Purpose**: The webhook endpoint where outbound calls are initiated - **Example**: `https://your-endpoint.com/webhook` ### Trigger Call Headers - **Field**: `trigger_call_headers` - **Type**: String (optional) - **Format**: Valid JSON string - **Purpose**: HTTP headers to include in trigger requests - **Example**: `{"Content-Type": "application/json", "Authorization": "Bearer token123"}` ### Phone Number Key - **Field**: `phone_number_key` - **Type**: String (optional) - **Default**: `"phone_number"` - **Purpose**: Key used to identify the phone number in the trigger payload - **Example**: `"phone_number"`, `"recipient_phone"`, `"target_number"` ### Trigger Call Payload - **Field**: `trigger_call_payload` - **Type**: String (optional) - **Format**: Valid JSON string - **Purpose**: Additional data to send with the trigger request - **Example**: `{"sequence_code": "ABC123", "campaign_id": "summer2024"}` ## Setup Instructions 1. Configure webhook endpoint URL that can receive HTTP POST requests 2. Set authentication headers as JSON string 3. Define phone number key and payload structure 4. 
Test webhook response and call initiation ## Technical Details **Call Initiation Flow** 1. Coval sends HTTP POST request to trigger endpoint 2. Your system receives webhook and initiates call 3. Agent connects to specified phone number 4. Conversation proceeds with real-time monitoring **Webhook Payload Structure** ```json { "phone_number": "+12345678901", "sequence_code": "ABC123", "campaign_id": "summer2024" } ``` ## Troubleshooting **Common Issues:** - **Webhook Not Triggering**: Verify endpoint URL and accessibility - **Invalid JSON Format**: Validate headers and payload syntax - **Authentication Failures**: Check authorization headers and tokens - **Call Connection Issues**: Verify phone number format and availability --- ## Outbound Voice Simulations Source: https://docs.coval.ai/guides/outbound-voice Coval's Outbound Voice Simulation feature enables you to test your voice agents by having Coval simulate realistic customer interactions. Instead of your agent calling a test phone number, Coval triggers your system to initiate an outbound call to our simulated customer, creating a more realistic testing environment that mirrors your production call flows. ### **How It Works** 1. **You configure** an endpoint that Coval can call to trigger outbound calls 2. **Coval starts** a simulation and sends a request to your trigger endpoint 3. **Your system** receives the request and initiates an outbound call to our simulated customer 4. **The simulation runs** with realistic customer responses based on your test scenarios 5. **Coval provides** detailed transcripts, recordings, and analysis of the interaction --- ## **Getting Started** ### **Prerequisites** Before setting up outbound voice simulations, ensure you have: - An endpoint that can receive HTTP POST requests - The ability to initiate outbound calls from your phone system - API authentication mechanism (recommended) - Phone system capable of dialing phone numbers ### **Quick Setup** 1. 
**Prepare Your Trigger Endpoint**
   - Create an endpoint that accepts POST requests
   - Implement authentication (API key, bearer token, etc.)
   - Add logic to extract phone number and initiate calls
2. **Configure in Coval**
   - Navigate to your simulation settings
   - Select "Outbound Voice" as the simulation type
   - Enter your trigger endpoint URL
   - Configure headers and payload format
   - Test the configuration
3. **Run Your First Simulation**
   - Create a test scenario
   - Start the simulation
   - Monitor the call initiation and interaction

## **Configuration Details**

### **Trigger Call Endpoint**

**Purpose**: The URL where Coval will send requests to trigger outbound calls from your system. This is a required field.

**Requirements**:

- Must be a valid HTTP/HTTPS URL
- Should be publicly accessible or allowlisted for Coval's IP ranges
- Must respond within 30 seconds
- Should return 2xx status codes for successful requests

**Example Configuration**:

```
https://api.yourcompany.com/triggers/voice-simulation
```

### **Trigger Call Headers**

Configure HTTP headers that Coval will include with every trigger request. This typically includes authentication and content type headers. This is a required field.

**Format**: Valid JSON object

```json
{
  "Content-Type": "application/json",
  "Authorization": "Bearer your-api-key-here",
  "X-Source": "coval-simulation"
}
```

**Common Headers**:

- `Authorization`: API keys, bearer tokens, or basic auth
- `Content-Type`: Usually `application/json`
- `X-API-Key`: Alternative authentication method
- Custom headers for routing or identification

### **Trigger Call Payload**

The base JSON payload that Coval will send to your endpoint. Coval automatically adds the phone number field to this payload. This is a required field.

**Format**: Valid JSON object

```json
{
  "campaign_id": "test-campaign-001",
  "priority": "high"
}
```

**Note**: The `phone_number` field will be automatically added by Coval.
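The way the phone number joins the base payload, and how a trigger endpoint might read it back, can be sketched as below. Both helpers are hypothetical illustrations of the behavior described above; Coval's actual merging logic is not shown in this guide, and `build_trigger_request_body` and `extract_phone_number` are names invented for the example.

```python
import json
from typing import Any, Dict


def build_trigger_request_body(
    base_payload: Dict[str, Any],
    phone_number: str,
    phone_number_key: str = "phone_number",
) -> str:
    """Approximate the request body: the base payload plus the phone number field."""
    body = dict(base_payload)  # copy so the configured payload is not mutated
    body[phone_number_key] = phone_number
    return json.dumps(body)


def extract_phone_number(raw_body: str, phone_number_key: str = "phone_number") -> str:
    """What a trigger endpoint might do: pull the number to dial from the JSON body."""
    payload = json.loads(raw_body)
    number = payload.get(phone_number_key)
    if not isinstance(number, str) or not number:
        raise ValueError(f"missing or empty {phone_number_key!r} in trigger payload")
    return number
```

Validating the field on the receiving side gives a clear error when the configured key and the key your system expects drift apart.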
### **Phone Number Key** Customize the field name used for the phone number in the payload. This allows integration with systems that expect different field names. **Default**: phone_number **Common Alternatives**: - `phoneNumber` - `phone` - `number` - `phoneNumberToCall` - `destination` **Example**: If your system expects `destination`, configure this field as `destination`, and the payload will include: ```json { "campaign_id": "test-campaign-001", "destination": "+1-555-123-4567" } ``` --- ## **Advanced Features** ### **Multi-Language Support** Coval supports simulations in multiple languages with native voice models: - **English** (Default): Standard US English voice and responses - **Spanish**: Latin American Spanish with appropriate cultural context - **French**: Standard French with proper pronunciation and idioms - **German**: Standard German with accurate grammar and expressions - **Arabic**: Modern Standard Arabic, transcripts rendered in native script with right-to-left alignment - **Hebrew**: Hebrew with native script and right-to-left alignment **Configuration**: Language is typically configured at the organization level or specified in simulation parameters. ### **Custom Voice Models** Configure specific voice characteristics for your simulations: - **Voice Provider**: Choose from multiple voice synthesis providers - **Voice Model**: Select specific voice models (multilingual, turbo, etc.) 
- **Voice ID**: Use specific voice identities for consistent testing ### **Simulation Behavior** Control how the simulated customer behaves during calls: - **Response Style**: Natural, conversational interactions with appropriate emotional responses - **Conversation Flow**: Realistic pauses, interruptions, and speaking patterns - **Scenario Adherence**: Follows predefined customer scenarios and objectives - **Language Consistency**: Maintains language and cultural context throughout the call --- ## OpenAI Endpoint Source: https://docs.coval.ai/concepts/agents/connections/openai-endpoint Connect to OpenAI-compatible chat completions APIs for text-based interactions ## Overview OpenAI Endpoint connections integrate with any OpenAI-compatible chat completions API, enabling text-based conversational testing. ## Configuration Requirements ### Chat Endpoint - **Field**: `chat_endpoint` - **Type**: String (required) - **Purpose**: URL endpoint for OpenAI-compatible chat completions API - **Format**: Valid HTTPS URL - **Example**: `https://api.openai.com/v1/chat/completions` ### Authentication Token - **Field**: `auth_token` - **Type**: String (required) - **Purpose**: API key or authentication token for the endpoint - **Format**: String (maximum 4KB length) - **Security**: Sensitive field - stored encrypted - **Example**: `sk-proj-abc123def456...` ### Model - **Field**: `model` - **Type**: String (optional) - **Default**: `"gpt-4o"` - **Purpose**: Specify which OpenAI model to use for completions **Available Models:** - **GPT-5 Series**: `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `gpt-5-chat` - **GPT-4.1 Family**: `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano` - **Reasoning Models**: `o1`, `o1-mini`, `o1-pro`, `o3-mini`, `o3-pro`, `o3`, `o4-mini` - **Current Generation**: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo` - **Legacy Models**: `gpt-4`, `gpt-3.5-turbo` ### Max Tokens - **Field**: `max_tokens` - **Type**: Number (optional) - **Purpose**: Maximum number of tokens to generate in 
responses - **Range**: Positive integer ≤ 100,000 - **Example**: `1000` ### System Prompt - **Field**: `system_prompt` - **Type**: String (optional) - **Purpose**: System-level instructions defining agent behavior and constraints - **Use Cases**: Role definition, behavior guidelines, response formatting - **Example**: `"You are a helpful customer service assistant. Always be polite and provide accurate information."` ### Temperature - **Field**: `temperature` - **Type**: Number (optional) - **Default**: `1.0` - **Purpose**: Controls randomness in AI responses - **Range**: 0.0 (deterministic) to 2.0 (very random) - **Example**: `0.7` ## Setup Instructions 1. Enter the complete URL for your OpenAI-compatible API 2. Input valid API key from your provider 3. Select model and configure token limits/temperature 4. Define system prompt for agent behavior ## Troubleshooting **Common Issues:** - **Authentication Failures**: Verify API key validity and permissions - **Invalid Endpoint**: Confirm URL format and API compatibility - **Model Not Available**: Check model name and provider support - **Rate Limit Errors**: Monitor API usage and implement backoff strategies --- ## Chat Simulations Source: https://docs.coval.ai/guides/simulations/chat Simulate text-based conversations with your chat agent ## Overview If you have chat agents or want to simulate your voice agent conversations without calls, you can use Coval the same way it's used for voice simulations by generating text conversations. Create test sets, define metrics, and configure templates to evaluate your chat agents automatically. **Requirements**: Provide a custom text endpoint that Coval can connect to ## Quick Start **Minimum Required Configuration:** 1. **Chat Endpoint** - The URL where your agent receives messages 2. **Authorization Header** - Authentication credentials for your API That's it! All other fields are optional and depend on your specific API requirements. 
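The two required settings map directly onto an HTTP request. The sketch below only constructs the request object without sending it, assuming a JSON body with a `messages` list in the chat completions style; the `build_chat_request` helper and the example endpoint are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    chat_endpoint: str,
    authorization: str,
    messages: List[Dict[str, str]],
) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the conversation so far."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        chat_endpoint,
        data=body,
        headers={
            # The full header value, e.g. "Bearer your-secret-token-here"
            "Authorization": authorization,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Replaying a request like this against your own endpoint is a quick way to confirm that it authenticates and returns JSON before running a simulation.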
--- ## Core Configuration ### Chat Endpoint (Required) The primary URL where Coval sends conversation messages during simulation. **Format:** Full HTTPS URL ``` https://api.yourdomain.com/chat ``` **Requirements:** - Must use HTTPS (HTTP will be auto-upgraded) - Cannot use local/private IP addresses - Must return JSON responses **Example:** ``` https://api.yourdomain.com/v1/chat ``` --- ### Authorization Header (Required) Authentication credentials sent with every API request. **Common Formats:** #### Bearer Token ``` Bearer your-secret-token-here ``` #### API Key ``` API-Key your-api-key-here ``` #### Custom Authorization ``` Custom-Auth your-custom-format ``` **UI Tip:** Use the dropdown to select your auth type, then paste your token/key. The system automatically formats the header correctly. --- ## Standard Protocol The standard integration for chat uses an HTTP, JSON-based API endpoint that you provide. When running a simulation, Coval's simulator, acting as a user, will connect to the endpoint to get responses from your agent. We use OpenAI's chat completions format, although we also support receiving responses in the OpenAI responses format. Query strings are not allowed in URLs. --- ## Optional Configuration ### Initialization Endpoint Called once before the conversation starts to set up session state. **When to use:** - Your API requires session initialization - You need to obtain a session ID or auth token - You want to pre-configure conversation context **Format:** Full HTTPS URL ``` https://api.yourdomain.com/init ``` **Example Response:** ```json { "sessionId": "abc-123-def", "userId": "user-456", "conversationId": "conv-789" } ``` **How it works:** 1. Coval calls the initialization endpoint with your [initialization payload](#initialization-payload) 2. 
The response is captured and made available to subsequent chat requests via template variables like `{{sessionId}}` or `{{init_response.conversationId}}` in your [input template](#input-template) and [custom headers](#custom-headers) > **Note:** The initialization endpoint is called **before** any chat messages are sent. The [input template](#input-template) is only used for chat requests — it does not affect the initialization call. --- ### Initialization Payload JSON payload sent to the initialization endpoint. **Format:** Valid JSON ```json { "user_id": "test-user", "config": { "language": "en", "temperature": 0.7 } } ``` **Template Variables Available:** | Variable | Description | Example Value | |----------|-------------|---------------| | `{{simulation_output_id}}` | Unique ID for this simulation | `"sim-abc-123"` | | `{{persona.field}}` | Data from persona metadata | `{{persona.user_id}}` | **Example with Variables:** ```json { "session_id": "{{simulation_output_id}}", "user_context": { "user_id": "{{persona.user_id}}", "preferences": {} } } ``` **Persona Integration:** To use `{{persona.field}}` variables, add `initialization_parameters` to your Persona's metadata: ```json { "initialization_parameters": { "user_id": "customer-123", "account_type": "premium" } } ``` Then reference in payload: ```json { "user_id": "{{persona.user_id}}", "account_type": "{{persona.account_type}}" } ``` --- ### Custom Data Additional JSON data included in every chat request (for APIs using standard payload format). **Format:** Valid JSON ```json { "metadata": { "source": "coval-evaluation", "version": "1.0" }, "context": { "department": "sales" } } ``` **How it's sent:** ```json { "messages": [...], "customData": { "metadata": {...}, "context": {...} } } ``` **Note:** Only used when NOT using `input_template`. If you use `input_template`, reference custom data with `{{custom_data.field}}` instead. 
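
To preview what an initialization payload looks like after substitution, the variable expansion can be approximated locally. This is a minimal sketch under stated assumptions: the real substitution happens inside Coval and its exact rules are not documented here, and `render` and `build_init_payload` are hypothetical helper names. It resolves `{{simulation_output_id}}` and `{{persona.field}}` (taken from the persona's `initialization_parameters`) inside string values.

```python
import json
import re

# Matches {{name}} or {{dotted.name}} placeholders.
VAR = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(value, variables):
    """Recursively replace {{name}} placeholders inside string values."""
    if isinstance(value, dict):
        return {k: render(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, variables) for v in value]
    if isinstance(value, str):
        # Unknown placeholders are left untouched so mistakes stay visible.
        return VAR.sub(lambda m: str(variables.get(m.group(1), m.group(0))), value)
    return value


def build_init_payload(template_json, simulation_output_id, persona_metadata):
    """Expand the documented variables in an initialization payload template."""
    variables = {"simulation_output_id": simulation_output_id}
    for key, val in persona_metadata.get("initialization_parameters", {}).items():
        variables[f"persona.{key}"] = val
    return render(json.loads(template_json), variables)


template = (
    '{"session_id": "{{simulation_output_id}}", '
    '"user_context": {"user_id": "{{persona.user_id}}", "preferences": {}}}'
)
persona = {"initialization_parameters": {"user_id": "customer-123", "account_type": "premium"}}
payload = build_init_payload(template, "sim-abc-123", persona)
```

Parsing the template as JSON first and substituting only inside string values keeps the result valid JSON even when a substituted value contains quotes.
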
--- ### Custom Headers Additional HTTP headers sent with every chat request, with support for dynamic values from the initialization response. **Format:** Valid JSON object with string keys and values ```json { "X-Session-ID": "{{sessionId}}", "X-User-ID": "{{init_response.user.id}}", "X-Custom-Header": "static-value" } ``` **Template Variables Available:** | Variable | Description | Example | |----------|-------------|---------| | `{{sessionId}}` | Session ID from init response | Extracted from `init_response.sessionId` | | `{{simulation_output_id}}` | Unique simulation ID | Generated by Coval | | `{{init_response.path}}` | Any nested field from init response | `{{init_response.auth.token}}` | **Common Use Cases:** **Use Case 1: Session ID in Header** ```json { "X-Session-ID": "{{sessionId}}" } ``` **Use Case 2: Nested Auth Token** ```json { "X-Auth-Token": "{{init_response.auth.token}}" } ``` **Use Case 3: Mixed Static and Dynamic** ```json { "X-Session-ID": "{{sessionId}}", "X-API-Version": "v2", "X-Simulation-ID": "{{simulation_output_id}}" } ``` **Field Size Limit:** 16kB maximum --- ## Chat Messages Your chat endpoint should be an HTTPS URL that will respond to POST requests with a JSON body. If an Authorization token was provided, it will be included in the headers. **Initial Request from Coval:** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" } ] } ``` **Expected Response Format:** Standard format: ```json { "messages": [ { "role": "assistant", "content": "Response to the initial text" } ] } ``` Or in the newer Responses format: ```json { "messages": [ { "role": "assistant", "content": [ { "type": "text_output", "text": "Thanks for contacting us" } ] } ] } ``` --- ## Advanced Configuration ### Response Format Determines the format for tool call responses sent to your API. 
**Options:** #### Chat Completions (Default) Standard OpenAI format for tool responses: ```json { "role": "tool", "content": "result", "tool_call_id": "call_123" } ``` #### Responses API Alternative format for function call outputs: ```json { "type": "function_call_output", "call_id": "call_123", "output": "result" } ``` Configure the response format by adding `response_format` to your model configuration: ```json { "response_format": "responses_api", "chat_endpoint": "https://api.your-company.com/chat", "authorization_header": "Bearer your-api-key" } ``` On each following turn, Coval sends the entire chat history to your endpoint in the specified format: **Chat Completions (Default):** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" }, { "role": "assistant", "content": "Thanks for contacting us" }, { "role": "user", "content": "When will my order arrive?" } ] } ``` **Responses API:** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" }, { "role": "assistant", "content": [ { "type": "text_output", "text": "Thanks for contacting us" } ] }, { "role": "user", "content": "When will my order arrive?" } ] } ``` **When to use:** Only change this if your API explicitly requires the Responses API format for tool calls. Most APIs use the default Chat Completions format. --- ### Tool Calls You can include tool/function calls in the Responses format: ```json { "messages": [ { "type": "function_call", "name": "get_order_date", "arguments": "{\"shipment_id\": \"xx555\"}" }, { "role": "assistant", "content": [ { "type": "text_output", "text": "Your order should arrive next Tuesday" } ] } ] } ``` --- ### Payload Wrapper Wraps the entire payload in a specified field name. **When to use:** Your API requires all payloads nested under a specific key (e.g., `data`, `request`, `body`). 
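
The effect of the wrapper on the standard request body can be sketched as follows. This is an illustration that assumes the standard payload shape shown in this guide (`sessionId`, `customData`, `messages`); `build_chat_payload` is a hypothetical helper, not part of any Coval SDK.

```python
def build_chat_payload(session_id, messages, custom_data=None, wrapper=None):
    """Assemble the standard chat request body, optionally nested under a wrapper key."""
    payload = {
        "sessionId": session_id,
        "customData": custom_data or {},
        "messages": messages,
    }
    # With a wrapper such as "data", the whole payload moves under that key.
    return {wrapper: payload} if wrapper else payload


messages = [{"role": "user", "content": "Initial reach out text"}]
plain = build_chat_payload("XXXX", messages)
wrapped = build_chat_payload("XXXX", messages, wrapper="data")
```
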
**Example:** **Without wrapper:** ```json { "messages": [...], "customData": {...} } ``` **With wrapper set to `"data"`:** ```json { "data": { "messages": [...], "customData": {...} } } ``` **Common Values:** - `data` - `request` - `body` - `payload` --- ### Input Template Completely customize the JSON payload sent to your **chat endpoint** on each conversation turn. > **Note:** The input template is **not** used for the [initialization endpoint](#initialization-endpoint) — that call always uses the [initialization payload](#initialization-payload). The simulation flow is: > > 1. Coval calls your initialization endpoint with the initialization payload > 2. The init response is captured > 3. For each chat turn, Coval uses the input template to build the request to your chat endpoint — and you can reference fields from the init response (e.g. `{{init_response.conversation_id}}`) **When to use:** - Your API expects a non-standard payload format - You need to include specific fields from initialization response - You want fine-grained control over the request structure **Format:** JSON with template variable placeholders **Available Template Variables:** | Variable | Type | Description | |----------|------|-------------| | `{{messages}}` | Array | Full conversation history | | `{{latest_message}}` | String | Most recent user message content | | `{{sessionId}}` | String | Session ID (from init response or simulation ID) | | `{{simulation_output_id}}` | String | Unique simulation identifier | | `{{custom_data}}` | Object | The custom data object | | `{{custom_data.field}}` | Any | Specific field from custom data | | `{{any.nested.path}}` | Any | Extract any field from init response using dot notation | **Example Templates:** **Example 1: Simple Custom Format** ```json { "user_input": "{{latest_message}}", "session_id": "{{sessionId}}", "context": {{custom_data}} } ``` **Example 2: Nested Init Response Fields** ```json { "messages": {{messages}}, "user_id": 
"{{init_response.user.id}}", "conversation_id": "{{init_response.conversation.id}}", "api_key": "{{init_response.auth.api_key}}" } ``` **Example 3: String Substitution** ```json { "input": "User said: {{latest_message}}", "session": "{{sessionId}}", "metadata": { "source": "coval", "user": "{{custom_data.user_id}}" } } ``` **Note:** When using `input_template`, the `custom_data` field is ignored. Reference custom data using `{{custom_data}}` or `{{custom_data.field}}` in your template instead. > **Warning:** **Quoting rules for template variables:** > > - **Object/Array variables** (`{{messages}}`, `{{custom_data}}`) substitute to valid JSON literals — do **not** wrap them in quotes. > - **String variables** (`{{sessionId}}`, `{{latest_message}}`, `{{init_response.*}}`) substitute to plain text — you **must** wrap them in quotes. > > For example, `"conversation_id": {{init_response.conversation.id}}` produces invalid JSON because the substituted value is not quoted. Use `"conversation_id": "{{init_response.conversation.id}}"` instead. --- ### Response Message Path Tells Coval where to find the assistant's message in your API response using dot notation. **When to use:** Your API returns a non-standard response format. 
**Default Behavior (when not set):** Expects response in this format: ```json { "messages": [ { "role": "assistant", "content": "The response text" } ] } ``` **Custom Path Examples:** **Example 1: Direct Field** ``` Response Message Path: output_message ``` Extracts from: ```json { "output_message": "The assistant response", "metadata": {...} } ``` **Example 2: Nested Object** ``` Response Message Path: data.response.text ``` Extracts from: ```json { "data": { "response": { "text": "The assistant response" } } } ``` **Example 3: Array Index** ``` Response Message Path: choices.0.message.content ``` Extracts from: ```json { "choices": [ { "message": { "content": "The assistant response" } } ] } ``` **Path Notation Rules:** - Use `.` to navigate nested objects: `data.response.text` - Use numeric indices for arrays: `choices.0.message` - Combine for complex paths: `data.results.0.output.text` --- ### Strip Message Timestamps Removes `timestamp` fields from messages before sending to your API. **When to use:** Your API rejects requests containing timestamp fields. **Default:** Disabled (timestamps included) **Example:** **With timestamps (default):** ```json { "messages": [ { "role": "user", "content": "Hello", "timestamp": "2025-01-15T10:30:00Z" } ] } ``` **With stripping enabled:** ```json { "messages": [ { "role": "user", "content": "Hello" } ] } ``` **Common Error Pattern:** ```json { "message": ["messages.0.property timestamp should not exist"], "statusCode": 400 } ``` If you see this error, enable "Strip Message Timestamps". --- ## Ending the Chat You can end the conversation by setting "status" to "ended" in your response: ```json { "status": "ended", "messages": [...] 
} ``` --- ## Common Configuration Patterns ### Pattern 1: OpenAI-Compatible API ``` Chat Endpoint: https://api.yourdomain.com/chat Authorization: Bearer sk-your-key-here (All other fields: leave empty/default) ``` ### Pattern 2: API with Session Initialization ``` Chat Endpoint: https://api.yourdomain.com/chat Initialization Endpoint: https://api.yourdomain.com/init Authorization: API-Key your-key-here Initialization Payload: { "user_id": "{{persona.user_id}}", "session_id": "{{simulation_output_id}}" } ``` ### Pattern 3: Custom API Format with Template ``` Chat Endpoint: https://api.yourdomain.com/v1/message Authorization: Bearer your-token Input Template: { "user_input": "{{latest_message}}", "session_id": "{{sessionId}}", "conversation_history": {{messages}} } Response Message Path: data.response.text ``` ### Pattern 4: API with Payload Wrapper ``` Chat Endpoint: https://api.yourdomain.com/chat Authorization: Bearer your-token Payload Wrapper: data ``` ### Pattern 5: Complex Custom Format ``` Chat Endpoint: https://api.yourdomain.com/chat Initialization Endpoint: https://api.yourdomain.com/sessions/create Authorization: Bearer static-token Custom Headers: { "X-Session-ID": "{{sessionId}}", "X-User-Context": "{{init_response.user.id}}" } Input Template: { "messages": {{messages}}, "user_id": "{{init_response.user.id}}", "api_version": "v2" } Response Message Path: response.text Strip Message Timestamps: true ``` --- ## Troubleshooting ### Error: "Failed to run simulation due to an unexpected error" **Problem:** This generic error often indicates an issue with your agent configuration, most commonly an invalid input template. **Common causes:** - Unquoted string variables in your input template (e.g. `{{init_response.id}}` instead of `"{{init_response.id}}"`) - Malformed JSON in your input template, initialization payload, or custom data - Invalid field references in template variables **Solution:** Double-check your input template for valid JSON syntax. 
Make sure all string template variables are wrapped in quotes. See the [quoting rules](#input-template) in the Input Template section. ### Error: "Invalid JSON response from endpoint" **Problem:** Your API returned a non-JSON response **Solution:** Ensure your endpoint returns `Content-Type: application/json` ### Error: "Could not extract message from path 'X' in response" **Problem:** The response message path doesn't match your API response structure **Solution:** Verify that the dot-notation path matches your actual response structure ### Error: "messages.0.property timestamp should not exist" **Problem:** Your API rejects timestamp fields **Solution:** Enable "Strip Message Timestamps" ### Tool calls not showing in transcript **Problem:** Tool call extraction is not configured **Solution:** Verify that your API returns an OpenAI-compatible format, or contact support for custom tool call extraction configuration ### Session ID not working in headers **Problem:** The template variable is not being substituted **Solution:** Verify that the initialization endpoint returns a `sessionId` field, or check the custom headers configuration --- ## Best Practices 1. **Start Simple:** Begin with just Chat Endpoint and Authorization, then add complexity as needed 2. **Test Incrementally:** Add one advanced feature at a time and test 3. **Use Template Variables:** Leverage `{{sessionId}}` and init response fields to maintain session state 4. **Validate JSON:** Always validate JSON fields before saving 5. **Check API Logs:** Use your API server logs to debug payload/response format issues 6. **Document Custom Formats:** Keep notes on your API's expected format for future reference --- ## Chat WebSocket Source: https://docs.coval.ai/concepts/agents/connections/chat-websocket Connect to text chat agents over a persistent WebSocket connection ## Overview Chat WebSocket agents communicate via text messages over a persistent WebSocket connection. 
Unlike the standard [Chat (HTTP)](/guides/simulations/chat) integration which uses request-response, Chat WebSocket maintains a single connection for the entire conversation — ideal for agents built on platforms like Genesys, NICE, or custom WebSocket-based chat systems. **When to use Chat WebSocket instead of Chat (HTTP):** - Your agent communicates over WebSocket rather than HTTP POST - Your agent sends multiple messages in response to a single user message - Your platform requires a persistent connection for the conversation lifecycle ## Connection Modes Chat WebSocket supports two connection modes: ### Direct Mode (Default) Connect directly to a WebSocket endpoint. ``` wss://your-agent.example.com/ws/chat ``` ### HTTP-First Mode Call an HTTP endpoint first to create a session, then connect to the WebSocket URL returned in the response. Common with platforms that require session provisioning before establishing a WebSocket connection. **Flow:** 1. Coval sends an HTTP request to your setup endpoint 2. Your API returns a response containing the WebSocket URL 3. 
Coval connects to that WebSocket URL ## Configuration ### Direct Mode Fields | Field | Required | Description | |-------|----------|-------------| | WebSocket Endpoint | Yes | The `wss://` URL to connect to | | Initialization JSON | No | JSON payload sent immediately after connection | | Authorization Header | No | Auth value sent during the WebSocket handshake | | Custom Headers | No | Additional headers for the WebSocket handshake | ### HTTP-First Mode Fields | Field | Required | Description | |-------|----------|-------------| | HTTP Endpoint URL | Yes | The `https://` URL to call for session setup | | HTTP Method | No | Request method (default: POST) | | Request Body | No | JSON body for the HTTP request | | HTTP Headers | No | Headers for the HTTP request | | WebSocket URL Response Path | Yes | Dot-notation path to the WebSocket URL in the response | | Authorization Header | No | Auth value for the WebSocket connection (separate from HTTP headers) | | Initialization JSON | No | JSON payload sent after WebSocket connection | | Custom Headers | No | Additional headers for the WebSocket connection | ## Message Format ### Sending Messages (Coval to Agent) Messages are sent as JSON using a configurable template. The default template: ```json {"type": "message", "text": "{{message}}"} ``` The `{{message}}` placeholder is replaced with the actual message text. 
Customize the template to match your agent's expected format: ```json {"event": "chat", "body": "{{message}}"} ``` ### Receiving Messages (Agent to Coval) Coval extracts text from incoming WebSocket messages using configurable JSON paths: | Setting | Default | Description | |---------|---------|-------------| | Message type path | `type` | Path to the message type field | | Text message type values | `message` | Type value(s) that indicate a text message | | Message text path | `text` | Path to the actual message content | **Example:** For an agent that sends: ```json {"event": "reply", "data": {"content": "Hello!"}} ``` Configure: - Message type path: `event` - Text message type values: `reply` - Message text path: `data.content` ## Message Coalescing Many chat agents send multiple messages in quick succession (e.g., a greeting followed by a question). Coval batches these into a single response using a configurable quiet period. - **Default:** 2.0 seconds - **Set to 0:** Deliver each message immediately (no batching) - **Increase:** For agents that send messages with longer pauses between them ## Handshake Some WebSocket agents send a "ready" message before accepting conversation messages. 
Configure the handshake to wait for this signal: | Setting | Default | Description | |---------|---------|-------------| | Ready message type | *(empty — no wait)* | The message type value that signals readiness | | Handshake timeout | 30 seconds | How long to wait before timing out | **Example:** If your agent sends `{"type": "session_ready"}` when it's ready: - Set Ready message type to `session_ready` ## Direction Filtering If your agent echoes back your outbound messages (common with Genesys), configure direction filtering to skip those echoes: | Setting | Description | |---------|-------------| | Direction path | JSON path to the direction field (e.g., `direction`) | | Outbound direction value | The value indicating an agent-to-user message (e.g., `outbound`) | When configured, only messages matching the outbound direction value are processed. Messages without a direction field or with a different value are skipped. ## Setup Instructions 1. **Create the agent** — Navigate to [Agents](https://app.coval.dev/agents/create), select **Chat** as the agent type, then toggle to **WebSocket** protocol 2. **Choose connection mode** — Select Direct or HTTP-First depending on your platform 3. **Configure the endpoint** — Enter your `wss://` URL (Direct) or HTTP setup endpoint (HTTP-First) 4. **Set message format** — If your agent doesn't use the default `{"type": "message", "text": "..."}` format, customize the send template and receive paths under Advanced Configuration 5. 
**Test** — Create a test set with a single test case and launch a simulation to verify connectivity ## Common Patterns ### Pattern 1: Simple Direct Connection ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws ``` ### Pattern 2: Authenticated Direct Connection ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws Authorization Header: Bearer your-token-here Initialization JSON: {"action": "start_session", "channel": "web"} ``` ### Pattern 3: HTTP-First with Session Provisioning ``` Connection Mode: HTTP-First HTTP URL: https://api.example.com/v1/sessions HTTP Method: POST Request Body: {"channel": "web", "language": "en"} HTTP Headers: {"Authorization": "Bearer your-token"} WebSocket URL Response Path: data.websocket_url ``` ### Pattern 4: Custom Message Format ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws Send Template: {"event": "user_message", "payload": {"text": "{{message}}"}} Message Type Path: event Text Message Type Values: agent_message Message Text Path: payload.text ``` ## Troubleshooting ### Connection Failures **"Timeout connecting to WebSocket"** - Verify the `wss://` URL is correct and publicly accessible - Check that your server accepts WebSocket upgrade requests - Ensure firewall rules allow inbound WebSocket connections **"Failed to connect to WebSocket"** - Confirm the endpoint is running and healthy - Check authorization header format matches what your server expects - For HTTP-First: verify the HTTP setup endpoint returns a valid WebSocket URL ### No Messages Received - Check that your message type path and text message type values match what your agent actually sends - Verify the message text path points to the correct field - If using direction filtering, confirm the outbound direction value is correct - Try increasing the coalesce timeout if messages arrive after the batch window closes ### Handshake Timeout - Confirm your agent sends the expected ready message type - Check that the ready message is 
sent before the timeout (default 30s) - Verify the message type path resolves correctly on the ready message ### Messages Getting Dropped - If your agent echoes your messages back, configure direction filtering - Ensure `text_message_type_values` includes all message types your agent uses for text responses - Check agent logs for messages with unexpected type values ## Technical Requirements | Requirement | Details | |-------------|---------| | Protocol | `wss://` (TLS-encrypted WebSocket) | | Message format | JSON with configurable paths | | Accessibility | Must be publicly accessible from Coval servers | | Concurrency limit | 8 simultaneous simulations | --- ## Chat A2A (JSON-RPC) Source: https://docs.coval.ai/concepts/agents/connections/chat-a2a Connect to text chat agents that speak the A2A v2 JSON-RPC protocol ## Overview Chat A2A agents communicate over HTTPS using the A2A (Agent-to-Agent) v2 JSON-RPC protocol. Unlike the standard [Chat (HTTP)](/guides/simulations/chat) integration which uses a free-form request/response shape, A2A defines a structured `message/send` envelope with stateful conversation tracking via `contextId` and `taskId`, and an explicit end-of-conversation signal via `result.status.state`. **When to use Chat A2A instead of Chat (HTTP) or [Chat WebSocket](/concepts/agents/connections/chat-websocket):** - Your agent implements the A2A v2 JSON-RPC specification - Your agent requires OAuth2 token exchange before accepting messages - Your agent uses `contextId` / `taskId` to maintain conversation state across turns - Your agent signals end-of-conversation through a structured status field rather than a chat reply ## Conversation Flow A2A conversations follow a fixed three-phase shape: 1. **Initialization** — Coval calls your `initialization_endpoint` (typically OAuth2 `client_credentials`) and captures the returned access token for use as the `Authorization: Bearer …` header on subsequent requests. 2. 
**Message turns** — Coval sends each persona message to your `chat_endpoint` as a JSON-RPC `message/send` request. After the first turn, every request echoes back the `contextId` and `taskId` returned by the previous response so your agent can maintain conversation state. 3. **Termination** — When your agent's response sets `result.status.state` to `completed` or `failed` (configurable), Coval shuts down the simulation cleanly and finalizes the transcript. ## Configuration | Field | Required | Description | |-------|----------|-------------| | Chat Endpoint | Yes | URL Coval `POST`s each JSON-RPC `message/send` request to | | Initialization Endpoint | No | URL Coval calls once per simulation to obtain credentials (e.g., OAuth2 token endpoint) | | Initialization Payload | No | Body sent to the initialization endpoint. Stored as JSON; encoded based on Initialization Content-Type | | Initialization Content-Type | No | `application/x-www-form-urlencoded` (default) or `application/json` | | Authorization Header | No | Static auth value. Overridden automatically when the initialization response contains `access_token` | | Custom Headers | No | Additional headers sent on every chat request | | Response Message Path | No | Dot-notation path to the agent's reply text. Default: `result.artifacts.0.parts.0.text` | | Response State Extraction | No | JSON paths to the `contextId` and `taskId` Coval echoes back on each turn. Default: `{"contextId": "result.contextId", "taskId": "result.id"}` | | End State Path | No | Dot-notation path to the conversation status field. Default: `result.status.state` | | End State Values | No | List of status values that terminate the conversation. Default: `["completed", "failed"]` | | Custom Data | No | Arbitrary JSON object merged into `params.metadata` on every chat request (e.g., user identifiers required by your agent) | ## Initialization If your agent requires an access token, configure the initialization endpoint and payload. 
Coval calls it once at the start of each simulation, captures `access_token` from the response, and uses it as `Authorization: Bearer …` for every subsequent chat request. **Example: OAuth2 client_credentials initialization** **Configuration:** ``` Initialization Endpoint: https://auth.example.com/oauth_token.do Initialization Content-Type: application/x-www-form-urlencoded Initialization Payload: {"grant_type":"client_credentials","client_id":"...","client_secret":"..."} ``` **Request Coval sends:** ```http POST /oauth_token.do HTTP/1.1 Host: auth.example.com Content-Type: application/x-www-form-urlencoded grant_type=client_credentials&client_id=...&client_secret=... ``` **Response your auth endpoint should return:** ```json { "access_token": "eyJhbGc...", "token_type": "Bearer", "expires_in": 1799 } ``` Coval extracts `access_token` and uses it as `Authorization: Bearer eyJhbGc...` for every chat request in the simulation. Other fields (`token_type`, `expires_in`, `scope`) are captured but not interpreted. If you don't need authentication, leave Initialization Endpoint blank — Coval will skip the init phase and call your chat endpoint directly. ## Message Format ### Sending Messages (Coval to Agent) Each persona turn is sent as a JSON-RPC 2.0 `message/send` request. Coval auto-generates a fresh `messageId` (UUIDv4) per turn and echoes the previous turn's `contextId` and `taskId`. **Example: Turn 1 request envelope** ```json { "jsonrpc": "2.0", "id": "", "method": "message/send", "params": { "message": { "role": "user", "kind": "message", "messageId": "b1a2f3c4-d5e6-4a7b-8c9d-0e1f2a3b4c5d", "contextId": "", "taskId": "", "parts": [ {"kind": "text", "text": "Hi, I'd like to reschedule my appointment for next week."} ] }, "contextId": "", "taskId": "", "metadata": { "user_reference": "u_4f8c1a2b" } } } ``` On turn 1, `contextId` and `taskId` are empty strings — your agent should treat that as a new conversation and return its own values in the response. 
The `metadata` object is whatever you configured under Custom Data. **Example: Turn 2+ request envelope (with echoed state)** ```json { "jsonrpc": "2.0", "id": "", "method": "message/send", "params": { "message": { "role": "user", "kind": "message", "messageId": "c2d3e4f5-a6b7-4c8d-9e0f-1a2b3c4d5e6f", "contextId": "ctx_a1b2c3d4e5f6", "taskId": "task_x9y8z7w6v5u4", "parts": [ {"kind": "text", "text": "Tuesday at 2pm if possible."} ] }, "contextId": "ctx_a1b2c3d4e5f6", "taskId": "task_x9y8z7w6v5u4", "metadata": { "user_reference": "u_4f8c1a2b" } } } ``` `contextId` and `taskId` are present at both `params.*` and `params.message.*` for compatibility with strict A2A v2 implementations. ### Receiving Messages (Agent to Coval) Coval extracts the assistant text from the path configured under Response Message Path (default: `result.artifacts.0.parts.0.text`) and the conversation state from Response State Extraction. **Example: Expected response shape** ```json { "jsonrpc": "2.0", "id": "", "result": { "id": "task_x9y8z7w6v5u4", "kind": "task", "contextId": "ctx_a1b2c3d4e5f6", "status": { "state": "working", "timestamp": "2026-01-15T14:30:00.000Z" }, "artifacts": [ { "artifactId": "art_p1q2r3s4t5u6", "parts": [ {"kind": "text", "text": "Sure — I can help with that. What date and time work best for you?"} ] } ] } } ``` From this response Coval extracts: - **Reply text** from `result.artifacts.0.parts.0.text` → played back to the persona - **`contextId`** from `result.contextId` → echoed on the next request - **`taskId`** from `result.id` → echoed on the next request - **End-state** from `result.status.state` → if `"completed"` or `"failed"`, simulation terminates > **Warning:** **A2A is text-only.** All content the persona needs to act on must be serialized into `parts[*].text`. 
If your agent embeds rich content (lists, tables, attachments) as a structured non-text part or relies on a UI widget rendered outside the JSON-RPC payload, the simulator will not see it and the persona may loop or stall waiting for context that never arrives. If your agent returns multiple parts on the same artifact, Coval automatically picks the longest text part (some A2A implementations return a short header in `parts[0]` and the full content in `parts[1]`). ## State Extraction Response State Extraction tells Coval where to find the `contextId` and `taskId` your agent returns on each turn so they can be echoed back on the next request. The defaults match the A2A v2 specification: ```json { "contextId": "result.contextId", "taskId": "result.id" } ``` If your agent returns these values at different paths, override the map. The keys must be `contextId` and `taskId` (those are the slots Coval echoes back); the values are the dot-notation paths into your agent's response. Once captured, both values are sent on the next request at `params.contextId`, `params.taskId`, `params.message.contextId`, and `params.message.taskId` for compatibility with strict A2A v2 implementations. ## End-State Detection Coval checks every response against End State Path / End State Values. When the value at End State Path matches one of End State Values, the simulation finalizes cleanly after the current turn. The defaults (`result.status.state` ∈ `["completed", "failed"]`) match the A2A v2 specification. Override them if your agent uses different terminal states: ``` End State Path: result.status.lifecycle End State Values: ["resolved", "abandoned"] ``` ## Setup Instructions **Step: Create the agent** Create a new agent from the **Agents** tab in the [Coval platform](https://app.coval.dev/agents/create), or via the [Coval API](/api-reference/v1/agents/agents/connect-an-agent) with `model_type: "MODEL_TYPE_CHAT_A2A"` and the metadata fields described under Configuration above. 
**Step: Configure initialization** If your agent requires authentication, set the initialization endpoint, payload, and content type. Otherwise leave them blank. **Step: Pass user context via Custom Data** Any per-conversation identifiers your agent expects (user IDs, channel IDs, locale) go into Custom Data — they're merged into `params.metadata` on every chat request. **Step: Verify response shape** If your agent's response paths differ from the defaults, set Response Message Path, Response State Extraction, End State Path, and End State Values to match what your agent actually returns. **Step: Test with a single test case** Create a test set with one test case and launch a simulation to verify connectivity, state echoing, and end-state detection before scaling up. ## Common Patterns **Pattern 1: OAuth2 client_credentials with default A2A v2 paths** ``` Chat Endpoint: https://api.example.com/a2a/v2/agent/id/ Initialization Endpoint: https://auth.example.com/oauth_token.do Initialization Content-Type: application/x-www-form-urlencoded Initialization Payload: {"grant_type":"client_credentials","client_id":"...","client_secret":"..."} Custom Data: {"user_id": ""} ``` All response paths use defaults — no further configuration needed. **Pattern 2: Static bearer token, no init endpoint** ``` Chat Endpoint: https://api.example.com/a2a/v2/agent/id/ Authorization Header: Bearer ``` No initialization request — Coval sends the static `Authorization` header on every chat request. Use this when your agent accepts a long-lived API token instead of requiring an OAuth2 exchange per simulation. 
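The response handling described under Receiving Messages, State Extraction, and End-State Detection can be sketched in a few lines of Python. This is a minimal illustration rather than Coval's implementation: the helper names `resolve_path` and `extract_turn` are hypothetical, and the sketch assumes that numeric path segments (such as the `0` in `result.artifacts.0.parts.0.text`) index into lists. It does not model the longest-text-part selection.

```python
from typing import Any, Dict, List, Optional

# Defaults as documented for the A2A connection.
DEFAULT_MESSAGE_PATH = "result.artifacts.0.parts.0.text"
DEFAULT_STATE_PATHS = {"contextId": "result.contextId", "taskId": "result.id"}
DEFAULT_END_STATE_PATH = "result.status.state"
DEFAULT_END_STATE_VALUES = ["completed", "failed"]


def resolve_path(payload: Any, path: str) -> Any:
    """Walk a dot-notation path; return None when any segment is missing.

    Assumption: numeric segments index into lists, other segments are dict keys.
    """
    current = payload
    for segment in path.split("."):
        if isinstance(current, list) and segment.isdigit():
            index = int(segment)
            current = current[index] if index < len(current) else None
        elif isinstance(current, dict):
            current = current.get(segment)
        else:
            return None
        if current is None:
            return None
    return current


def extract_turn(
    response: Dict[str, Any],
    message_path: str = DEFAULT_MESSAGE_PATH,
    state_paths: Optional[Dict[str, str]] = None,
    end_state_path: str = DEFAULT_END_STATE_PATH,
    end_state_values: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: pull reply text, echoed state, and end-state flag."""
    state_paths = state_paths or DEFAULT_STATE_PATHS
    end_state_values = end_state_values or DEFAULT_END_STATE_VALUES
    return {
        "reply": resolve_path(response, message_path),
        # Keys are the slots echoed back on the next request.
        "state": {slot: resolve_path(response, p) for slot, p in state_paths.items()},
        # True when the value at the end-state path is one of the terminal values.
        "finished": resolve_path(response, end_state_path) in end_state_values,
    }
```

Running `extract_turn` against a captured response from your agent is a quick way to confirm that custom paths resolve to non-null values before launching a simulation.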
## Troubleshooting ### Initialization Failures **"Failed A2A initialization request"** - Verify the initialization endpoint URL is correct and reachable from Coval servers - Check that the content type matches what the endpoint expects (form vs JSON) - Confirm the credentials in the initialization payload are valid **Auth works on init but chat requests return 401** - Confirm the init response includes a top-level `access_token` field — that's what Coval extracts - If your token field has a different name, set it via Authorization Header directly and skip the init endpoint ### State Not Persisting Across Turns - Check that Response State Extraction paths actually resolve in your agent's response (test with a direct request) - Confirm your agent reads `contextId` / `taskId` from the request and uses them to look up conversation history - A2A v2 expects state at both `params.contextId` and `params.message.contextId` — Coval sends both ### Conversation Never Ends - Verify End State Path resolves to a non-null value in your agent's responses - Confirm at least one End State Value matches what your agent actually returns when the conversation completes - If your agent never sends a terminal state, configure a reasonable test-case length so the simulation isn't open-ended ### Persona Loops or Stalls Mid-Conversation - Check that all content the persona needs is serialized as plain text in `parts[*].text` - If your agent produces a short header in `parts[0]` and the full content in `parts[1]`, Coval picks the longest — but neither should reference content that lives outside the JSON-RPC payload (e.g., "see the attached list") ## Technical Requirements | Requirement | Details | |-------------|---------| | Protocol | A2A v2 JSON-RPC over HTTPS | | Method | `message/send` | | Content type (chat) | `application/json` | | Content type (init) | Configurable (`application/x-www-form-urlencoded` default, `application/json` supported) | | Accessibility | Endpoints must be publicly 
accessible from Coval servers | | Payload size | Standard HTTP limits apply | --- ## OpenAI Realtime Source: https://docs.coval.ai/concepts/agents/connections/openai-realtime Evaluate an OpenAI Realtime voice-to-voice agent without hosting your own endpoint ## Overview OpenAI Realtime agents are voice-to-voice agents where Coval connects directly to the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime) on your behalf. You provide an API key, an agent prompt, a voice, and a model — Coval handles everything else. No webhook, no SIP trunk, and no self-hosted server required. This makes OpenAI Realtime the fastest way to evaluate an OpenAI-based agent configuration: tune the prompt, change the voice, bump the temperature, and re-run the same test set to see how the change affects behavior. > **Info:** Use this connection when you want to evaluate an agent built directly on the OpenAI Realtime API. If your production agent runs on Pipecat, LiveKit, or a custom stack, connect it through the [matching integration](/concepts/agents/overview) instead. ## Configuration Requirements The only required field for a `MODEL_TYPE_OPENAI_REALTIME` agent is `metadata.openai_realtime_api_key`. Every other field below is optional — if omitted, Coval applies the default shown. ### OpenAI API Key - **Field**: `openai_realtime_api_key` - **Type**: String (required) - **Purpose**: Authenticates Coval's connection to the OpenAI Realtime API - **Format**: A valid OpenAI API key (typically starts with `sk-`) with Realtime API access enabled - **Security**: Stored encrypted and handled securely ### Agent System Instructions - **Field**: `openai_realtime_instructions` - **Type**: String (optional) - **Default**: `""` (model default instructions) - **Purpose**: System prompt sent as session instructions to the Realtime model - **Use Cases**: Role definition, behavior guidelines, response formatting - **Example**: `"You are a helpful customer support agent. 
Always be polite and provide accurate information."` ### Model - **Field**: `openai_realtime_model` - **Type**: String (optional) - **Default**: `gpt-realtime-2` (used when the field is omitted) - **Purpose**: Selects which OpenAI Realtime model powers the agent ### Agent Voice - **Field**: `openai_realtime_voice` - **Type**: String (optional) - **Default**: `alloy` - **Purpose**: Prebuilt voice used by the agent **Available Voices:** `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse` ### Temperature - **Field**: `openai_realtime_temperature` - **Type**: Number (optional) - **Default**: `0.8` - **Range**: `0.0` (deterministic) to `2.0` (very random) - **Purpose**: Controls randomness in the agent's responses ### Simulation Timeout - **Field**: `simulation_timeout_seconds` - **Type**: Integer (optional) - **Default**: `900` (15 minutes) - **Range**: `1` to `1800` (30 minutes) - **Purpose**: Maximum duration for a single simulated conversation ## Setup Instructions **Step: Get an OpenAI API key** Create a key in your [OpenAI dashboard](https://platform.openai.com/api-keys) and confirm it has Realtime API access. **Step: Create the agent in Coval** Navigate to [app.coval.dev/agents/create](https://app.coval.dev/coval/agents/create) and select **OpenAI Realtime** under Voice-to-Voice. **Step: Configure the agent** Paste your OpenAI API key, write your system instructions, pick a model and voice, and adjust temperature if needed. **Step: Run a simulation** Create a test set, launch a simulation, and review the transcript and metric outputs. 
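The field constraints listed under Configuration Requirements can be checked client-side before an agent is saved. The sketch below is an assumption-laden illustration: the function name `validate_openai_realtime_metadata` is hypothetical, it assumes `simulation_timeout_seconds` lives in the same metadata object as the other fields, and Coval still performs its own validation at save time.

```python
# Allow-listed voices and defaults as documented for MODEL_TYPE_OPENAI_REALTIME.
OPENAI_REALTIME_VOICES = {
    "alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse",
}
OPENAI_REALTIME_DEFAULTS = {
    "openai_realtime_instructions": "",
    "openai_realtime_model": "gpt-realtime-2",
    "openai_realtime_voice": "alloy",
    "openai_realtime_temperature": 0.8,
    "simulation_timeout_seconds": 900,  # assumed to sit alongside the other fields
}


def validate_openai_realtime_metadata(metadata: dict) -> dict:
    """Hypothetical pre-flight check: apply defaults, then enforce documented ranges."""
    if not metadata.get("openai_realtime_api_key"):
        raise ValueError("openai_realtime_api_key is required")

    merged = {**OPENAI_REALTIME_DEFAULTS, **metadata}

    voice = merged["openai_realtime_voice"]
    if voice not in OPENAI_REALTIME_VOICES:
        raise ValueError(f"unsupported voice: {voice!r}")

    temperature = merged["openai_realtime_temperature"]
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        raise ValueError("temperature must be a number")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    timeout = merged["simulation_timeout_seconds"]
    if isinstance(timeout, bool) or not isinstance(timeout, int):
        raise ValueError("simulation_timeout_seconds must be an integer")
    if not 1 <= timeout <= 1800:
        raise ValueError("simulation_timeout_seconds must be between 1 and 1800")

    return merged
```

The returned dictionary shows the effective configuration once defaults are applied, which can be useful when comparing two agent variants.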
## Creating via the API OpenAI Realtime agents can also be created with the [v1 Agents API](/api-reference/v1/agents/agents/connect-an-agent): ```bash curl -X POST https://api.coval.dev/v1/agents \ -H "x-api-key: YOUR_COVAL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "display_name": "OpenAI Realtime Voice Agent", "model_type": "MODEL_TYPE_OPENAI_REALTIME", "metadata": { "openai_realtime_api_key": "sk-...", "openai_realtime_instructions": "You are a helpful customer support agent...", "openai_realtime_model": "gpt-realtime-2", "openai_realtime_voice": "alloy", "openai_realtime_temperature": 0.8 } }' ``` ## How Simulations Work When you launch a simulation against an OpenAI Realtime agent, Coval: 1. Opens a connection to the OpenAI Realtime API using your API key, model, voice, and system instructions 2. Plays the simulated user's turns into the live session 3. Captures the agent's audio responses and transcribes them 4. Records the full conversation transcript 5. Runs your configured metrics against the transcript ## Troubleshooting **Invalid or missing API key** - Confirm the key is valid in your OpenAI dashboard - Verify your OpenAI account has Realtime API access **Voice rejected on save** - Voice must match one of the allow-listed values above. Other strings are rejected at save time. **Temperature validation error** - Temperature must be between `0.0` and `2.0`. Values outside that range are rejected at save time. 
**Agent is unresponsive or cuts off early** - Increase **Simulation Timeout** if conversations are being truncated - Check that your system instructions don't tell the agent to end calls immediately --- ## Gemini Live Source: https://docs.coval.ai/concepts/agents/connections/gemini-live Evaluate a Google Gemini Live voice-to-voice agent without hosting your own endpoint ## Overview Gemini Live agents are voice-to-voice agents where Coval connects directly to Google's [Gemini Live API](https://ai.google.dev/gemini-api/docs/live) on your behalf. You provide an API key, an agent prompt, a voice, and a model — Coval handles everything else. No webhook, no SIP trunk, and no self-hosted server required. This makes Gemini Live the fastest way to evaluate a Gemini-based agent configuration: tune the prompt, change the voice, bump the temperature, and re-run the same test set to see how the change affects behavior. > **Info:** Use this connection when you want to evaluate an agent built directly on Gemini Live. If your production agent runs on Pipecat, LiveKit, or a custom stack, connect it through the [matching integration](/concepts/agents/overview) instead. ## Configuration Requirements The only required field for a `MODEL_TYPE_GEMINI_REALTIME` agent is `metadata.gemini_realtime_api_key`. Every other field below is optional — if omitted, Coval applies the default shown. ### Google AI API Key - **Field**: `gemini_realtime_api_key` - **Type**: String (required) - **Purpose**: Authenticates Coval's connection to Gemini Live - **Format**: Any valid Google AI key. Supports both `AIzaSy…` (AI Studio) and `AQ.…` (Express Mode) prefixes. - **Security**: Stored encrypted and handled securely > **Note:** Authentication errors (invalid key, missing Gemini Live access) surface the first time you run a simulation, not at agent save time. If your first simulation fails with an auth error, double-check the key has Gemini Live enabled. 
### Agent System Instructions - **Field**: `gemini_realtime_instructions` - **Type**: String (optional) - **Default**: `""` (model default instructions) - **Purpose**: System prompt sent to Gemini Live as the agent's `system_instruction` - **Use Cases**: Role definition, behavior guidelines, response formatting - **Example**: `"You are a helpful customer support agent. Always be polite and provide accurate information."` ### Model - **Field**: `gemini_realtime_model` - **Type**: String (optional) - **Default**: `models/gemini-3.1-flash-live-preview` (used when the field is omitted) - **Purpose**: Selects which Gemini Live model powers the agent **Available Models:** | Model ID | Description | |----------|-------------| | `models/gemini-3.1-flash-live-preview` | Gemini 3.1 Flash Live — current recommended model | | `models/gemini-2.0-flash-live-001` | Gemini 2.0 Flash Live — stable baseline | ### Agent Voice - **Field**: `gemini_realtime_voice` - **Type**: String (optional) - **Default**: `Charon` - **Purpose**: Prebuilt voice used by the agent **Available Voices:** | Voice | Description | |-------|-------------| | `Puck` | Upbeat, neutral | | `Charon` | Neutral, balanced | | `Kore` | Warm, female | | `Fenrir` | Deep, male | | `Aoede` | Bright, female | | `Leda` | Clear, female | | `Orus` | Authoritative, male | | `Zephyr` | Breezy, neutral | ### Temperature - **Field**: `gemini_realtime_temperature` - **Type**: Number (optional) - **Default**: `0.8` - **Range**: `0.0` (deterministic) to `2.0` (very random) - **Purpose**: Controls randomness in the agent's responses ### Simulation Timeout - **Field**: `simulation_timeout_seconds` - **Type**: Integer (optional) - **Default**: `900` (15 minutes) - **Range**: `1` to `1800` (30 minutes) - **Purpose**: Maximum duration for a single simulated conversation ## Setup Instructions **Step: Get a Google AI API key** Create a key at [Google AI Studio](https://aistudio.google.com/apikey) and confirm it has Gemini Live access. 
Both `AIzaSy…` (AI Studio classic) and `AQ.…` (Express Mode) keys are supported. **Step: Create the agent in Coval** Navigate to [app.coval.dev/agents/create](https://app.coval.dev/coval/agents/create) and select **Gemini Live** under Voice-to-Voice. **Step: Configure the agent** Paste your Google AI API key, write your system instructions, pick a model and voice, and adjust temperature if needed. **Step: Run a simulation** Create a test set, launch a simulation, and review the transcript and metric outputs. ## Creating via the API Gemini Live agents can also be created with the [v1 Agents API](/api-reference/v1/agents/agents/connect-an-agent): ```bash curl -X POST https://api.coval.dev/v1/agents \ -H "x-api-key: YOUR_COVAL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "display_name": "Gemini Live Voice Agent", "model_type": "MODEL_TYPE_GEMINI_REALTIME", "metadata": { "gemini_realtime_api_key": "AIzaSy...", "gemini_realtime_instructions": "You are a helpful customer support agent...", "gemini_realtime_model": "models/gemini-3.1-flash-live-preview", "gemini_realtime_voice": "Charon", "gemini_realtime_temperature": 0.8 } }' ``` ## How Simulations Work When you launch a simulation against a Gemini Live agent, Coval: 1. Opens a connection to Gemini Live using your API key, model, voice, and system instructions 2. Plays the simulated user's turns into the live session 3. Captures the agent's audio responses and transcribes them 4. Records the full conversation transcript 5. 
Runs your configured metrics against the transcript ## Troubleshooting **Invalid or missing API key** - Double-check that the key is entered correctly with no extra whitespace - Confirm the key has Gemini Live access enabled in Google AI Studio **Agent is unresponsive or cuts off early** - Increase **Simulation Timeout** if conversations are being truncated - Check that your system instructions don't tell the agent to end calls immediately **Voice doesn't sound right** - Switch to a different voice from the allow-list — each has a distinct character - Confirm you're using a `native-audio` model, which supports the full voice catalog **Temperature validation error** - Temperature must be between `0.0` and `2.0`. Values outside that range are rejected at save time. --- ## SMS Simulations Source: https://docs.coval.ai/guides/simulations/sms Simulate SMS conversations with your agent ## Overview The SMS Simulator enables you to test and evaluate SMS-based AI agents by conducting automated text message conversations. It simulates a real customer interacting with your SMS agent, sending messages and receiving responses just as an actual user would. [Image: SMS Conversation] --- ## How It Works 1. **Test Case Delivery**: Each test case from your test set is sent as an SMS message to your configured phone number 2. **Agent Response**: Your SMS agent receives the message and responds 3. **Conversation Flow**: The simulator continues the conversation naturally, responding to your agent's messages as a realistic customer would 4. **Completion**: The conversation ends when the test scenario objective is achieved, the conversation reaches a natural conclusion, or the maximum simulation time (15 minutes) is reached 5. **Evaluation**: The complete message exchange is captured and evaluated against your configured metrics --- ## Setup ### 1. Create an SMS Agent 1. Navigate to **Agents** in your dashboard 2. Click **Create Agent** 3. 
Under the **Text** section, select **SMS** [Image: SMS Agent Selector] 4. Enter a **Display Name** for your agent (e.g., "Customer Support SMS Bot") 5. Enter your agent's **Phone Number** in E.164 format: - Format: `+[country code][number]` - Example: `+14155551234` (US number) - Example: `+442071234567` (UK number) [Image: SMS Agent Configuration] ### 2. Configure Your Agent (Optional) You can add optional configuration: - **System Prompt**: Does not affect the simulation itself, but gives Coval more context when generating test sets, workflows, and metrics for this agent - **Attributes**: Custom metadata tags for organizing your agents ### 3. Run an Evaluation 1. Go to **Evaluations** and click **New Evaluation** 2. Select your SMS agent 3. Choose a **Test Set** containing the conversations you want to simulate 4. Select **Metrics** to evaluate (e.g., response quality, task completion) 5. Configure run settings: - **Iterations**: Number of times to run each test case (default: 10) - **Concurrency**: Parallel simulations (default: 5) 6. Click **Launch** --- ## Key Features - **SMS-Optimized**: Responses are kept short and concise to reflect real SMS communication patterns - **Full Transcripts**: Every message is captured with timestamps for detailed analysis --- ## Metrics Compatibility SMS simulations generate text transcripts that are compatible with all standard text-based metrics, including: - Response accuracy - Conversation completion rate - Goal achievement - Custom LLM-judged metrics --- ## Best Practices 1. **Phone Number Format**: Always use E.164 format for phone numbers (e.g., `+1` followed by 10 digits for US numbers) 2. **Use Test Numbers**: Use a test/staging number rather than production during initial testing 3. **Start with Low Concurrency**: Begin with low concurrency to avoid rate limiting on your SMS provider 4.
**Design Realistic Scenarios**: Include varied test cases that cover different user intents and edge cases 5. **Keep Messages Concise**: Your agent's responses should be brief—SMS has character limitations and customers expect short messages 6. **Response Time**: Your agent should respond within a reasonable timeframe for accurate simulation flow --- ## Limitations - Maximum simulation duration: 15 minutes per conversation - Agent must be accessible via standard SMS messaging --- ## WebSocket (Voice) Source: https://docs.coval.ai/concepts/agents/connections/websocket Connect Coval to a real-time voice agent over WebSocket using configurable audio messages ## Overview WebSocket voice agents stream audio over a single persistent WebSocket connection. Coval can exchange raw binary PCM frames or JSON envelopes that wrap base64-encoded PCM / MP3 audio, plus configured non-audio events (cart updates, session signals) the agent emits. Use this connection type for voice agents that: - Stream audio over WebSocket rather than SIP, WebRTC, or HTTP. - Receive Coval's Linear PCM audio at a fixed sample rate and return PCM or MP3 audio. - Optionally send structured side-events, such as cart updates or session status messages, alongside audio. For text-only WebSocket agents (token-by-token chat) see [Chat WebSocket](/concepts/agents/connections/chat-websocket). ## Connection modes | Mode | When to use | | --- | --- | | **Direct** | The agent exposes a stable `wss://` URL Coval can dial directly. | | **HTTP-first** | The agent requires an HTTP setup call to provision a per-session WebSocket URL before the audio stream begins. | In HTTP-first mode, Coval makes the configured HTTP request, extracts the WebSocket URL using `websocket_url_response_path`, and opens the audio WebSocket against that URL. ## Authentication WebSocket voice agents authenticate during the WebSocket upgrade. 
- **Authorization header** — set `authorization_header` to the auth value Coval should send during the WebSocket upgrade. Values like `Bearer ` and `Basic ` are sent as the `Authorization` header value. Values like `X-API-Key ` are sent as the `X-API-Key` header. - **Query-string token** — when the agent only supports browser-style auth, encode the token directly in the `endpoint`, for example `wss://example.com/ws?token=...`. - **Custom headers** — `custom_headers` accepts additional upgrade headers. In the UI, add header name/value rows. Through the API, send `metadata.custom_headers` as a JSON object or as a JSON-encoded object string, for example `{"X-Foo":"bar"}` or `"{\"X-Foo\":\"bar\"}"`. The UI masks the authorization header field. Do not pass JSON in `authorization_header`; use `custom_headers` for additional named headers. Tokens included directly in the `endpoint` query string may be visible anywhere URLs are logged, so prefer `authorization_header` when the agent supports it. ## Audio transport Audio can be exchanged as raw PCM bytes or as JSON envelopes containing a base64-encoded audio payload. The default JSON shape is `audio_chunk` / `data`; the [JSON audio preset](#json-audio-preset) uses `audio_message` / `audio_bytes`. Both JSON shapes are configurable per agent, and setting `send_audio_template` to exactly `{{audio_data}}` makes outbound audio raw bytes instead. Coval’s simulator only sends Linear PCM. The JSON audio preset uses: - Codec: PCM (linear) - Sample rate: 16 000 Hz - Bit depth: 16-bit - Endianness: little-endian - Channels: 1 (mono) - Recommended frame duration for peer implementations: 20-100 ms Message frames can be bidirectional. Use `audio_message_type_value` to identify the agent frames that contain inbound audio, and use `send_audio_template` to shape Coval-originated audio frames. 
For the JSON audio preset, Coval sends `audio_message` frames with `sender: "USER"` and the agent should send its own `audio_message` frames with `sender: "AI"`. ### Audio format fields | Field | Default | Purpose | | --- | --- | --- | | `endpoint` | – (required) | `wss://` URL Coval connects to in direct mode. Plain `ws://`, `http://`, and `https://` endpoints are rejected for direct WebSocket connections. | | `connection_mode` | `direct` | `direct` or `http_first`. | | `initialization_json` | empty | Optional JSON object Coval sends after the WebSocket upgrade and before any ready-message wait. | | `send_audio_template` | `{"type":"audio_chunk","data":"{{audio_data}}"}` | Outbound JSON template. Must contain `{{audio_data}}`. Setting it to literally `{{audio_data}}` sends raw PCM bytes (no JSON wrapping). | | `message_type_path` | `type` | Dot-notation path to the field that names the message kind. | | `audio_message_type_value` | `audio_chunk` | Value that identifies an inbound audio frame. Use `*` to treat every JSON message as audio. | | `audio_data_path` | `data` | Dot-notation path to the base64 audio payload inside an inbound frame. | | `audio_encoding` | `pcm` | Inbound JSON audio payload encoding: `pcm` or `mp3`. MP3 frames are decoded to 16 kHz mono PCM before evaluation. | | `receive_audio_channels` | `2` | `1` for mono inbound JSON PCM, `2` to keep the legacy stereo-to-mono averaging behavior. | | `send_sample_rate_hertz` | `16000` | Outbound sample rate Coval sends to the agent. Allowed: 8 000, 16 000, 24 000, 48 000. | | `receive_sample_rate_hertz` | `48000` | Sample rate the agent sends. Allowed: 8 000, 16 000, 24 000, 48 000. | | `pipeline_sample_rate_hertz` | `16000` | Coval processing rate; must stay 16 000. | | `pace_inbound_binary_audio` | inferred | Pace inbound binary PCM in real time so resampling and metrics see realistic timing. Defaults on when outbound audio is configured for raw PCM bytes and off for JSON templates. 
| The default receive rate is higher than the send rate because many voice integrations return higher-rate audio while receiving 16 kHz audio from Coval. Paths use dot notation for nested fields, for example `payload.audio.data`. Match `send_sample_rate_hertz` / `receive_sample_rate_hertz` to the agent's actual stream format; mismatched sample rates can cause speed, pitch, or quality issues. ### HTTP-first setup fields These fields apply when `connection_mode` is `http_first`: | Field | Default | Purpose | | --- | --- | --- | | `http_url` | – (required) | `https://` setup endpoint Coval calls before opening the WebSocket. | | `http_method` | `POST` | Setup request method. Allowed: `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `HEAD`, `OPTIONS`. | | `http_request_body` | `{}` | JSON object body for the setup request. | | `http_headers` | `{}` | JSON object of headers for the setup request. | | `websocket_url_response_path` | – (required) | Dot-notation path to the WebSocket URL in the setup response, for example `data.websocket_url`. | | `authorization_header` | empty | Auth value for the WebSocket upgrade after setup. This is separate from `http_headers`. | | `custom_headers` | `{}` | Additional headers for the WebSocket upgrade after setup. | ### Handshake | Field | Default | Purpose | | --- | --- | --- | | `handshake_ready_message_type` | `session_ready` in direct mode; empty in HTTP-first mode | Set to an empty string to skip the ready-message wait. | | `handshake_requires_session_id` | `true` in direct mode; `false` in HTTP-first mode | When `true`, the ready message must include `session_id`. | | `handshake_timeout_seconds` | `30` | Seconds Coval waits for the ready message. | With the default `message_type_path` of `type`, a direct-mode ready message looks like: ```json { "type": "session_ready", "session_id": "abc123" } ``` If you customize `message_type_path`, Coval uses that same path to find the ready-message type. 
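The outbound template rule and the ready-message check described above can be sketched as follows. This is an illustration under stated assumptions, not Coval's code: the function names are hypothetical, the base64 payload is assumed to use the standard alphabet, and `session_id` is assumed to sit at the top level of the ready message as in the example above.

```python
import base64
import json
from typing import Any, Union

PLACEHOLDER = "{{audio_data}}"
DEFAULT_TEMPLATE = '{"type":"audio_chunk","data":"{{audio_data}}"}'


def render_audio_frame(template: str, pcm: bytes) -> Union[bytes, str]:
    """Return raw bytes for the bare placeholder, else a JSON text frame."""
    if template == PLACEHOLDER:
        return pcm  # raw PCM bytes, no JSON wrapping
    if PLACEHOLDER not in template:
        raise ValueError("send_audio_template must contain {{audio_data}}")
    encoded = base64.b64encode(pcm).decode("ascii")
    # Base64 output contains no characters that need JSON escaping.
    return template.replace(PLACEHOLDER, encoded)


def get_path(message: Any, path: str) -> Any:
    """Resolve a dot-notation path through nested dicts; None if missing."""
    current = message
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def is_ready_message(
    message: dict,
    message_type_path: str = "type",
    ready_type: str = "session_ready",
    requires_session_id: bool = True,
) -> bool:
    """Check an inbound JSON message against the handshake settings."""
    if get_path(message, message_type_path) != ready_type:
        return False
    if requires_session_id:
        return bool(message.get("session_id"))  # assumed top-level field
    return True


# Usage: render one frame with the default template and confirm it parses.
frame = render_audio_frame(DEFAULT_TEMPLATE, b"\x00\x01\x02\x03")
print(json.loads(frame)["type"])
```

A helper like this can be handy when building a local mock agent, since it produces frames in the same shape the configuration describes.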
### Non-audio event capture Many voice agents emit side-events alongside the audio stream — cart updates, transcript fragments, session telemetry. By default, Coval ignores non-audio JSON messages. To tell Coval which message types to accept, set: ```json { "non_audio_event_message_types": ["system_notify"] } ``` For each matching message, Coval reads: - `event_type` — the value at `message_type_path` (for example `system_notify`). - `event_name` — the optional `event` field from the payload (for example `ocb:cart-updated`). - `payload` — the full parsed JSON message. For example, when `message_type_path` is `action` and `non_audio_event_message_types` includes `system_notify`, this inbound message is accepted as a non-audio event: ```json { "action": "system_notify", "event": "ocb:cart-updated", "payload": { "items": [ { "name": "latte", "quantity": 1, "modifiers": ["oat milk"], "price": 4.5 } ], "subtotal": 4.5, "total": 4.5 } } ``` Coval does not emit these events to your agent. It only receives configured message types and ignores unconfigured non-audio JSON messages. Accepted event messages are stored with the simulation transcript as `websocket_event` entries. Transcript-based metrics, including LLM judge metrics, see JSON that includes `event_type`, `event_name`, and the full `payload`, so they can evaluate structured side-channel data such as cart contents, selected menu items, modifiers, quantities, and prices alongside the spoken conversation. ### Media (image) frames Voice WebSocket simulations can attach images from a test case mid-conversation. `send_media_template` controls the outbound shape: ```json { "type": "media", "name": "{{media_name}}", "mime_type": "{{mime_type}}", "data": "{{media_data}}" } ``` Template rules: - `{{media_data}}` is required. - `{{media_name}}` and `{{mime_type}}` are optional placeholders. - If the template is exactly `{{media_data}}`, Coval sends raw bytes. 
- Otherwise, Coval base64-encodes the image and substitutes it into your JSON template. See [Test Sets — image attachments](/concepts/test-sets/overview#5-image-attachment) for the test-case side. ## Examples Initialization payload: ```json { "action": "start_session", "session_type": "simulation", "metadata": { "source": "coval", "test_mode": true } } ``` Custom WebSocket upgrade headers: ```json { "X-Client-ID": "coval-simulation", "X-API-Version": "2024-01", "X-Environment": "production" } ``` ## JSON audio preset The agent UI ships a `JSON audio` preset that fills the metadata for JSON audio WebSocket agents. It sets: ```json { "connection_mode": "direct", "websocket_compat_profile": "json_audio", "initialization_json": "", "handshake_ready_message_type": "", "handshake_requires_session_id": false, "send_sample_rate_hertz": 16000, "receive_sample_rate_hertz": 16000, "send_audio_template": "{\"action\":\"audio_message\",\"payload\":{},\"audio_bytes\":\"{{audio_data}}\",\"sender\":\"USER\"}", "message_type_path": "action", "audio_message_type_value": "audio_message", "audio_data_path": "audio_bytes", "audio_encoding": "pcm", "receive_audio_channels": 1, "non_audio_event_message_types": ["system_notify"] } ``` Set `authorization_header` to `Bearer ` after picking the preset if the agent requires auth (most production endpoints do). ## Setup 1. **Prepare the agent endpoint.** Confirm `wss://` is reachable, audio format matches the configuration above, and decide whether the agent requires Bearer auth. 2. **Create the agent in Coval.** Open the Agents page in your Coval org, choose **WebSocket** as the connection type, and either fill the fields manually or apply the JSON audio preset. 3. **Smoke test.** Build a small test set with a single voice persona and run a simulation. The transcript should show alternating turns, the result page should expose usable audio, and any configured side-events should be available to transcript-based metrics. 
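The non-audio event capture rules described earlier (accept only configured message types, then record `event_type`, `event_name`, and `payload`) can be sketched as below. The function name `capture_event` is hypothetical and the sketch is only an approximation of the documented behavior; it does not cover audio frames or the `*` wildcard.

```python
import json
from typing import Any, List, Optional


def get_path(message: Any, path: str) -> Any:
    """Resolve a dot-notation path through nested dicts; None if missing."""
    current = message
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def capture_event(
    raw_message: str,
    message_type_path: str,
    accepted_types: List[str],
) -> Optional[dict]:
    """Return an event record for configured message types, else None."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        return None  # not JSON, so not a structured side-event
    if not isinstance(message, dict):
        return None

    event_type = get_path(message, message_type_path)
    if event_type not in accepted_types:
        return None  # unconfigured non-audio messages are ignored

    return {
        "event_type": event_type,            # value at message_type_path
        "event_name": message.get("event"),  # optional `event` field
        "payload": message,                  # full parsed JSON message
    }
```

With `message_type_path` set to `action` and `non_audio_event_message_types` set to `["system_notify"]`, the cart-update message shown earlier would produce a record whose `event_name` is `ocb:cart-updated`.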
## How simulations work 1. Coval performs any HTTP-first setup, then opens the WebSocket with any configured Bearer token or custom headers. 2. If `handshake_ready_message_type` is set, Coval waits for the ready message before sending audio. 3. Coval streams persona audio outward using `send_audio_template` at the configured sample rate: raw PCM bytes for `{{audio_data}}`, or JSON text frames for any JSON template. 4. Inbound binary frames or matching JSON audio frames are decoded and resampled if needed. 5. Inbound non-audio JSON messages whose type is in `non_audio_event_message_types` are accepted; unconfigured non-audio messages are ignored. 6. When the persona finishes, Coval closes the WebSocket cleanly. ## Troubleshooting **Empty transcript with audio frames flowing.** Check that `audio_message_type_value` matches the agent’s field, that `audio_data_path` points at the base64 payload, and that `audio_encoding` matches the wire format. **Inbound audio sounds half-speed or distorted.** Confirm `receive_audio_channels`. JSON PCM that arrives mono should be configured with `receive_audio_channels: 1`; the historical default `2` averages two channels and halves the apparent rate when the source is mono. **Cart events / status messages look ignored.** Add the action value to `non_audio_event_message_types`. Without it, Coval ignores non-audio JSON messages. **Auth failures during handshake.** Verify the `authorization_header` value, or move the token to a `?token=...` query string when the agent only supports browser-style auth. **Connection refused locally.** Tunnel the agent’s `ws://` server through ngrok or Cloudflare Tunnel and use the resulting `wss://` URL as the agent endpoint. ```bash ngrok http 8080 # Use the generated wss:// URL as the agent endpoint. ``` If your tunnel provider shows an `https://` URL, use the corresponding `wss://` URL in Coval. 
Update the agent configuration when the tunnel URL changes, or use a reserved tunnel domain for a stable endpoint. **Unreadable audio or media payloads.** For JSON audio/media templates, Coval substitutes base64 data into `{{audio_data}}` / `{{media_data}}`; for raw templates, the agent must expect raw PCM or media bytes. Verify the JSON is valid, the configured message fields match the agent payload, `audio_encoding` is correct, and `send_media_template` includes `{{media_name}}` / `{{mime_type}}` when the agent needs file metadata. **Timeouts or no response.** Confirm the agent keeps the WebSocket open for the whole conversation, processes incoming audio frames without blocking, sends audio responses in the configured shape, and logs initialization / ready messages while testing. ## Best practices 1. Pick the JSON audio preset (or a similar named preset) instead of hand-filling fields when one exists. It keeps the metadata canonical for the agent shape. 2. Mirror the agent’s sample rate exactly in `send_sample_rate_hertz` / `receive_sample_rate_hertz`. Resampling is supported but degrades audio. 3. Capture the side-events you care about by adding their `action` values to `non_audio_event_message_types`. Don’t silently rely on the agent emitting them. 4. Keep the agent’s WebSocket handler long-lived and avoid closing the connection while the simulation is active. 5. Log initialization payloads, ready messages, and payload parsing errors during initial setup. 6. Rotate Bearer tokens on a schedule; Coval re-reads the value at every connection setup. --- ## Pipecat Cloud Source: https://docs.coval.ai/concepts/agents/connections/pipecat Connect your Pipecat Cloud agent to Coval to run Voice AI simulations ## Overview The Pipecat Cloud connection lets you run Coval simulations against agents hosted on [Pipecat Cloud](https://docs.pipecat.daily.co). 
Once connected, Coval can automatically call your Pipecat agent, simulate realistic conversations, and evaluate its performance, with no manual testing required.

## Configuration Requirements

### Agent Name

- **Field**: `agent_name`
- **Type**: String (required)
- **Purpose**: The name of the Pipecat Cloud agent that Coval will call during simulations
- **Format**: Must exactly match the agent name in your Pipecat Cloud dashboard
- **Example**: `"customer-service-agent"`, `"sales-assistant"`

### Pipecat API Key

- **Field**: `pipecat_api_key`
- **Type**: String (required)
- **Purpose**: Authentication key that allows Coval to connect to your Pipecat Cloud agent
- **Format**: Valid API key string from your Pipecat account
- **Security**: Stored encrypted and handled securely
- **Example**: `"pk_live_abc123def456..."`

### Custom Data

- **Field**: `custom_data`
- **Type**: String (optional)
- **Purpose**: Additional context passed to your Pipecat agent at the start of each simulation
- **Format**: Valid JSON string
- **Use Cases**: Agent-specific parameters, session context, custom settings
- **Example**: `{"department": "support", "language": "en", "priority": "high"}`

## Setup Instructions

1. **Deploy your agent to Pipecat Cloud**
   - Follow the [Pipecat Cloud Quickstart](https://docs.pipecat.daily.co/quickstart) to deploy your agent
   - Confirm your agent is listed in the Pipecat Cloud dashboard
2. **Connect it to Coval**
   - Go to [app.coval.dev/coval/agents/create](https://app.coval.dev/coval/agents/create)
   - Enter the exact agent name as it appears in Pipecat Cloud
   - Input your Pipecat API key
   - Add any required custom data in JSON format
3. **Run a simulation**
   - Create a test set with scenarios for your agent
   - Launch a simulation to verify the connection works end-to-end

## How Simulations Work

When you launch a simulation, Coval:

1. Authenticates with Pipecat Cloud using your API key
2. Starts a session with your specified agent (along with any custom data)
3. Simulates a realistic voice conversation using your test set and persona configuration
4. Records the full interaction and runs your configured evaluation metrics

## Troubleshooting

**Common Issues:**

- **Authentication Failures**: Verify your Pipecat API key is valid and has the correct permissions
- **Invalid Custom Data**: Ensure the JSON format is valid

---

## LiveKit

Source: https://docs.coval.ai/concepts/agents/connections/livekit

Connect to the LiveKit real-time communication platform for advanced audio/video agents

## Overview

The LiveKit connection enables integration with agents built on LiveKit's real-time communication platform for audio, video, and data streaming. This connection type supports both LiveKit Cloud and self-hosted LiveKit deployments.

## Configuration Requirements

### Generate Token Endpoint

- **Field**: `generate_token_endpoint`
- **Type**: String (required)
- **Purpose**: Endpoint for generating LiveKit access tokens
- **Format**: Valid HTTPS URL
- **Example**: `https://your-api.com/livekit/token`

Coval sends a POST request to this endpoint with:

```json
{
  "room_name": "uuid-generated-by-coval",
  "participant_name": "simulated_user"
}
```

Your endpoint should return:

```json
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "serverUrl": "wss://your-livekit-server.com",
  "room_name": "uuid-generated-by-coval"
}
```

### LiveKit URL

- **Field**: `livekit_url`
- **Type**: String (required)
- **Purpose**: LiveKit server WebSocket URL
- **Format**: Valid WebSocket URL (wss://)
- **Example**: `wss://your-livekit-server.com`

> **Note:** If your token endpoint returns `serverUrl` or `server_url`, that value will override this configuration.
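
The override rule in the note above can be sketched in a few lines of Python. This is only an illustration of the documented precedence, assuming the token endpoint response has already been parsed into a dict; the helper name `resolve_livekit_url` is hypothetical and is not part of any Coval or LiveKit API, and treating an empty string as "not provided" is an assumption.

```python
from typing import Mapping


def resolve_livekit_url(configured_url: str, token_response: Mapping[str, object]) -> str:
    """Pick the LiveKit server URL for a simulation (hypothetical helper).

    Mirrors the documented precedence: a `serverUrl` or `server_url` value in
    the token response overrides the configured `livekit_url`.
    """
    for key in ("serverUrl", "server_url"):
        value = token_response.get(key)
        # Assumption: blank or non-string values count as "not provided".
        if isinstance(value, str) and value.strip():
            return value.strip()
    return configured_url


if __name__ == "__main__":
    response = {"token": "example-token", "serverUrl": "wss://other-livekit-server.com"}
    # The response value wins over the configured URL.
    print(resolve_livekit_url("wss://your-livekit-server.com", response))
```

If the configured `livekit_url` seems to be ignored during a simulation, checking whether the token endpoint also returns a server URL is a reasonable first step.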
### Generate Token Headers

- **Field**: `generate_token_headers`
- **Type**: String (optional)
- **Purpose**: HTTP headers for token generation requests
- **Format**: Valid JSON string
- **Example**: `{"Authorization": "Bearer your-api-key", "Content-Type": "application/json"}`

### Sandbox ID

- **Field**: `sandbox_id`
- **Type**: String (optional)
- **Purpose**: LiveKit Cloud sandbox identifier for automatic agent dispatch
- **When to use**: Only required when using LiveKit Cloud's managed sandbox feature
- **When to skip**: Leave empty if self-hosting LiveKit or using your own agent dispatch system

> **Info:** **Sandbox ID is optional for most users.** It's a LiveKit Cloud-specific feature for automatic agent dispatch. If you're running your own LiveKit server or managing agent dispatch yourself, you don't need this field.

### LiveKit Agent Name

- **Field**: `livekit_agent_name`
- **Type**: String (optional)
- **Purpose**: Name identifier for the LiveKit agent
- **Format**: String identifier
- **Example**: `"voice-assistant"`, `"video-agent"`

### LiveKit Agent Metadata

- **Field**: `livekit_agent_metadata`
- **Type**: String (optional)
- **Purpose**: Additional metadata for the LiveKit agent
- **Format**: String or JSON string
- **Example**: `{"role": "assistant", "capabilities": ["voice", "video"]}`

### Custom Payload Fields

- **Field**: `token_request_payload`
- **Type**: String (optional)
- **Purpose**: Additional fields to include in the token request
- **Format**: Valid JSON string
- **Example**: `{"agent_variant": "sales", "language": "en"}`

These fields are merged with `room_name` and `participant_name` when calling your token endpoint.

## Setup Instructions

1. Set up a token generation endpoint that accepts POST requests
2. Configure your endpoint to return tokens with the correct room permissions
3. Enter your LiveKit server WebSocket URL
4. (Optional) Add authentication headers if your endpoint requires them
5. Test the connection by launching a simulation

## Technical Details

### Token Generation Flow

1. Coval generates a unique room name (UUID) for each simulation
2. Coval sends a POST request to your token endpoint with the room name
3. Your endpoint generates a LiveKit JWT token with room access permissions
4. Your endpoint should also dispatch your agent to join the same room
5. Coval joins the room using the returned token
6. Coval waits for your agent to join (`on_first_participant_joined` event)
7. Conversation simulation begins

### Accepted Response Field Names

Coval accepts multiple field name variations for flexibility:

| Data | Accepted Field Names |
|------|---------------------|
| Token | `token`, `participantToken`, `accessToken`, `participant_token`, `access_token` |
| Server URL | `serverUrl`, `server_url` |
| Room Name | `roomName`, `room_name` |

## Troubleshooting

### Common Issues

**Token Generation Failures**

- Check endpoint accessibility and authentication
- Verify your endpoint returns valid JSON with a `token` field
- Ensure HTTPS is properly configured

**WebSocket Connection Errors**

- Verify LiveKit server URL starts with `wss://`
- Check that `serverUrl` is included in your token response
- Confirm your LiveKit server is running and accessible

**Agent Not Joining Room**

- Ensure your agent dispatch system receives the `room_name` from token requests
- Verify your agent is running and connected to LiveKit
- Check that the token grants access to the correct room

**Simulation Timeouts**

- Coval waits for your agent to join before starting
- If your agent doesn't join, the simulation will time out
- Check your agent logs for connection errors

**"No token found in response" Error**

- Verify your response includes a recognized token field name
- Check that the token value is a non-empty string
- Ensure response Content-Type is `application/json`

### Running Components Locally

Coval's servers need to reach your **token endpoint** and **LiveKit server**
to run simulations. Here's what needs to be publicly accessible:

| Component | Must be public? | Why |
|-----------|-----------------|-----|
| Token endpoint | Yes | Coval calls it to get access tokens |
| LiveKit server | Yes | Coval connects via WebSocket |
| Your LiveKit agent | No | Only needs outbound connection to LiveKit |

**If running your token server locally:**

Use a tunneling service like [ngrok](https://ngrok.com) to expose it:

```bash
ngrok http 8888
# Use the generated https:// URL as your token endpoint
```

> **Note:** Your agent can run on your local machine without any tunneling; it just connects outbound to the LiveKit server like any other client.

---

## Personas

Source: https://docs.coval.ai/concepts/personas/overview

Generate realistic simulated personas that reflect your end-users.

Personas define the characteristics of the simulated user interacting with your agent. Configure voice, accent, behavior, and more to match your real user base.

## Creating a Persona

1. Navigate to the Personas section
2. Click "Create New Persona"
3. Configure the persona settings

![Persona Configuration](/images/personas/persona-config-dec-25.png)

## Configuration Options

### Avatar

Customize the persona's visual representation:

- Select from various hair, eye, and lip styles
- Regenerate avatar seed for a new base face

### Persona Label

Display name for the persona (required).

### Persona Characteristics

Define the persona's demographics, personality, and communication style (required). Use the expand button (Shift+E) for a full-screen editor.

### Voice Configuration

| Setting | Description |
|---------|-------------|
| **Voice** | Select from available voices with vocal presentation and accent metadata. Preview available for each voice. |
| **Voice Pitch** | Filter voices by low, medium, or high curated pitch profile. This selects a different voice from the catalog; it does not apply a pitch-shifting effect. |
| **Language & Accent** | Choose language and regional accent. Supported languages vary by voice and include English, Spanish, French, French (Canada), German, Hindi, Korean, Arabic, Hebrew, Bulgarian, Danish, Lithuanian, and Polish. Arabic and Hebrew transcripts render in their native script with right-to-left alignment. |
| **Background Noise** | Add ambient noise to simulate real-world calling environments. Volume is adjustable with a slider. |

> **Info:** **Voice Pitch profiles:** Voice Pitch is available for voices that include curated pitch variants. The filter narrows the voice selection to those variants; it is not a pitch-shifting effect applied after synthesis, so changing the pitch profile selects a different underlying voice.

**Available Voices**

Voices fall into two categories based on realism and concurrency. Higher-realism voices sound more natural and expressive but have a concurrency limit of approximately 12 simultaneous connections. For high-volume simulation runs, use higher-concurrency voices to avoid bottlenecks.
**Higher Concurrency (34 voices)** | Voice | Accent / Language | Presentation | |-------|-------------------|--------------| | Aria | Multiple | Feminine | | Ashwin | Multiple | Masculine | | Autumn | Multiple | Feminine | | Brynn | Multiple | Feminine | | Callum | Multiple | Masculine | | Caspian | Multiple | Masculine | | Corwin | Multiple | Masculine | | Darrow | Multiple | Masculine | | Delphine | Multiple | Feminine | | Dorian | Multiple | Masculine | | Elara | Multiple | Feminine | | Kieran | Multiple | Masculine | | Lysander | Multiple | Masculine | | Marina | Multiple | Feminine | | Naveen | Multiple | Masculine | | Orion | Multiple | Masculine | | Rowan | Multiple | Masculine | | Emma | American | Feminine | | Kent | American | Masculine | | Sydney | American | Feminine | | John | American | Masculine | | Eva | British | Feminine | | Jack | British | Masculine | | Zoey | American | Feminine | | Harper | American | Feminine | | Riley | American | Feminine | | Skyler | American | Feminine | | Quinn | American | Feminine | | Lukas | Eastern European | Masculine | | Soren | Multiple | Masculine | | Skye | Multiple | Feminine | | Vera | Multiple | Feminine | | Isabella | Latin America | Feminine | | Alexei | Eastern European | Masculine | **Higher Realism (35 voices) — Limited Concurrency** | Voice | Accent / Language | Presentation | |-------|-------------------|--------------| | Alejandro | Latin America | Masculine | | Angela | American | Feminine | | Erika | Latin America | Feminine | | Harry | Multilingual | Masculine | | Amir | Arabic | Masculine | | Layla | Arabic | Feminine | | Noa | Hebrew | Feminine | | Yossi | Hebrew | Masculine | | Mark | American | Masculine | | Monika | Indian | Feminine | | Raju | Indian | Masculine | | Kehinde | Nigerian | Feminine | | Victor | Nigerian | Masculine | | Luis | Spanish-Accented | Masculine | | Lisa | Spanish-Accented | Feminine | | Leo | Spanish-Accented | Masculine | | Martin | Spanish-Accented | Masculine | | 
Marshal | German-Accented | Masculine | | Raven | German-Accented | Feminine | | Puthina | Malay | Feminine | | Darryl | Malay | Masculine | | Juvy | Filipino | Feminine | | Pedro | Filipino | Masculine | | Rachel | Filipino | Feminine | | Burak | Turkish-Accented | Masculine | | Walker | US Southern | Feminine | | Cletus | US Southern | Masculine | | Bubba | US Southern | Masculine | | Jay | Chinese-Accented | Masculine | | Ziyu | Chinese-Accented | Feminine | | Isla | Scottish | Feminine | | Chris | Scottish | Masculine | | Alok | Indian | Masculine | | Mani | Indian | Masculine | | Vidya | Indian | Feminine | > **Warning:** **Concurrency limit:** Higher-realism voices support a maximum of approximately 12 simultaneous connections. If you are running a large volume of simulations, these voices can become a bottleneck. Use higher-concurrency voices for high-volume runs. **Available Background Sounds** | Sound | Description | |-------|-------------| | **Off** | No background noise (default) | | **Office** | Office ambience | | **Lounge** | People in a lounge | | **Crowd Talking** | Crowd conversation noise | | **Airport Boarding** | Airport boarding announcements | | **Bus Interior** | Inside a bus | | **Kids Playing** | Playground sounds | | **Doorbell** | Doorbell ringing | | **Train Arrival** | Train station arrival sounds | | **Portable AC** | Air conditioner hum | | **Skatepark** | Skatepark ambience | | **Small Dog Bark** | Small dog barking | | **Cafe** | Cafe ambience | | **Ferry Announcement** | Ferry and PA announcements | | **Heavy Rain** | Heavy rainfall | | **Moderate Wind** | Wind sounds | | **Newborn Baby Crying** | Baby crying | | **Office with Alarm** | Office with alarm going off | | **Street with Sirens** | Street traffic with sirens | | **Construction Work** | Construction site noise | ### Conversation Initiator | Option | Behavior | |--------|----------| | **Persona waits to speak** | Waits for the agent to speak first. 
| | **Persona speaks first** | Persona initiates the conversation. | ### Interruption Rate Controls how often the persona proactively interrupts the agent during a conversation. This simulates impatient or talkative callers who don't wait for the agent to finish speaking. | Option | Behavior | |--------|----------| | **None** | The persona never proactively interrupts the agent (default). | | **Low** | The persona occasionally interrupts (roughly every 90 seconds). | | **Medium** | The persona interrupts at moderate frequency (roughly every 45 seconds). | | **High** | The persona frequently interrupts (roughly every 30 seconds). | > **Info:** **Note on natural turn-taking:** Even with Interruption Rate set to None, you may observe occasional overlapping speech between the persona and agent. This is expected behavior caused by natural voice conversation turn-taking, where the speech-to-text engine detects a pause in the agent's speech and the persona begins responding before the agent has fully finished. This is distinct from proactive interruptions and reflects realistic phone conversation dynamics. > > To minimize this, add instructions in your persona prompt like: "Always wait for the agent to completely finish speaking before responding." > > See [Interruption Behavior](#interruption-behavior) below for more details. ### Multi-Language STT Enable multilingual speech recognition so the persona can accurately hear and respond to agents that speak multiple languages in the same conversation (e.g. "For English press one, Para español presione dos"). Found under **Advanced** in the persona configuration modal. | Setting | Description | |---------|-------------| | **Off** (default) | Speech recognition is set to the persona's configured language for best single-language accuracy. | | **On** | Speech recognition accepts all supported languages simultaneously. Supports English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. 
| > **Tip:** If your agent starts with a multi-language greeting or IVR menu, create a persona with Multi-Language STT enabled. You can clone an existing persona and toggle it on — this lets you test the same agent with both single-language and multi-language speech recognition. ### Hold Music Timeout Configure the persona to disconnect after a period of silence or hold music. When enabled, the simulation ends as soon as the configured number of seconds pass with no speech activity, rather than waiting for the default timeout cycle. Found under **Advanced** in the persona configuration modal. | Setting | Description | |---------|-------------| | **Off** (default) | The simulation uses the standard timeout behavior. | | **On** | The simulation disconnects after the specified number of seconds (5–300) of no speech activity. | This is useful for testing scenarios where your agent transfers the caller to a hold queue or live agent. Instead of the persona waiting through several minutes of hold music, it disconnects promptly after the configured timeout. > **Tip:** For testing live agent transfer flows, set the hold music timeout to 10–15 seconds. This lets the simulation confirm the transfer happened without waiting through extended hold music. ### Silent Mode When enabled, the persona remains completely silent throughout the conversation. The persona will not respond to anything the agent says. This is useful for testing how your agent handles unresponsive callers, dead air, or scenarios where the caller has put their phone down. When silent mode is enabled, all other behavioral settings (background sound, interruption rate, conversation initiator) are automatically disabled. ### Caller Phone Number Configure phone number routing for voice simulations. See phone number mappings below. > **Info:** **Caller Phone Number** for Voice Simulations: > > Coval uses different phone numbers depending on the simulation type. 
Assign a specific phone number index to a persona if your workflow depends on phone number routing. > > ## Inbound Voice Simulations > > For **inbound** simulations (Coval calls your agent), assign up to 29 phone numbers to a persona. > > **View Available Inbound Phone Number Mappings** > > | Index | Phone Number | > |-------|--------------| > | 1 | +16504471573 | > | 2 | +16506400392 | > | 3 | +16506329775 | > | 4 | +16505360811 | > | 5 | +16505360576 | > | 6 | +15418450089 | > | 7 | +15412194880 | > | 8 | +14157181081 | > | 9 | +14157180538 | > | 10 | +14157180269 | > | 11 | +14153765034 | > | 12 | +14069058267 | > | 13 | +14066920094 | > | 14 | +14064159042 | > | 15 | +14063022479 | > | 16 | +14063022353 | > | 17 | +17182801764 | > | 18 | +17182858503 | > | 19 | +17182859858 | > | 20 | +17183051836 | > | 21 | +17187195385 | > | 22 | +17187195407 | > | 23 | +15342172296 | > | 24 | +15342172366 | > | 25 | +15342172371 | > | 26 | +15342172387 | > | 27 | +19855295712 | > | 28 | +19858539008 | > | 29 | +19858539188 | > > > > ## Outbound Voice Simulations > > For **outbound** simulations (your agent calls Coval's simulated user), select a phone number for the persona to receive calls on. > > **View Available Outbound Phone Number Mappings** > > | Index | Phone Number | > |-------|--------------| > | 1 | +14158734019 | > | 2 | +17199853850 | > | 3 | +17199853656 | > | 4 | +17196219208 | > | 5 | +17194630332 | > | 6 | +17194630202 | > | 7 | +17194630116 | > | 8 | +17194510465 | > | 9 | +16309315617 | > | 10 | +16309190593 | > | 11 | +16306014871 | > | 12 | +16305857118 | > | 13 | +16305526080 | > | 14 | +16305222063 | > | 15 | +16304468895 | > | 16 | +12624037199 | > | 17 | +12623988133 | > | 18 | +12622149045 | > > ## Advanced Configuration ### Emotional Voice Simulation Emotional tone in voice simulations is controlled through the **Persona Characteristics** prompt. 
These instructions guide the persona's dialogue generation, shaping word choice, sentence structure, and phrasing to convey emotion. The text-to-speech engine then speaks that text. > **Info:** **How emotion works in voice simulations:** The persona prompt controls what *text* the persona generates, not the voice itself. For example, instructing the persona to "be impatient" results in shorter sentences, more direct language, and frustrated phrasing. The TTS engine does not have direct emotion controls — it speaks whatever text the persona produces. Emotional impact comes from the words and sentence structure, not from changes in vocal tone or volume. #### Best Practices for Emotional Personas **Be specific and descriptive.** Instead of generic labels, describe the emotional behavior in terms of word choice and conversational patterns: ``` // Less effective You are an angry customer. // More effective You are extremely frustrated and losing patience. You use short, clipped sentences. When you have to repeat information you've already provided, you use phrases like "I already told you this" and "This is unacceptable." Your language becomes sharper and more aggressive as the conversation goes on if the agent cannot resolve your issue quickly. ``` **Use punctuation to signal emotion.** The text-to-speech engine interprets punctuation as speech cues: - Exclamation marks (`!`) convey urgency or emphasis - Commas create natural pauses and hesitation - Dashes (`-`) create brief breaks - Short sentences convey impatience or stress - Question marks with exclamation marks (`?!`) convey disbelief > **Warning:** Avoid using ellipses (`...`) for pauses. Some TTS engines read them aloud as "dot dot dot." Use commas or dashes instead to create natural pauses. **Include emotional progression.** Real callers escalate or de-escalate: ``` You start the call calmly but become increasingly frustrated if the agent asks you to repeat information or puts you on hold. 
If the agent resolves your issue, your tone should soften. If the agent is dismissive, you become more insistent and demand to speak to a supervisor. ``` #### Voice Selection for Emotional Scenarios Higher-realism voices generally produce better emotional expressiveness. If emotional nuance is important for your test scenarios, consider selecting a higher-realism voice for your persona. > **Warning:** Higher-realism voices have a concurrency limit of approximately 12 simultaneous connections. For high-volume simulation runs where emotional expressiveness is less critical, use higher-concurrency voices to avoid bottlenecks. #### Example Emotional Personas **Stressed Customer** ``` You are a customer under significant time pressure. You are calling during your lunch break and need this resolved quickly. Keep your responses very short and direct. You mention the time frequently and say things like "Can we speed this up?" and "I really don't have much time." If the agent asks unnecessary questions, respond with impatience: "Is that really necessary right now?" ``` **Impatient Elderly Caller** ``` You are an older adult who is not comfortable with technology. You are calling because you cannot figure out the website. You are somewhat impatient and repeat yourself when you feel you're not being understood. You become flustered when given too many steps at once and say things like "That's too many things at once" or "Can you just do it for me?" You occasionally go on brief tangents about how things used to be simpler. ``` **Upset but Polite** ``` You are disappointed with the service you received but remain polite throughout the call. You express frustration through pointed questions rather than harsh language. Use phrases like "I'm quite disappointed" and "I was really expecting better." You give the agent a fair chance to resolve the issue but make it clear that your patience has limits. 
``` ### Filler Words and TTS Behavior When configuring personas to use filler words like "um", "uh", or "hmm", the way these words are written in the persona's speech directly affects how the text-to-speech engine pronounces them. Text-to-speech engines process text literally. Unusual spellings or excessive repeated letters can cause the engine to spell out letters individually, read punctuation marks aloud, or mispronounce unfamiliar character sequences. #### TTS-Friendly Filler Words Use these standard spellings, which are recognized by text-to-speech engines: | Use This | Avoid This | Why | |----------|-----------|-----| | `um` | `ummm`, `ummmm` | Extra letters may be spelled out | | `uh` | `uhhh`, `uhhhhh` | Extra letters may be spelled out | | `hmm` | `hmmmmm`, `hmmmmmm` | Extra letters may be spelled out | | `oh` | `ohhh`, `ohhhh` | Extra letters may be spelled out | | `ah` | `ahhh`, `ahhhh` | Extra letters may be spelled out | | `well,` | `well...` | Ellipses may be read as "dot dot dot" | | `so,` | `so...` | Ellipses may be read as "dot dot dot" | | `you know,` | `you know...` | Ellipses may be read as "dot dot dot" | #### Recommended Persona Prompt for Filler Words Include explicit TTS-friendly instructions in your persona prompt: ``` You use natural filler words in conversation. When hesitating, use only these words: "um", "uh", "hmm", "oh", "well". Write them as single short words. Use commas for pauses instead of ellipses or repeated letters. Example: "Um, I think the order number is, uh, let me check, it's 12345." ``` ### Conversation Triggers You may want the persona to remain silent until the agent says a specific word or phrase, such as waiting for a greeting before starting to speak. The persona's behavior is driven by the instructions in the persona prompt. You can instruct the persona to wait for specific phrases, but because the underlying language model is probabilistic, adherence is not 100% deterministic. 
#### Maximizing Trigger Reliability 1. **Set the Conversation Initiator** to "Persona waits to speak" so the agent always speaks first. 2. **Use strong, repeated language** in the persona prompt: ``` CRITICAL INSTRUCTION: You MUST remain completely silent until the agent says "How can I help you today?" Do not speak. Do not respond to any other greeting or introduction. Wait specifically for the phrase "How can I help you today?" before saying anything. Any other phrase like "How may I assist you?" or "What can I do for you?" should NOT trigger your response. Continue waiting silently. ``` 3. **Keep the trigger phrase simple and distinctive.** Shorter, more common phrases are easier for the persona to reliably detect. 4. **Include fallback behavior** for cases where the exact phrase doesn't appear: ``` If the agent does not say "How can I help you today?" within the first 30 seconds, you may begin speaking with your objective. This prevents the conversation from stalling entirely. ``` > **Warning:** Conversation triggers provide high but not perfect consistency. For mission-critical trigger behavior, consider running multiple simulations to account for natural variation. ### Interruption Behavior Voice simulations involve two distinct types of interruption behavior: #### Proactive Interruptions The **Interruption Rate** setting (None, Low, Medium, High) controls whether the persona deliberately interrupts the agent on a timer. When set to None, the persona never proactively talks over the agent. | Setting | Behavior | |---------|----------| | **None** | No proactive interruptions | | **Low** | Interrupts approximately every 90 seconds | | **Medium** | Interrupts approximately every 45 seconds | | **High** | Interrupts approximately every 30 seconds | #### Natural Turn-Taking Overlap Even with Interruption Rate set to None, you may observe the persona starting to speak while the agent is still talking. 
This is caused by **natural voice turn-taking**, not proactive interruptions. In real phone conversations, speakers rely on pauses, intonation changes, and context to determine when the other person has finished speaking. The simulation's speech-to-text engine detects pauses in the agent's speech and may interpret a brief pause as end-of-turn, causing the persona to begin responding before the agent has fully finished. This behavior is realistic and expected in voice simulation testing, as it mirrors how real callers sometimes talk over agents. #### Reducing Turn-Taking Overlap If you need the persona to be more patient and avoid any overlap: 1. **Add explicit waiting instructions** to the persona prompt: ``` Always wait for the agent to completely finish their thought before responding. Take a brief pause after the agent stops speaking before you begin your response. If you hear the agent start speaking again, stop immediately and let them finish. ``` 2. **Use longer, more deliberate speech patterns** in the persona characteristics to naturally slow the response: ``` You are a thoughtful, patient caller who considers the agent's words carefully before responding. You take a moment to think before speaking. ``` ## Personas vs. Test Sets Personas and test sets serve distinct purposes and work together in simulations. ### Personas: Define HOW to Behave Personas establish behavioral traits applied across multiple test sets: - "You are polite and friendly, respond in short sentences." - "You speak slowly with natural pauses like 'uhh' and 'umm'." - "You are impatient and frequently interrupt." ### Test Sets: Define WHAT to Do Test sets contain specific instructions for the conversation: - "Call to get a refund for order #12345" - "Ask for PTO from March 21st to 22nd" - "Inquire about account balance" ### Why Keep Them Separate? **Reusability**: Apply one persona to multiple test sets, or test one scenario with multiple personas. 
**Comparison Testing**: Run the same test set across different personas to evaluate agent handling of various user types. **Easier Maintenance**: Update behavioral traits in one place without affecting test scenarios. ## Best Practices **Recommended:** ``` Persona: "You are a friendly customer who speaks in short sentences." Test Set: "Call to cancel your subscription." ``` **Avoid mixing behavioral traits with task instructions:** ``` Test Set: "You are a rude customer who wants to cancel subscription and argues about fees." ``` ## Custom Persona Prompts Include in your custom persona prompt: - **DTMF/IVR handling**: Navigation instructions for phone menus - **Speech style**: Filler words, response patterns - **Information flow**: When to provide or withhold information - **Call ending triggers**: Conditions for hanging up ### Voice Persona Example ``` You are a customer calling support. - WAIT for all options before proceeding - Use dtmf tool to select menu options - Remain silent during IVR navigation - Natural responses with occasional pauses - Only respond when directly asked Hang up if transferred to a human agent. ``` ### Chat Persona Example ``` - Wait for the automated greeting before typing - Respond naturally to prompts - Natural chat language with occasional typos - Concise responses unless asked for details ``` ## Template Strategy 1. Create persona variations for different user types 2. Create focused test sets for specific workflows 3. Combine in templates for comprehensive testing 4. Analyze results across user personalities > **Tip:** Start with 2-3 core personas (Polite Customer, Impatient Customer, Technical Customer) and build test sets around common workflows. --- ## Test Sets Source: https://docs.coval.ai/concepts/test-sets/overview Tell our simulated users what to do, say, and how to behave. A **Test Set** is a structured collection of **test cases** designed to evaluate specific functionalities, workflows, or scenarios in your project. 
Each test set can contain multiple test cases, and simulations/evaluations will analyze the aggregate results of all test cases within the set. # How to Generate a Test Set ## Quick Start 1. **Enter your test scenario** in the input box. 2. **(Optional) Add extra context**: - Attach files (such as text, JSON, or markdown) - Choose an agent to evaluate - Pick a relevant category from those suggested 3. **(Optional) Add metadata**: - Define metadata fields to extract from each test case. - Example: key: "ticket_number", description: "X-###" will generate entries like "X-001" per test case - Example: key: "destination", description: "enter a possible airport code the user is flying to" will generate entries like "SFO" 4. **Submit** using the arrow button or by pressing Enter. 5. **Review and modify** your test set in the test set editor. ## Alternative Options - **Upload from file**: Use "Upload from file" to import CSV/Excel test cases - **Manual mode**: Use "Use manual creation mode" to create a blank test set and add cases yourself > **Tip:** **Tips for better test cases:** > - Be specific in your description for better test cases > - Attaching agent prompts or documentation helps generate more relevant tests > - You can edit, add, or remove test cases after generation ## Uploading from CSV/Excel Import test cases in bulk by uploading a properly formatted CSV or Excel file. ### Column Structure The test case input or prompt. This column is case-insensitive and must be present in your file. Expected behaviors for the test case. Parsing rules (applied during test-set ingest/validation): - **JSON array**: `["behavior1", "behavior2"]` - parsed as an array of behaviors - **Comma-separated string**: `"behavior1,behavior2"` - split by comma into multiple behaviors - **Single string**: `"behavior1"` - treated as a single behavior string Test case type. Case-insensitive. 
Accepts: - `SCENARIO` - `TRANSCRIPT` JSON object containing test case metadata Agent IDs to associate with the test set. Test-set level (applies to all test cases): - **JSON array of strings**: `["agent-id-1", "agent-id-2"]` - parsed as an array of agent ID strings - **Comma-separated string**: `"agent-id-1,agent-id-2"` - Values are trimmed and empty values are filtered out - Uses the first non-empty value found in the file (since `agent_ids` applies to the whole test-set) Knowledge base entries to attach to test cases: - **JSON array of objects**: `[{"id": "entry-id-1", "type": "web_url"}, {"id": "entry-id-2"}]` - each object can have `id` (required) and `type` (optional) - **JSON array of strings**: `["entry-id-1", "entry-id-2"]` - treated as entry IDs with default type - **Comma-separated string**: `"entry-id-1:web_url,entry-id-2,entry-id-3:pdf"` - Each object can be formatted as 'id:type' or just 'id' with a default type - **Single string**: `"entry-id-1"` - treated as an entry ID with a default type - Accepts: - `web_url` (default) - `plain_text` - `json` - `zendesk` - `shelf` - `file` Any additional column headers will automatically be treated as metadata fields ### File Requirements Your file must meet the following criteria: - Accepted formats: `.csv` or `.xlsx` - Maximum file size: 10MB - First row: Must contain column headers (case-insensitive) - Empty rows: Automatically skipped during import - Validation: Rows with empty input values are filtered out > **Warning:** Ensure your file doesn't exceed 10MB and contains at least one row with a valid `input` value. 
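
The parsing rules for the expected-behaviors and knowledge-base columns can be sketched in Python as follows. This is a minimal illustration, not Coval's ingest code: the function names `parse_expected_behaviors` and `parse_knowledge_base` are hypothetical, and trimming whitespace and dropping empty values for these two columns is an assumption borrowed from the documented `agent_ids` behavior.

```python
import json

# Documented default type for knowledge base entries.
DEFAULT_KB_TYPE = "web_url"


def _try_json_list(text):
    """Return the parsed list if `text` is a JSON array, otherwise None."""
    if not text.startswith("["):
        return None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


def parse_expected_behaviors(cell):
    """Hypothetical helper: JSON array, comma-separated string, or single string."""
    text = cell.strip()
    if not text:
        return []
    parsed = _try_json_list(text)
    if parsed is not None:
        return [str(item).strip() for item in parsed if str(item).strip()]
    # A string without commas naturally yields a single behavior.
    return [part.strip() for part in text.split(",") if part.strip()]


def parse_knowledge_base(cell):
    """Hypothetical helper: returns a list of {"id", "type"} dictionaries."""
    text = cell.strip()
    if not text:
        return []
    parsed = _try_json_list(text)
    if parsed is not None:
        entries = []
        for item in parsed:
            if isinstance(item, dict):
                # `id` is required, `type` is optional.
                entries.append({"id": item["id"], "type": item.get("type", DEFAULT_KB_TYPE)})
            else:
                entries.append({"id": str(item), "type": DEFAULT_KB_TYPE})
        return entries
    entries = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        # Each item is either 'id:type' or just 'id'.
        entry_id, _, entry_type = part.partition(":")
        entries.append({"id": entry_id, "type": entry_type or DEFAULT_KB_TYPE})
    return entries
```

Running a few cells from your own file through helpers like these before uploading can help catch malformed JSON arrays, which in this sketch silently fall back to comma splitting.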
# Understanding Test Cases ## Test Case Input Each test case uses one of three input types that determine how the simulated user behaves during a run: | Type | What it is | Simulated user behavior | |------|-----------|------------------------| | **Scenario** | High-level intent | Improvises freely toward the goal | | **Transcript** | A reference conversation | Adapts as needed to match the flow | | **Script** | Exact turns | Follows them precisely, word for word | ### 1. Scenarios Define specific tasks or behaviors for your simulated user. Use quotation marks for exact phrases you want them to say. Examples: - Simple task: "Call to get a refund" - Complex scenario: "First, ask for PTO from the 21st to the 22nd of March. After receiving a confirmation, ask to change to the 20th to 22nd. During the verification, share your email address as 'emily [at] gmail [dot] com'. Then, proceed to correct yourself with 'oh no - it's actually emily [dot] marc [at] gmail [dot] com'." The more detailed your scenario, the more precisely our simulated user will follow it. ### 2. Transcript Recreate specific conversations using the OpenAI transcript format. The simulated user will follow the user's part of the transcript as closely as possible. Format example: ```json [ { "role": "assistant", "content": "Welcome to X Restaurant. How may I assist you today?" }, { "role": "user", "content": "I would like to order some pizza." } ] ``` ### 3. Audio Upload Upload a pre-recorded audio file containing the persona's side of the conversation (right channel) to use during a voice simulation. Instead of the persona generating responses with an LLM and TTS, the uploaded audio plays back exactly as recorded, making the test fully deterministic. Supported formats: `.wav`, `.mp3` (max 200 MB, duration 5 seconds – 1 hour). **How it works:** 1. In the test set editor, select **Audio** as the input type and upload your audio file containing the persona's speech (right channel only) 2.
The file is played back during simulation in place of LLM-generated persona speech 3. The uploaded audio is automatically transcribed so persona turns still appear in the transcript 4. After the audio finishes playing, the simulation waits 30 seconds for the agent to finish responding, then ends the call > **Tip:** Audio upload test cases are ideal for regression testing — record a specific caller interaction once, then replay it across agent updates to detect regressions in handling. #### Ground Truth Transcript To measure your agent's STT accuracy against a known-correct transcript of the uploaded audio, you can provide a ground truth transcript in two ways: **Via the UI** — when uploading an audio file, the modal includes a ground truth transcript field where you can either paste the transcript as plain text or upload a `.txt` or `.json` file. **Via metadata** — add a `ground_truth_transcript` key to the test case metadata directly. Either method enables the [STT Word Error Rate (Audio Upload)](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) metric, which compares your agent's speech-to-text output against this reference text. The ground truth can be plain text, labeled text with timestamps and role labels, or a JSON object with a `messages` array. ### 4. Script Define an ordered list of exact lines for the persona to deliver, turn by turn. The persona follows the script exactly rather than generating responses with an LLM — while still using the configured persona voice and background sounds. Example script turns: 1. "Hi, I'd like to check my account balance." 2. "Yes, my account number is 12345." 3. "Thank you, goodbye." **How it works:** 1. In the test set editor, select **Script** as the input type 2. Add ordered turn texts in the script editor (each turn is one persona utterance) 3. During simulation, the persona delivers each line in order instead of generating LLM responses 4. 
A divergence detector monitors agent responses — if the agent diverges significantly from the expected flow, the simulation can end early with a `SCRIPT_DIVERGED` reason 5. After the last scripted turn is delivered, the agent gets one final response before the simulation ends with a `SCRIPT_COMPLETED` reason > **Tip:** Script test cases give you deterministic persona speech output while still exercising the full voice pipeline (TTS, turn-taking, background noise). Use them when you need control over exactly what the persona says but still want realistic audio delivery. ### 5. Image Attachment Attach a single image to a test case so the persona can share it during a WebSocket voice simulation. This is useful for flows like sending a receipt, damage photo, insurance card, or product image after the agent asks for visual proof or context. Supported formats: `.png`, `.jpg`, `.jpeg` (max 2 MB, one image per test case). Before using image attachments, make sure the test set is attached to a **WebSocket voice agent**. The image will only be sent when this test set is used with that agent type. **How it works:** 1. In the test set editor, open a test case and click **Add Media**. 2. Upload a PNG or JPEG image and give it a short **Name** such as `receipt_photo` or `broken_screen`. 3. Optionally add a **Description** telling the persona when the image should be sent. 4. Attach the test set to a **WebSocket voice agent** with a media send template configured. 5. Launch the run using that attached WebSocket voice agent. 6. During the conversation, Coval can send the image when the agent asks for relevant visual information. > **Info:** Image attachments augment a normal test case input rather than replacing it. You still define the scenario, transcript, script, or audio flow as usual, and the image is available as an additional artifact the persona can send when needed. 
> **Warning:** Image attachments currently work only when the test set is attached to a WebSocket voice agent and used in a voice simulation. Other simulator types do not send attached images. **Best practices:** - Use short, stable names like `receipt_photo` or `drivers_license_front`. - Use the description to explain when to send the image, not just what the file contains. - Keep the image tightly scoped to the task so the agent receives only the evidence it needs. ## Test-Case Specific Evaluation Expected Behavior and Metadata let you use test-case-specific data to evaluate how the agent responds to a specific test case. ## Test Case Expected Behavior The expected behavior describes how your agent should respond to the user's requests. Example: - "the agent should ask the user for their phone number" - "the agent should repeat the phone number back to the user" Use the **Composite Evaluation** metric to evaluate whether the agent followed the expected behaviors. Configure it with **From Test Case** as the criteria source to automatically pull behaviors from each test case. With **Percentage of Criteria Met** reporting, the example above would return 0.5 if the agent asks for the phone number but does not repeat it back. ## Test Case Metadata These fields store specific metadata about a test case. This is helpful when you want to create a metric that references a specific aspect of the test case. You can enter metadata as key/value pairs or as JSON. Example: Imagine an airline help desk where the test case contains this metadata: ```json { "source": "LAX", "destination": "SFO" } ``` Then you can write, for example, a binary Destination Identification Metric with the question: Did the agent correctly identify the destination as: `{{test_case.destination}}`?
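
To make the two ideas above concrete, the sketch below shows how a `{{test_case.<key>}}` placeholder could be filled from test case metadata and how a Percentage of Criteria Met value of 0.5 arises. It is an illustrative approximation: `render_metric_prompt` and `percentage_of_criteria_met` are hypothetical helper names, and Coval's actual templating and scoring code is not shown in this documentation.

```python
import re

# Matches placeholders such as {{test_case.destination}}.
PLACEHOLDER = re.compile(r"\{\{\s*test_case\.(\w+)\s*\}\}")


def render_metric_prompt(template, metadata):
    """Hypothetical helper: substitute test case metadata into a metric prompt."""

    def substitute(match):
        key = match.group(1)
        # Assumption: unknown keys are left untouched rather than raising.
        return str(metadata.get(key, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


def percentage_of_criteria_met(results):
    """Hypothetical helper: fraction of expected behaviors judged as met."""
    if not results:
        return 0.0
    return sum(1 for met in results if met) / len(results)


metadata = {"source": "LAX", "destination": "SFO"}
question = "Did the agent correctly identify the destination as: {{test_case.destination}}?"
rendered = render_metric_prompt(question, metadata)

# The agent asked for the phone number (met) but did not repeat it back (not met).
score = percentage_of_criteria_met([True, False])
```

Here `rendered` ends with "as: SFO?" and `score` is 0.5, matching the phone-number example in the Expected Behavior section.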
## Recommended Test Set Types > **Info:** For comprehensive testing, create multiple types of test sets: > > - **Regression Set**: Contains "happy path" scenarios representing typical successful interactions > - **Adversarial Set**: Contains edge cases and scenarios designed to test your agent's limits and handling of unusual requests ## Utilizing Agent Attributes In your agents, you can set specific attributes associated with that agent. You can embed these agent attributes into your scenarios with this format: `{{agent.attribute_name}}` Example: Imagine one agent has the attribute **location** with a value "San Francisco", and another agent has the value "London". Embed those agent attributes in your scenarios and expected behaviors like this: Scenario: You are a user calling for travel recommendations in `{{agent.location}}` Expected Behavior: The agent should only give travel recommendations in `{{agent.location}}` ## Test Cases vs. Personas - **Persona**: Defines **how** to behave - Characteristics (friendly, angry) - Voice configuration - Can be assigned multiple test cases - **Test Case**: Defines **what** to do - Specific tasks or scenarios - Can be assigned to any persona --- ## Metrics Source: https://docs.coval.ai/concepts/metrics/overview Understand and analyze your AI agents' performance with Coval's comprehensive metrics ## What is a metric? Metrics give you quantitative insights into your agent's performance, allowing you to see red flags early and understand overall trends. Each metric assesses your agent in a different way. **Audio** metrics use recordings, either simulated or live, to detect interruptions, measure phonemes per second, assess latency, and more. **LLM Judge** metrics provide answers to specific questions you have about your transcripts, allowing you to dial in on your unique specifications. 
LLM Judge metrics can optionally include **Trace Context**: when enabled, the judge automatically receives a summary of the agent's OpenTelemetry spans alongside the transcript, enabling evaluation of tool usage, execution order, and behavior that isn't visible in the transcript alone. Other offerings include **Sentiment Analysis**, **Regex Matching**, and many more. While Coval provides built-in metrics (latency, accuracy, tool-call effectiveness, instruction compliance), you can create custom metrics tailored to your specific needs. All out-of-the-box metrics are marked as “Built-in” in your Metrics list. These metrics can be applied to Simulated Conversations as well as Live-Monitoring Conversations. ## Recommended Metrics If you want to begin with built-in metrics, these are the ones we usually recommend: - **Conversational LLM Judge (Binary) Metrics:** - Composite Evaluation - Agent Repeats Itself - End Reason - **Audio Metrics:** - Latency - Interruption Rate - Speech Tempo - Volume/Pitch Misalignment - **Other:** - Workflow Verification: - You can generate a workflow in the Agent creation flow. This metric then re-traces the workflow in the transcript and detects off-path behavior. ## Advanced Metrics Use these when you need to evaluate specific parts of the conversation: - **Binary Tool Call metrics**: - Check whether your tool calls (functions) were performed correctly - **Audio LLM Judge:** - Ask an LLM Judge a question and, instead of evaluating the transcript, we'll evaluate the audio (e.g. "Did the assistant stutter?"). - **Categorical metrics**: - Define a set of categories/topics to filter your conversations by topic (good for exploratory call analysis) - **Transcript Regex Match:** - A metric that performs regex pattern matching on conversation transcripts. Returns 1 for a match and 0 for no match.
You can filter by speaker role, check only the first or last message, require that a pattern is absent (for compliance rules like “agent must not say X”), and enable case-insensitive matching. Ideal for exact phrase detection, compliance checks, and format validation without needing LLM calls. - **Numerical LLM Judge:** - A metric that uses an LLM judge to evaluate a prompt and output a numerical score. - **Tool Call Latency:** - Used to measure the latency of tool calls. - **Metadata Field Metric (Conversations-only):** - If you send metadata as part of your transcripts to evaluate with Coval, this metric will take the specific metadata field's value and output that result as a metric result. Supports string, float, and boolean field types. - **Custom Trace Metric:** - Extract a value from your agent's OpenTelemetry spans and aggregate it across all matching spans in a simulation. Use this to track custom latency signals, confidence scores, tool call durations, token totals, span counts, error rates, or any other signal your agent emits. See the [Custom Trace Metrics guide](/concepts/metrics/custom-trace-metrics) for details. - **Custom**: If you have your own metrics that you want to upload to the Coval platform to run next to our built-in metrics, let us know. \_Note: This is just an excerpt of Coval’s built-in Metrics. More metrics can be found in the Metrics overview list on the platform. 
\_ ## Guide to Creating Binary LLM Judge Metrics When creating metrics that use an LLM to evaluate performance: - Be precise in your descriptions - Always refer to the agent as "the assistant" for clarity - Provide clear guidance on evaluation criteria > **Info:** **Example: "Avoid Unresponsiveness" Metric:** > > _Given the transcript, did the assistant maintain responsiveness by acknowledging all user inputs and avoiding behaviors that make the user question whether the assistant is still present?_ > > _Return YES if:_ > > _• The assistant responds promptly and appropriately to all user inputs_ > > _• There are no long silences, skipped questions, or ignored user messages_ > > _• The user does not need to ask "Are you still there?" or similar prompts_ > > _• If the assistant is uncertain or processing, it states that clearly (e.g., "Let me check that for you")_ > > _Return NO if:_ > > _• The assistant fails to respond to a user input_ > > _• The user asks "Are you still there?" or expresses concern about being ignored_ > > _• The assistant gets stuck or goes silent without explanation_ ### Improve your Metrics To refine a metric, open it from the metrics list and click “Improve Metric.” Select a test set (must be a transcript—tip: copy/paste a simulated transcript into a new set). You can then iterate on the metric’s formulation and see how often it returns YES vs. NO. This helps reduce noise and non-determinism in LLM-judge metrics. ### Custom Metrics > **Info:** Need custom metrics tailored to your needs? Contact us, and we’ll create them > for you. --- ## Built-in Metrics Source: https://docs.coval.ai/concepts/metrics/built-in-metrics Comprehensive guide to Coval's pre-built metrics for evaluating agent performance # Built-in Metrics Coval provides a comprehensive suite of built-in metrics to evaluate your AI agent's performance across multiple dimensions. 
These metrics are ready to use out of the box and cover audio quality, conversation flow, response timing, and more. ## Audio Quality Audio metrics evaluate the quality and characteristics of speech output from your agents. These are essential for voice-based applications and provide comprehensive analysis of audio fidelity, conversation flow, and speech characteristics. > **Note:** All metrics in this section require audio input to function properly. They will not work with text-only transcripts. ### Background Noise **Purpose**: Measures audio clarity and background noise. **What it measures**: The ratio between speech signal strength and background noise, i.e. the signal-to-noise ratio (SNR). **When to use**: Audio quality assessment, identifying poor recording conditions. **How it works**: Compares the strength of speech signals against background noise by analyzing speech and silent (room tone) segments separately. The metric calculates the ratio between signal and noise power levels, providing both overall and segment-by-segment quality assessments. **How to interpret**: - SNR above 20 dB indicates excellent audio clarity. - SNR between 10 and 20 dB is acceptable for most applications. - SNR below 10 dB may significantly impact speech recognition and comprehension. ### Percent Audio Above 300Hz **Purpose**: Measures the share of voiced audio with fundamental frequency above 300Hz. **What it measures**: Percentage of voiced segments whose fundamental frequency (pitch) exceeds 300Hz. **When to use**: Comparing voice models or speech synthesis output on frequency distribution. **How it works**: Detects voiced segments in the audio and computes the fraction whose fundamental frequency is above 300Hz. > **Tip:** Need a different frequency threshold? > The **Audio Frequency Filter** custom metric lets you set any Hz value instead of the fixed 300Hz and reports the percentage of voiced segments above or below that threshold.
Create one from the metric editor and configure `frequency_threshold`. ### Music Detection **Purpose**: Detects music segments in audio recordings. **What it measures**: Count of music segments detected in the conversation, with timeline data showing when each music segment occurs. **When to use**: - Detecting hold music or queue music during voice calls - Identifying unwanted background music in recordings - Measuring music duration and frequency in customer service scenarios - Debugging audio quality issues related to music interference **How it works**: Analyzes audio to identify non-speech segments and classify them as music. Returns a count of music segments and timeline entries with start/end offsets and duration for each detected music segment. **How to interpret**: - Value equals the count of distinct music segments detected - Higher counts indicate more frequent or longer music interruptions - Timeline subvalues show exact timestamps for each music segment - Useful for identifying when customers are placed on hold with music ## Audio Artifact Detection These metrics detect synthesis and recording artifacts in agent audio output. Where the Audio Quality metrics above measure signal-level properties (noise, frequency, duration), artifact detection looks for discrete failure modes — clipping, signal dropouts, TTS loops, unnatural phoneme stretching, rhythm irregularities, voice identity drift, and anomalous pauses — that degrade perceived naturalness or indicate a broken TTS pipeline. > **Note:** All metrics in this section require audio input. They are designed for voice agents using text-to-speech synthesis and will surface TTS-specific artifacts that general audio quality metrics do not detect. ### Speech Artifact Score **Purpose**: Composite quality score aggregating all artifact analyzers into a single 0–1 signal. 
**What it measures**: A weighted average of per-analyzer severity scores, inverted so that 1.0 = fully clean and 0.0 = severe artifact presence. Displayed with a quality tier badge in the UI. | Score | Tier | |---|---| | ≥ 0.8 | Good | | ≥ 0.6 | Fair | | ≥ 0.4 | Poor | | < 0.4 | Severe | **When to use**: Use this as your primary artifact health signal when you want a single number to track across runs or set an alert threshold on. Drill into the per-analyzer breakdown to identify which failure mode is driving the score. **How it works**: Runs all 8 analyzers and collects a 0–1 severity from each via sigmoid mapping calibrated for telephony audio. Severities are combined via weighted average; if an analyzer fails or is skipped (e.g., insufficient audio length), its weight is redistributed across the remaining analyzers so the score stays meaningful. **How to interpret**: - The per-analyzer breakdown is split into three sections: - **Quality** — Voice Quality shows a quality score (1.0 = natural, 0.0 = degraded) - **Severity** — Clipping, Dropout, Timbre Drift show severity (0.0 = clean, 1.0 = severe) - **Diagnostic Measurements** — Syllable Rate, Loop Detection, Phoneme Stretch, Pause Detection show severity scores - Waveform regions are color-coded by source: purple for signal artifacts (clipping, dropout), blue for speech anomalies (loop, phoneme stretch, syllable rate, timbre drift), teal for voice quality issues. - Opening a dropdown in the anomaly list filters the waveform to show only that analyzer's regions. Opening multiple shows the union. Closing all restores the full view. ### Signal Artifacts --- #### Clipping **Purpose**: Detect digital clipping caused by audio samples exceeding the signal ceiling. **What it measures**: The fraction of total samples inside clipping runs (0–1). Samples with absolute amplitude above 0.95 that persist for at least 5 consecutive samples are counted. 
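
The counting rule just described for Clipping can be sketched in a few lines of Python. This is an illustrative approximation rather than Coval's analyzer: the function name `clipping_fraction` is hypothetical, samples are assumed to be floats normalized to the range -1.0 to 1.0, and only the headline fraction is computed (per-event severity is left out).

```python
def clipping_fraction(samples, threshold=0.95, min_run=5):
    """Hypothetical helper: fraction of samples that fall inside clipping runs.

    A run is a stretch of at least `min_run` consecutive samples whose
    absolute amplitude exceeds `threshold`.
    """
    total = len(samples)
    if total == 0:
        return 0.0

    clipped = 0  # samples counted as part of a qualifying run
    run = 0      # length of the current above-threshold stretch
    for value in samples:
        if abs(value) > threshold:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    # Close a run that extends to the end of the signal.
    if run >= min_run:
        clipped += run

    return clipped / total


# 20 samples: one 5-sample run above 0.95 counts, the 2-sample spike does not.
example = [0.1] * 10 + [0.99] * 5 + [0.1] * 3 + [-0.99] * 2
fraction = clipping_fraction(example)
```

In this example `fraction` is 0.25, because 5 of the 20 samples sit inside a qualifying run while the isolated 2-sample spike is filtered out by the minimum run length.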
**When to use**: When agents sound distorted or "crackly," or when you suspect TTS output levels are misconfigured. Common in pipelines where gain staging is not controlled. **How it works**: Scans the raw (non-normalized) audio for consecutive samples above the amplitude threshold (0.95). A minimum run length of 5 samples filters out isolated spikes. Per-event severity scales linearly with how far the peak exceeds the threshold. **How to interpret**: - Any clipping fraction above ~0.01 (1%) typically produces audible distortion. - High clipping severity combined with low dropout severity points to a gain staging issue upstream of your TTS provider. --- #### Dropout **Purpose**: Detect brief signal interruptions where audio unexpectedly drops to near-silence. **What it measures**: The density of dropout events per minute of audio. A dropout is a 50–200 ms window where the signal falls ≥25 dB below the local baseline with a steep onset edge (≥20 dB within 2 ms). **When to use**: When callers report choppy audio or cut-out moments, or when you see unexplained gaps in waveform visualizations. Also useful for detecting network-induced packet loss on telephony integrations. **How it works**: Computes a rolling local energy baseline and flags windows where RMS drops sharply. The steep-edge requirement distinguishes dropouts from natural pauses. Events are filtered to the 50–200 ms duration range to avoid conflating dropouts with silence or pauses. **How to interpret**: - Isolated events (< 0.5/min) are often benign; rates above 1–2/min produce noticeably choppy audio. - Short, steep dropouts that cluster in time suggest a network or buffering issue; distributed dropouts suggest a TTS rendering problem. --- #### Codec **Purpose**: Detect codec compression artifacts introduced by audio encoding and decoding. **What it measures**: Mean MAD z-score across detected codec artifact events. 
Higher values indicate more anomalous spectral features (MFCC cosine distance, spectral flatness, spectral flux) relative to the call's own median. Per-event timeline severities are normalized to 0–1 by capping at a z-score of 10. **When to use**: When audio has a "muddy," over-compressed, or digitally degraded quality that isn't explained by clipping or dropout. Common when audio passes through low-bitrate codecs (e.g., G.711 on telephony) or multiple encode/decode cycles. **How it works**: Extracts codec-sensitive spectral features and applies MAD-based outlier detection across segments. Segments with feature values that deviate significantly from the call median are flagged. The raw z-score is normalized to 0–1 by capping at 10, preventing extreme outliers from distorting comparisons. > **Note:** Codec is excluded from the Speech Artifact Score composite because its raw z-scores can be very large, which would dominate the weighted average even after normalization. It is still surfaced in the per-analyzer breakdown and waveform regions. **How to interpret**: - High codec severity with low clipping and dropout suggests the artifact is encoding-induced, not a gain or signal issue. - Telephony integrations running G.711 or similar narrow-band codecs will have a naturally elevated baseline. ### Speech Anomalies --- #### Loop Detection **Purpose**: Detect repeated audio segments caused by TTS synthesis looping on a phrase or fragment. **What it measures**: Count of distinct loop patterns detected. Each loop must last at least 1.0 second, repeat at least 3 times, and have pitch correlation above 0.95 (confirming TTS origin rather than natural repetition). **When to use**: When agents occasionally repeat themselves verbatim in a way that does not match the conversation script, or when TTS audio sounds like it stuttered and replayed a segment. 
**How it works**: Computes MFCC (Mel-frequency cepstral coefficient) fingerprints across the audio and detects recurrence patches where fingerprints match closely. Pitch correlation above 0.95 between candidate segments is required as a confirmation gate, distinguishing true TTS loops from natural lexical repetition. **How to interpret**: - Any loop event above the minimum duration threshold (1.0 s) is worth investigating — true loops are almost never intentional. - Low MFCC similarity at or near the threshold indicates borderline matches. --- #### Phoneme Stretch **Purpose**: Detect unnaturally sustained phonemes where a single sound is held far longer than normal speech cadence. **What it measures**: Total duration of all phoneme stretch events in seconds. **When to use**: When TTS audio sounds like it is "freezing" on a syllable or vowel, which can occur when synthesis models encounter unusual input (long numbers, rare proper nouns, edge cases in SSML). **How it works**: Applies a quad-gate: a region must simultaneously satisfy voiced-segment detection, low pitch jitter (below 2% — TTS vocoders produce unnaturally stable pitch), stable fundamental frequency (F0 std < 15 Hz), and MFCC stasis (minimal spectral movement) to be flagged. All four conditions must hold to filter out natural expressive lengthening and prosodic emphasis. **How to interpret**: - Events under 0.5 s may be expressive prosody; events above 1.5 s are almost certainly synthesis artifacts. --- #### Syllable Rate **Purpose**: Detect rhythm irregularities indicating unnaturally mechanical or erratic speech pacing. **What it measures**: Syllable rate in syllables per second, along with rhythm diagnostics (nPVI, inter-syllable CV, and a composite irregularity score). The headline value is the raw syllable rate; the aggregate uses the rhythm irregularity score (0–1) for severity calculation. 
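
For readers unfamiliar with the rhythm diagnostics named above, the sketch below computes a syllable rate, the standard normalized Pairwise Variability Index (nPVI), and a coefficient of variation from lists of durations. It is a hedged illustration: the helper names are hypothetical, the nPVI uses the textbook definition scaled by 100, the CV uses the population standard deviation, and how Coval combines these values into its composite irregularity score is not documented here, so that step is omitted.

```python
import statistics


def syllable_rate(syllable_count, speech_seconds):
    """Syllables per second over the voiced portion of the audio."""
    return syllable_count / speech_seconds


def npvi(durations):
    """Standard nPVI: mean normalized difference between successive durations, times 100."""
    if len(durations) < 2:
        return 0.0
    terms = [
        abs(a - b) / ((a + b) / 2.0)
        for a, b in zip(durations, durations[1:])
    ]
    return 100.0 * sum(terms) / len(terms)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed variant)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Perfectly uniform syllables give an nPVI of 0, which reads as robotic cadence.
uniform = npvi([0.2, 0.2, 0.2])
# Strongly alternating durations give a high nPVI.
alternating = npvi([1.0, 3.0])
rate = syllable_rate(12, 3.0)
```

With these inputs `uniform` is 0.0, `alternating` is 100.0, and `rate` is 4.0 syllables per second, which falls inside the 2.5–6.0 syl/s range the metric treats as natural.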
**When to use**: When agent speech sounds robotic or rushed, or when you want to validate that a new TTS model produces natural prosodic rhythm before deploying it. **How it works**: Measures variability between successive syllable durations (nPVI) and the coefficient of variation of inter-syllable gaps alongside absolute syllable rate. Scores are penalized when the rate falls outside the expected natural range (2.5–6.0 syl/s) or when rhythmic variability is abnormally low (monotone) or abnormally high (erratic). **How to interpret**: - A low irregularity score near 0 does not mean the speech sounds natural — it could indicate perfectly uniform (robotic) cadence. Examine both score and rate together. - High irregularity in combination with a rate above 6.0 syl/s suggests rushed synthesis; below 2.5 syl/s suggests sluggish or over-paused synthesis. --- #### Timbre Drift **Purpose**: Detect mid-call changes in voice identity — shifts in speaker timbre or pitch that make the agent sound like a different person. **What it measures**: Maximum cosine distance of speaker embeddings from an anchor reference (0–1). Higher values indicate greater drift from the original voice identity. F0 drift is tracked as an additional signal. **When to use**: When callers report that the agent's voice changed during a call, or when you suspect your TTS provider is switching voice models or applying inconsistent conditioning mid-session. **How it works**: Extracts ECAPA-TDNN speaker embeddings at regular intervals. Two references are maintained: a session anchor (first 15 seconds) and a rolling 30-second window centroid. Drift is flagged when either the anchor distance or rolling centroid distance exceeds the threshold, or when F0 deviates more than 30% from the anchor mean. Both embedding drift and pitch shift contribute to severity, with a multiplier when both signals fire simultaneously. 
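
The flagging rule for Timbre Drift described above can be sketched as follows. This is a simplified illustration under stated assumptions: speaker embeddings are taken as plain lists of floats (extracting ECAPA-TDNN embeddings is out of scope), `drift_flagged` and `cosine_distance` are hypothetical names, the embedding `distance_threshold` is a required argument because its value is not given in this documentation, and the severity multiplier is not modeled.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_flagged(embedding, anchor, rolling_centroid,
                  f0_hz, anchor_f0_hz, distance_threshold,
                  f0_tolerance=0.30):
    """Hypothetical helper: True if any of the three drift signals fires."""
    anchor_distance = cosine_distance(embedding, anchor)
    rolling_distance = cosine_distance(embedding, rolling_centroid)
    # Relative pitch deviation from the anchor mean (30% is the documented limit).
    f0_shift = abs(f0_hz - anchor_f0_hz) / anchor_f0_hz
    return (
        anchor_distance > distance_threshold
        or rolling_distance > distance_threshold
        or f0_shift > f0_tolerance
    )


anchor = [1.0, 0.0]
same_voice = drift_flagged([1.0, 0.0], anchor, anchor, 200.0, 200.0, distance_threshold=0.3)
new_voice = drift_flagged([0.0, 1.0], anchor, anchor, 200.0, 200.0, distance_threshold=0.3)
pitch_shift = drift_flagged([1.0, 0.0], anchor, anchor, 270.0, 200.0, distance_threshold=0.3)
```

Only `same_voice` is False here: `new_voice` trips the embedding check, and `pitch_shift` trips the F0 check even though the embedding is unchanged. The threshold of 0.3 is a placeholder chosen for the example, not a documented Coval value.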
**How to interpret**:
- Anchor drift catches slow degradation (voice getting "tired"); rolling centroid drift catches sudden voice swaps.
- Small drift values (< 0.2) are within normal TTS session variation; values above 0.3–0.4 are perceivable to most listeners.

---

#### Voice Quality

**Purpose**: Measure naturalness and vocal health of the agent's synthesized speech across four acoustic dimensions.

**What it measures**: A composite naturalness score 0–1 (higher = more natural) combining four sub-metrics:

| Sub-metric | Weight | What it captures |
|---|---|---|
| CPPS (Cepstral Peak Prominence Smoothed) | 40% | Voice clarity and breathiness: how well the harmonic structure stands above noise |
| Jitter | 20% | Cycle-to-cycle pitch period stability: irregularity sounds like roughness or hoarseness |
| Shimmer | 20% | Cycle-to-cycle amplitude stability: irregularity sounds like tremor or unsteadiness |
| F0 variability | 20% | Pitch expressiveness: monotone speech with low F0 range scores poorly |

**When to use**: When onboarding a new TTS provider or voice model, when monitoring for voice degradation over time, or when callers report the agent sounding robotic, breathy, or unsteady.

**How it works**: Requires both raw and LUFS-normalized audio. CPPS is computed on the LUFS-normalized variant, which stabilizes the cepstral envelope for accurate breathiness measurement. Jitter and shimmer are computed on the raw variant using a 3-second sliding window with 1-second hops, so short-lived instabilities are captured without being diluted by longer clean segments. F0 variability is derived from pitch statistics across the call; both standard deviation and range must exceed thresholds for a healthy score.

**How to interpret**:

| Score | Tier |
|---|---|
| ≥ 0.8 | Excellent |
| ≥ 0.6 | Good |
| ≥ 0.4 | Fair |
| < 0.4 | Poor |

- CPPS carries the most weight (40%) and is the most sensitive indicator of breathiness and vocal fry. A CPPS below ~6 dB strongly suggests synthesis quality issues.
- Jitter and shimmer thresholds are drawn from clinical speech pathology research on healthy vs. disordered voices. Each sub-metric scores 1.0 in the normal range, degrades linearly through the warning zone, and hits 0.0 at the instability ceiling:

  | Sub-metric | Normal (score 1.0) | Warning zone | Instability (score 0.0) |
  |---|---|---|---|
  | Jitter | < 1.04% | 1.04–2.0% | ≥ 2.0% |
  | Shimmer | < 3.81% | 3.81–6.0% | ≥ 6.0% |

  Well-tuned TTS systems should stay well under the normal-range limits. Values approaching the instability ceiling typically sound noticeably rough or unsteady to listeners.
- F0 variability scores poorly when F0 standard deviation is below 15 Hz or pitch range is below 30 Hz; both must exceed their thresholds for a full score.
- High jitter or shimmer without CPPS degradation often reflects expressive prosody rather than a synthesis defect. Look at all four sub-scores in the breakdown before drawing conclusions.

---

#### Pause Detection

**Purpose**: Detect anomalous pauses that are outliers relative to the agent's own baseline pause distribution.

**What it measures**: Count of anomalous pause events, identified using MAD (Median Absolute Deviation) z-score thresholding against the distribution of all pauses in the call. The headline value is the count of outlier pauses.

**When to use**: When callers report the agent "hanging" or going silent unexpectedly, or when you want to separate natural hesitation patterns from processing-delay artifacts.

**How it works**: Identifies all silent gaps in agent speech and computes a MAD-based z-score for each, making the detector robust to non-Gaussian pause distributions. Pauses with a z-score above the threshold (3.0) are flagged as anomalous. This approach adapts to each call's natural rhythm rather than applying a fixed silence threshold.
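As a rough illustration of MAD-based outlier detection on pause durations, the sketch below uses the common "modified z-score" form, `0.6745 * (x - median) / MAD`. The 0.6745 scaling constant, the function name, and the choice to flag only unusually long pauses are assumptions made for this example; the exact formula Coval uses is not stated here. Only the 3.0 threshold comes from the text.

```python
from statistics import median


def anomalous_pauses(pause_durations, threshold=3.0):
    """Return the pauses whose MAD-based z-score exceeds the threshold.

    pause_durations: pause lengths in seconds for one call.
    Only unusually long pauses are flagged (an assumption of this sketch).
    """
    if not pause_durations:
        return []
    med = median(pause_durations)
    mad = median([abs(d - med) for d in pause_durations])
    if mad == 0:
        # At least half of the pauses equal the median; the score is undefined.
        return []
    return [d for d in pause_durations
            if 0.6745 * (d - med) / mad > threshold]
```

Because both the center (median) and the spread (MAD) are computed per call, the same 2.5-second pause can be an outlier in a fluent call and unremarkable in a slow-paced one.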
**How to interpret**:
- The detector flags pauses that are statistical outliers within the call: a single very long pause in an otherwise fluent call will score higher than the same pause duration in a call with consistently slow pacing.
- A MAD z-score of 3.0 corresponds to a pause roughly three MAD-scaled deviations longer than the median pause for that agent in that call.

## Conversation Length

These metrics analyze the content and flow of conversations to ensure effective communication.

### Audio Duration

**Purpose**: This metric measures the duration of the audio file in seconds.

**What it measures**: Duration of the full conversation in seconds.

### Turn Count

**Purpose**: Counts how many turns were taken in a conversation.

**What it measures**: The number of turns, where each turn is a change between speakers.

### Words Per Message

**Purpose**: The average number of words per agent message.

**What it measures**: Average number of words per message in a conversation.

### Speaking Time Percentage

**Purpose**: Measures the percentage of total audio duration occupied by a selected speaker or silence.

**What it measures**: The fraction of the call spent by the configured role, expressed as a percentage (0-100%). Configurable via the `role` parameter:
- **Agent**: percentage of the call where the agent is speaking.
- **Persona**: percentage where the customer/persona is speaking.
- **Silence**: percentage of dead air (no speech or music).
- **Music**: percentage of hold music or background music.

**When to use**: Analyzing call composition, e.g., checking if the agent dominates the conversation, identifying excessive hold time, or measuring dead air.

**How to interpret**: The four roles are complementary: agent + persona + silence + music = 100%. Create multiple instances with different roles to get a full breakdown. Audio regions are highlighted on the waveform timeline for each detected segment.

## Instruction Following

These metrics measure how well the agent follows predefined behaviors.
### Workflow Verification

Verifies if conversations follow expected workflow patterns and business logic.

## Resolution

Evaluate the end of the conversation.

### End Reason

**Purpose**: The reason that the conversation ended.

**When to use**: Help identify patterns in call completion.

> **Note:** This currently only works for simulations run on Coval. Support for live calls is a work in progress.

**Possible values:**

| Value | Description |
|-------|-------------|
| `COMPLETED` | The conversation reached a natural conclusion with a successful resolution |
| `MAX_TURNS` | The conversation reached the maximum allowed number of turns |
| `MAX_DURATION` | The conversation exceeded the maximum allowed duration |
| `USER_HANGUP` | The user ended the conversation (voice calls only) |
| `AGENT_HANGUP` | The agent ended the conversation (voice calls only) |
| `IDLE_TIMEOUT` | The conversation timed out due to inactivity (chat/SMS simulations) |
| `ERROR` | An error occurred during the simulation |
| `UNKNOWN` | The end reason could not be determined |

> **Tip:** Want to define what counts as a successful end reason?
> The **Successful End Reason** custom metric returns YES/NO based on whether the end reason matches your configured success criteria. Select one or more end reasons as your success conditions.

## Responsiveness

Critical metrics to identify whether the agent is responding correctly.

### Agent Fails To Respond

**Purpose**: Evaluate continuity and identify moments when the agent ignores or misses a user query.

> **Warning:** Any occurrence of this metric indicates a critical failure requiring immediate
> investigation.

**What it measures**: Long silence from the agent between two consecutive user turns, and whether and when the agent eventually responds after the second user turn.
**When to use**: Identifying moments when the agent ignores or misses a user query.

**How it works**: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks whether the agent resumes speaking after this.

> **Tip:** Need a different silence threshold?
> The **Agent Fails to Respond Delay** custom metric lets you configure `max_silence_duration_seconds` instead of the fixed 5-second default.

### Agent Needs Reprompting

**Purpose**: Identifies when agents become unresponsive but will respond after user repetition.

> **Note:** This metric helps identify edge cases where the agent's response mechanism may
> be failing intermittently.

**What it measures**: Long silence from the agent between two consecutive user turns, counted only if the agent responds after the second user turn.

**When to use**: Evaluating naturalness and continuity. Identifying moments when the agent ignores or misses a user query.

**How it works**: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks if the agent resumes speaking after this.

**How to interpret**:
- Each silence gap and eventual response are collectively considered one event.
- More events = worse responsiveness.

> **Tip:** Need a different silence gap threshold?
> The **Agent Reprompting Delay** custom metric lets you configure `min_silence_gap_seconds` instead of the fixed 2-second default.

### Agent Repeats Itself

**Purpose**: Identifies instances where the agent says the same sentence or asks the same question multiple times.

**When to use**: Evaluating naturalness and word choice, and checking that the agent uses varied language.

## Timing & Latency

Ensure timely agent interactions.

### Interruption Rate

**Purpose**: The rate (interruptions per minute) at which the user is interrupted by the assistant.

**What it measures**: An interruption is defined as any time the user is speaking and the assistant starts speaking before the user has finished speaking.
This does not include times that the user interrupts the assistant.

**When to use**: Conversation flow analysis, identifying communication issues, training data for interruption handling.

**How to interpret**:
- High interruption frequency may indicate communication issues.
- Interruption patterns can help identify conversation flow problems.
- Useful for training agents to handle interruptions gracefully.

### Latency

**Purpose**: Measures the delay between the end of user speech and the start of the agent's response, in milliseconds (ms).

**What it measures**: Time between user input and agent response, and silence durations.

**When to use**: Performance evaluation, identifying slow response times, conversation flow analysis.

**How it works**: Analyzes the audio signal using Voice Activity Detection (VAD) to identify speaker transitions and measure the time delay between when a user finishes speaking and when the agent begins responding. The metric tracks these response times throughout the conversation to identify patterns and potential issues.

**How to interpret**:
- 3.5 seconds is the average latency of most agents in real-time conversations.
- Higher latencies may indicate performance issues or processing bottlenecks.

### Time To First Audio

**Purpose**: Detect audio start latency and responsiveness.

**What it measures**: Time delay between simulation start and the first audible sound in the audio recording.

**When to use**: Evaluating system or agent response latency before any speech begins.

**How it works**: Detects the first audio frame that has RMS energy above a certain threshold and returns the timestamp of this frame.

**How to interpret**:
- Below 1000 ms: Fast audio start; considered responsive.
- 1000–3000 ms: Acceptable delay.
- Above 3000 ms: Noticeable lag; may indicate issues in agent response, recording delay, or user hesitation.
- -1 ms: No audio detected; likely a technical failure or silent recording.
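The frame-scanning idea behind Time To First Audio can be sketched as below. This is an illustrative approximation only: the 20 ms frame size, the `rms_threshold=0.01` default (for samples scaled to the range -1.0 to 1.0), and the function name are assumptions for this example, not values documented by Coval. The -1 return value mirrors the "no audio detected" case described above.

```python
import math


def time_to_first_audio_ms(samples, sample_rate, frame_ms=20, rms_threshold=0.01):
    """Return the start time (ms) of the first frame whose RMS exceeds the threshold.

    samples: mono audio as floats in [-1.0, 1.0].
    Returns -1 when no frame is loud enough (silent recording).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms > rms_threshold:
            return int(start * 1000 / sample_rate)
    return -1
```

The returned timestamp has frame-level resolution, so a shorter `frame_ms` gives a finer estimate at the cost of more sensitivity to brief clicks.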
### Speech Tempo

**Purpose**: Identifies the rate of phonemes (perceptually distinct units of speech sound) and high-speed speech periods.

**What it measures**: The rate of phonemes per second (PPS) in audio output.

**When to use**: Speech quality assessment. Useful for identifying the average tempo.

**How it works**: Measures the number of phonemes per interval over the duration of each speech segment.

**How to interpret**:
- Above 20 PPS is too fast and will be hard to follow.
- 15-20 PPS is fast but can still be comprehensible.
- 10-15 PPS is the target range: neither too fast nor too slow.
- Below 10 PPS is too slow.

### Pause Analysis

**Purpose**: Measures how frequently the agent pauses mid-speech and how long those pauses are.

**What it measures**: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration.

**When to use**:
- Identifying unnatural or excessive hesitations in agent speech
- Detecting processing delays that manifest as in-speech pauses
- Evaluating speech fluency across different configurations

**How it works**: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded.

**How to interpret**:
- Lower values indicate more fluent speech.
- The detail view shows each pause with its timestamp and duration.
- Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.

## Trace Metrics

These metrics use OpenTelemetry (OTel) trace data to measure the performance of individual components in your voice agent pipeline. They provide granular visibility into LLM, TTS, and STT service latencies, token consumption, and tool usage.

> **Note:** All metrics in this section require your agent to send OpenTelemetry traces to Coval.
> See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup instructions.
> If traces are not configured, these metrics will report an error at execution time. ### LLM Time to First Byte **Purpose**: Measure LLM responsiveness by tracking how quickly the first token is returned. **What it measures**: Average time (in seconds) from when the LLM request is sent to when the first token is received, across all turns in the conversation. **When to use**: Identifying slow LLM providers, comparing model latencies, optimizing prompt length for faster responses. ### TTS Time to First Byte **Purpose**: Measure TTS responsiveness by tracking how quickly the first audio byte is produced. **What it measures**: Average time (in seconds) from when text is sent to the TTS service to when the first audio byte is returned, across all turns. **When to use**: Evaluating TTS provider performance, identifying bottlenecks in the audio generation pipeline. ### STT Time to First Byte **Purpose**: Measure STT responsiveness by tracking how quickly the first transcription result is returned. **What it measures**: Average time (in seconds) from when audio is sent to the STT service to when the first transcription result is received, across all turns. **When to use**: Evaluating STT provider performance, diagnosing why the agent is slow to start processing user input. ### LLM Token Usage **Purpose**: Track the total token consumption of LLM calls during a conversation. **What it measures**: Sum of input tokens and output tokens consumed across all LLM calls in the conversation. **When to use**: Cost monitoring, identifying conversations that consume excessive tokens, comparing prompt strategies for efficiency. **How to interpret**: - Token counts vary by model and use case. Track this metric over time to establish baselines for your specific agent. - Sudden spikes may indicate prompt injection, runaway tool loops, or excessively long conversations. - Use in combination with turn count to compute average tokens per turn. 
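Combining LLM Token Usage with Turn Count is simple arithmetic; a minimal sketch is shown below. The function name and its arguments are hypothetical and only illustrate how the two metric values might be combined after a run; they are not part of any Coval API.

```python
def tokens_per_turn(input_tokens, output_tokens, turn_count):
    """Average tokens consumed per turn; returns 0.0 for a conversation with no turns."""
    if turn_count <= 0:
        return 0.0
    return (input_tokens + output_tokens) / turn_count
```

For example, a conversation with 1,200 input tokens, 300 output tokens, and 10 turns averages 150 tokens per turn. Tracking this ratio rather than the raw total makes long and short conversations easier to compare.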
### Tool Call Count **Purpose**: Count the total number of tool calls made during a conversation. **What it measures**: Number of tool call invocations detected in OTel trace spans. **When to use**: Verifying that the agent is using tools as expected, identifying conversations with excessive or insufficient tool usage. ### STT Word Error Rate **Purpose**: Measure the accuracy of your agent's Speech-to-Text by comparing it against Coval's own transcription of the same conversation. **What it measures**: Word Error Rate (WER) between your agent's STT output and Coval's reference transcript of the caller's speech. The **reference** (ground truth) is Coval's transcription of the persona's speech, generated automatically from each simulation. The **hypothesis** (what you're testing) is your agent's own STT output, read from the `transcript` attribute on OTel `stt` spans. **When to use**: Evaluating your STT provider's accuracy, comparing STT providers (e.g., Deepgram vs Whisper vs Google), diagnosing why your agent misunderstands users, or tracking STT quality over time. > **Note:** This metric requires your agent to emit OTel traces with the `transcript` attribute on each `stt` span. Coval also accepts the older `stt.transcription` alias, but new integrations should emit `transcript`. See [Instrumenting STT Spans](/concepts/simulations/traces/opentelemetry#instrumenting-stt-spans) for setup instructions. If your STT provider exposes utterance confidence, we also recommend sending `stt.confidence` on the same spans for debugging and provider-quality analysis. No manual ground truth or test data is needed — the reference transcript is generated automatically. **How to interpret**: A lower WER means your STT is more accurately capturing what the caller said. Compare this metric across runs to track STT quality over time, or across different STT providers to find the best fit for your use case. 
Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally. For the WER formula and interpretation thresholds, see [Transcription Error](#transcription-error). ### STT Word Error Rate (Audio Upload) **Purpose**: Measure STT accuracy against a known-correct transcript that you provide, rather than Coval's auto-generated reference. **What it measures**: Word Error Rate (WER) between your agent's STT output and a ground truth transcript you supply in the test case metadata. The **reference** (ground truth) comes from the `ground_truth_transcript` field in your test case metadata. The **hypothesis** (what you're testing) is your agent's own STT output, read from the `transcript` attribute on OTel `stt` spans — the same as the standard [STT Word Error Rate](#stt-word-error-rate) metric. **When to use**: When you have [audio upload](/concepts/test-sets/overview#3-audio-upload) test cases with pre-recorded audio where you know exactly what was said. This lets you measure how accurately your agent's speech recognition transcribes a known recording — useful for regression testing STT quality against a canonical script. > **Note:** This metric requires two things: > 1. Your agent must emit OTel traces with the `transcript` attribute on each `stt` span. The older `stt.transcription` alias is accepted for compatibility, but `transcript` is canonical. See [Instrumenting STT Spans](/concepts/simulations/traces/opentelemetry#instrumenting-stt-spans) for setup. > 2. Your test case must include a `ground_truth_transcript` key in its metadata containing the reference transcript. See [Audio Upload — Ground Truth Transcript](/concepts/test-sets/overview#ground-truth-transcript) for details. > > If your STT provider exposes utterance confidence, we also recommend attaching `stt.confidence` to each `stt` span so low-confidence turns are easier to inspect alongside the WER result. 
**Accepted ground truth formats:** | Format | Example | |--------|---------| | Plain text | `"Hi, I'd like to check my account balance"` | | Labeled text with timestamps | `"[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance"` | | JSON with `messages` array | Persona turns are extracted automatically — see snippet below | ```json { "messages": [ { "role": "user", "content": "Hi, I'd like to check my account balance" }, { "role": "assistant", "content": "Sure, I can help with that." } ] } ``` When the ground truth contains role labels (e.g. `PERSONA:`, `AGENT:`), only persona/user lines are used — agent lines are filtered out automatically. **How to interpret**: Same as [STT Word Error Rate](#stt-word-error-rate) — a lower WER means better STT accuracy. Because the reference transcript is your own known-correct text (not Coval's transcription), this metric isolates your STT provider's accuracy without any variance from the reference side. For the WER formula and interpretation thresholds, see [Transcription Error](#transcription-error). > **Tip:** **Custom Trace Metrics** — In addition to these built-in trace metrics, you can create your own custom trace metrics to measure any OTel span attribute emitted by your agent using the **Create Metric** button within the Metrics section of the Coval UI. ## Transcription Accuracy ### Transcription Error **Purpose**: Evaluate transcription accuracy through Word Error Rate (WER) — the percentage of words in the existing transcript that differ from a reference transcription Coval generates from the call audio. 
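Word Error Rate is the word-level edit distance between a hypothesis transcript and a reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration that assumes lowercase, whitespace-separated tokens; Coval's actual text normalization and error filtering are not reproduced here, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, because the denominator counts only reference words.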
**What it measures**: $$ WER = \frac{S + D + I}{N} $$ Where: - **S** = substitutions - **D** = deletions - **I** = insertions - **N** = total number of words in the reference transcript **When to use**: - Detecting transcription quality regressions across runs or audio configurations - Comparing speech-to-text providers for the agent or persona side - Reporting WER for the agent, the caller, or both sides of a conversation - Measuring transcription accuracy on uploaded conversations **How it works**: Coval generates an independent reference transcription of the call audio and compares it to the existing transcript word by word. Each missing, inserted, or substituted word becomes an error and is surfaced as a word-level highlight in the transcript view (substitutions in yellow, deletions in red, insertions in blue). The metric automatically detects which speaker channel is which, so it works whether the audio was recorded by Coval or uploaded as a conversation. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `role` | `"agent"` | Whose turns to score. One of `"agent"`, `"persona"`, or `"both"` to measure agent and caller together. | | `min_reference_confidence` | `0.8` | Drop missing- or wrong-word errors where the reference recognizer's confidence is below this value (0.0–1.0). Filters out errors that are likely just reference uncertainty. Leave blank to disable. | | `min_substitute_similarity` | `0.8` | Drop substitution errors where the original and replacement words are at least this similar (0.0–1.0). Filters out spelling variants like "color"/"colour". Leave blank to disable. | The headline WER reflects errors **after** filtering, so the displayed number always matches the highlighted errors in the transcript view. **How to interpret**: - **WER < 0.10**: Excellent — clean audio with high transcription accuracy. 
- **WER 0.10 – 0.30**: Acceptable for most conversational agents and situations with background noise. - **WER > 0.30**: May significantly impact understanding of the audio. > **Tip:** See [Coval Benchmarks](https://benchmarks.coval.ai) for real-world WER performance data across different transcription providers and audio configurations. ## User Patterns ### Audio Sentiment **Purpose**: Detect vocal tone of each audio segment. **What it measures**: Emotional tone for each audio segment for both parties. **When to use**: General tone of the conversation and trend of audio sentiment across the conversation. **How it works**: Classifies audio sentiment per speaking segment based purely on the audio tone and not the spoken content. **How to interpret**: Check frequency of certain emotional tones. > **Tip:** Want to set a pass/fail threshold on sentiment? > The **Preferred Audio Sentiment** custom metric lets you select which sentiments count as success, choose which speaker to evaluate (agent or persona), and set a minimum percentage of segments that must match. ### Transcript Sentiment Analysis **Purpose**: Analyzes the transcript for rude, polite, encouraging, and professional sentiments, identifying the sentiment with the highest overall score. **What it measures**: Score of emotional tone for each audio segment for the agent. **When to use**: General tone of the agent and how it could be interpreted. **How it works**: Classifies audio sentiment per speaking segment based purely on the audio tone and not the spoken content. **How to interpret**: Higher scores in each sentiment indicate stronger sentiment detected. ## Best Practices for Using Built-in Metrics Begin with essential metrics like response time, resolution success, and audio quality before adding specialized ones. Establish baseline measurements before making changes to track improvement over time. Use multiple metrics together for comprehensive evaluation rather than relying on single indicators. 
Schedule regular metric reviews to identify trends and areas needing attention.

## Metric Selection Guide

Choose metrics based on your use case:

### Voice Assistants
- Audio Quality
- Speech Tempo
- Background Noise
- Volume/Pitch Misalignment
- Latency
- Interruption Detection
- Trace Metrics (requires OTel traces)
  - LLM Time to First Byte
  - TTS Time to First Byte
  - STT Time to First Byte

### Customer Service Bots
- Composite Evaluation
- Resolution Time Efficiency
- End Resolution
- Audio Sentiment

### Task Automation Agents
- Workflow Verification
- Composite Evaluation
- Words Per Minute
- LLM Token Usage
- Tool Call Count

### General Conversational AI
- Agent Response Times
- Interruption Rate
- Agent Repeats Itself
- Transcript Sentiment Analysis
- End Reason

> **Note:** Remember that not all metrics are suitable for every scenario. Audio metrics
> require actual audio input, while comparison metrics need reference data to
> function properly.

---

## Custom Metrics

Source: https://docs.coval.ai/concepts/metrics/prompting

This guide provides instructions for creating high-performing custom prompting metrics in Coval's evaluation platform. Each metric type benefits from different prompting strategies to achieve reliable, deterministic results. For help writing prompts for custom metrics, Coval offers an `optimize metric` button to improve clarity and confidence.

## Core Principles for All Metrics

### 1. **Specificity Over Generality**
- Define exact evaluation criteria rather than subjective assessments
- Use concrete, measurable behaviors instead of abstract concepts
- Provide clear boundary conditions for edge cases

### 2. **Role Consistency**
- Always refer to the AI agent as "the assistant"
- Use "the user" or "the customer" for human participants
- Maintain consistent terminology throughout your prompts

### 3. **Deterministic Design**
- Structure prompts to minimize LLM variance across evaluations
- Provide explicit decision trees when possible
- Define what constitutes partial vs. complete success

---

## Binary LLM Judge Metrics

**Purpose**: Yes/No evaluations with high accuracy and consistency

### Prompt Structure Template

```
[CONTEXT SETTING]
Given the transcript, [SPECIFIC QUESTION]?

Return YES if:
• [Explicit criterion 1]
• [Explicit criterion 2]
• [Explicit criterion 3]

Return NO if:
• [Explicit disqualifying condition 1]
• [Explicit disqualifying condition 2]
• [Edge case handling]

[CLARIFICATIONS FOR EDGE CASES]
```

> **Tip:** **Important**: When using OR conditions, make it explicitly clear that the metric should return `YES`/`NO` if **any** of the conditions are met. Use "ANY of the following" language to remove ambiguity.
> * "Return x if ANY of the following apply:"
> * "[Condition] OR [Condition] OR [Condition] ... "

### Example 1: Issue Resolution Detection

```
Given the transcript, did the assistant successfully resolve the user's primary issue or concern?
Return YES if ANY of the following apply: • The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you") • OR the assistant provides a complete solution and the user accepts it without further objection • OR the user indicates satisfaction with the outcome before ending the conversation • OR the assistant completes a requested action and the user acknowledges success • OR the user's question was fully answered and they don't ask follow-up questions about the same issue • OR the assistant provides complete, actionable guidance and the user indicates understanding • OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries) Return NO if ANY of the following apply: • The user states their issue remains unresolved • OR the conversation ends without addressing the user's main concern • OR the user expresses frustration or dissatisfaction with the proposed solution • OR the assistant escalates or transfers the issue without providing any resolution attempt • OR the user has to repeat their problem multiple times without progress • OR the assistant admits they cannot help or solve the user's problem • OR the user asks the same question again after receiving an answer ``` ### Example 2: Compliance Verification ``` Given the transcript, did the assistant properly collect all required verification information before processing the request? Return YES if: • The assistant gathered account number, full name, and security question answer • All three verification elements were confirmed before proceeding • The assistant explicitly stated verification was complete Return NO if: • Any of the three required elements (account number, name, security answer) were skipped • The assistant proceeded with the request before completing verification • Verification was attempted but failed, yet the assistant continued anyway If the user refuses to provide verification, return NO regardless of the reason. 
```

### Tips and tricks

**Be Objective**

**Recommended:** **Objective**: "Did the assistant acknowledge the user's concern within their first two responses?"

**Avoid:** **Too subjective**: "Did the assistant provide good customer service?"

**Single Focus**

**Recommended:** **Singular observation**:
- Create separate metrics for separate observations, such as resolution and professionalism.

**Avoid:** **Multiple criteria**:
- "Did the assistant resolve the issue and maintain professionalism?"

**Clear Logic**

**Recommended:** **Use of clear logical operators**
- Use AND/OR operators, ANY/ALL.

**Avoid:** **Evaluation logic that contradicts the stated rules**
- Metrics return incorrect results when the evaluation checks for things that shouldn't trigger failures, such as requiring disclosure when no transfer occurred or flagging live conversations as voicemails.

#### Before (Poor Metric Example):

```
Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, or billing preferences?

Return YES if: All three preference items were specifically inquired about.
Return NO if: One or more items were not asked.
```

**Why this fails**: The metric has an "OR" condition in the question but requires "AND" logic in the evaluation, creating confusion about whether one or all conditions must be met.

#### After (Improved Metric Example):

```
Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, and billing preferences?
Return YES if: • The agent asked about preferred contact method, current service plan, AND billing preferences • This can be in a single question (e.g., "What's your preferred contact method, current plan, and billing preference?") OR separate questions for each item Return NO ONLY if: • The agent failed to ask about one or more of these three specific items: contact method, service plan, or billing preferences • Note: Focus on what the AGENT asked, not on what the customer mentioned in their response Examples of acceptable questions: • "How would you like us to contact you, what's your current plan, and how do you prefer to handle billing?" • Three separate questions covering each preference • Any variation that covers all three customer preference areas ``` > **Tip:** **Key improvements**: Clear AND/OR operators, explicit examples, and evaluation logic that matches the stated conditions. --- ## Categorical LLM Judge Metrics **Purpose**: Classification into predefined, mutually exclusive custom categories. ### Prompt Structure Template ``` Classify [SPECIFIC ASPECT] based on the conversation content. Decision Logic: • If [condition], classify as [CATEGORY_NAME] • If [condition], classify as [CATEGORY_NAME] • If [condition], classify as [CATEGORY_NAME] Return only the exact category name. ``` > **Note:** Note: Configure the category options and their definitions in the Coval UI category menu. The categories and their descriptions are set through the platform interface, not in the prompt text. ### Example 1: Call Intent Classification ``` Classify the primary reason for this conversation based on the user's needs and requests. 
Decision Logic: • If user mentions technical problems, errors, or "not working", classify as TECHNICAL_SUPPORT • If user mentions money, charges, bills, or payments, classify as BILLING_INQUIRY • If user wants to change account details or settings, classify as ACCOUNT_MANAGEMENT • If user asks general questions without specific issues, classify as GENERAL_INFORMATION • If user expresses dissatisfaction and requests escalation, classify as COMPLAINT_ESCALATION Return only the exact category name. ``` ### Example 2: Conversation Outcome Classification ``` Classify the final outcome of this conversation based on how it concluded. Decision Logic: • If user explicitly confirms resolution or satisfaction, classify as RESOLVED_SUCCESSFULLY • If solution provided but requires user action outside conversation, classify as PARTIALLY_RESOLVED • If conversation transferred to human agent, classify as ESCALATED_TO_HUMAN • If user ends conversation frustrated or without resolution, classify as UNRESOLVED_ABANDONED • If user asked questions and received answers without specific problems, classify as INFORMATION_PROVIDED Return only the exact category name. ``` --- ## Numerical LLM Judge Metrics **Purpose**: Score-based evaluations with consistent integer scaling. ### Prompt Structure Template ``` Rate [SPECIFIC ASPECT] based on the following criteria: Evaluation Criteria: • [Criterion 1 with behavioral indicators] • [Criterion 2 with behavioral indicators] • [Criterion 3 with behavioral indicators] Scoring Guidelines: Low scores: [Behavioral indicators for poor performance] High scores: [Behavioral indicators for excellent performance] Return only the numerical score. ``` **Note**: Configure the Min and Max score values as shown, not in the prompt text. The scoring scale (e.g., 1-5, 1-10) is set through the platform interface. 
### Example 1: Empathy Assessment

```
Rate the assistant's empathy level based on the following criteria:

Evaluation Criteria:
• Acknowledgment of user emotions and concerns
• Use of appropriate empathetic language and tone indicators
• Validation of user feelings before moving to solutions
• Adaptation of communication style to user's emotional state

Scoring Guidelines:
Low scores: No empathy shown, dismissive responses, purely transactional
High scores: Clear empathetic responses, validates feelings, shows genuine concern

Return only the numerical score.
```

### Example 2: Technical Accuracy Scoring

```
Rate the technical accuracy of the assistant's information based on the following criteria:

Evaluation Criteria:
• Factual correctness of all technical statements
• Completeness of technical explanations
• Appropriate level of technical detail for the context
• Identification and correction of any technical misconceptions

Scoring Guidelines:
Low scores: Major technical errors that could cause problems
High scores: Expert-level accuracy with comprehensive, precise details

Return only the numerical score.
```

---

## Multimodal LLM Judge Metrics

**Purpose**: Include audio-specific evaluations that text analysis cannot capture.

Multimodal LLM Judge metrics analyze the audio along with the transcript text. This allows you to evaluate qualities like vocal tone, speech clarity, pacing, and emotional expression that are impossible to assess from text alone.

> **Note:** The format of Multimodal LLM Judge metrics is the same as that of LLM Judge metrics.
> Coval handles the audio processing automatically. Your prompt should focus on **what you want to evaluate**, not how to process the audio.
### What Audio Metrics Can Detect Audio LLM Judge metrics excel at evaluating: | Category | Examples | | -------------------------- | -------------------------------------------------------- | | **Speech Quality** | Clarity, articulation, pronunciation, stuttering | | **Vocal Characteristics** | Tone, pitch, volume consistency, speaking pace | | **Emotional Expression** | Enthusiasm, frustration, sarcasm, empathy in voice | | **Professional Demeanor** | Courtesy, patience, confidence, nervousness | | **Speaker Identification** | Distinguishing between speakers, detecting interruptions | ### Prompt Structure Template ``` [SPECIFIC AUDIO QUESTION] Audio Analysis Criteria: • [Acoustic feature 1] • [Vocal characteristic 2] • [Speech pattern 3] Return YES if: • [Audio-specific condition 1] • [Audio-specific condition 2] Return NO if: • [Audio-specific disqualifier 1] • [Audio-specific disqualifier 2] Note: [Clarification about evaluation scope] ``` > **Tip:** **Writing Effective Audio Prompts**: Be specific about which speaker to > evaluate (assistant, user, or both) and what acoustic qualities matter. Vague > prompts like "Did it sound good?" produce inconsistent results. ### Transcript Scope for Audio Metrics Audio LLM Judge metrics support [Transcript Scope](#transcript-scope), allowing you to evaluate only specific portions of the audio. When you apply filters (such as agent-only or last N turns), the system automatically extracts and evaluates only the corresponding audio segments. This is particularly useful for: - Evaluating agent speech quality without user audio - Focusing on closing statements or greetings - Reducing token costs on long recordings ### Best Practices for Audio Metrics **Focus on Audio-Only Qualities** Only use Audio LLM Judge for evaluations that **require hearing the audio**. 
If something can be determined from the transcript alone (like whether specific words were said), use a standard LLM Judge metric instead - it's faster and more cost-effective. **Use Audio metrics for:** Tone of voice, speaking pace, pronunciation clarity, emotional expression, volume issues **Use Text metrics for:** Word choice, script compliance, information accuracy **Specify the Speaker Role** Always clarify whose audio you're evaluating: - "Did **the assistant** speak clearly..." - "Did **the user** sound frustrated..." - "Was there crosstalk between **both speakers**..." This prevents ambiguity when multiple voices are present. **Define Concrete Audio Criteria** Replace subjective terms with specific, observable audio qualities: | Avoid | Use Instead | |-------|-------------| | "Good tone" | "Calm, even-paced tone without audible frustration" | | "Clear speech" | "Words pronounced distinctly without mumbling or slurring" | | "Professional" | "Business-appropriate volume and pace, no sighing or dismissive inflections" | **Include Reasoning Guidance** For complex evaluations, ask the model to consider specific aspects before making a determination. This improves accuracy: ``` Before making your determination, consider: 1. What is the overall vocal tone throughout the call? 2. Are there any moments where the tone shifts notably? 3. How would a customer likely perceive this tone? ``` ### Example 1: Speech Clarity Assessment ``` Did the assistant speak clearly and at an appropriate pace throughout the conversation? 
Audio Analysis Criteria: • Pronunciation clarity and articulation • Speaking pace (not too fast or slow for comprehension) • Volume consistency and audibility • Absence of mumbling, slurring, or rushed speech Return YES if: • All words are clearly pronounced and easily understood • Speaking pace allows for comfortable comprehension • Volume remains consistent and audible throughout • No instances of unclear or garbled speech Return NO if: • Words are frequently mumbled, slurred, or unclear • Speaking pace is too fast or slow for easy comprehension • Volume fluctuations make parts difficult to hear • Any portions of speech are unintelligible due to clarity issues Note: Focus only on the assistant's speech clarity, not content quality. ``` ### Example 2: Professional Tone Detection ``` Did the assistant maintain a professional vocal tone throughout the conversation? Audio Analysis Criteria: • Tone consistency and appropriateness for business context • Absence of inappropriate emotional expressions (anger, frustration, sarcasm) • Professional demeanor in vocal inflection and manner • Respectful and courteous vocal presentation Return YES if: • Vocal tone remains professional and business-appropriate throughout • No instances of unprofessional vocal expressions or attitudes • Tone conveys respect and courtesy consistently • Emotional responses, if any, are appropriate to the context Return NO if: • Vocal tone becomes unprofessional, dismissive, or inappropriate • Clear instances of anger, frustration, or sarcasm in voice • Tone suggests disrespect or lack of courtesy • Emotional vocal responses inappropriate for professional context Note: Evaluate vocal tone and manner, not the words spoken. ``` ### Example 3: Empathy Detection ``` Did the assistant demonstrate vocal empathy when the user expressed frustration or concern? 
Audio Analysis Criteria: • Softening of tone when user expresses negative emotions • Appropriate pacing adjustments (slowing down to show care) • Warm, understanding vocal quality rather than robotic or dismissive • Verbal acknowledgments delivered with genuine-sounding concern Return YES if: • Assistant's tone audibly softens or warms in response to user distress • Pacing adjusts appropriately to show the assistant is listening • Voice conveys genuine concern rather than scripted responses • No rushing through empathetic statements Return NO if: • Assistant maintains the same tone regardless of user's emotional state • Empathetic words are delivered in a flat, robotic, or rushed manner • Assistant sounds impatient or dismissive when user is upset • No vocal adaptation to the user's emotional needs Note: Evaluate the vocal delivery of empathy, not just whether empathetic words were used. ``` ### Example 4: Speaker Diarization Quality ``` Can the two speakers (assistant and user) be clearly distinguished throughout the recording? Audio Analysis Criteria: • Distinct vocal characteristics between speakers • Clear turn-taking without excessive overlap • Ability to attribute each utterance to the correct speaker • Audio quality sufficient for speaker identification Return YES if: • Each speaker has distinguishable vocal qualities • Turn-taking is clear with minimal confusing overlaps • All significant utterances can be attributed to a specific speaker • No extended portions where speaker identity is unclear Return NO if: • Speakers sound too similar to reliably distinguish • Frequent overlapping speech makes attribution difficult • Significant portions have unclear speaker identity • Audio quality issues (echo, distortion) prevent speaker identification Note: This metric evaluates audio clarity for speaker identification, not conversation quality. ``` ### Common Pitfalls to Avoid > **Warning:** **Don't mix audio and text evaluations** in a single Audio metric. 
If you need > to check both "Did they sound professional?" AND "Did they say the required > disclaimer?", create two separate metrics - an Audio metric for tone and a > Text metric for the disclaimer. | Pitfall | Problem | Solution | | ----------------------------- | ---------------------------------------------------- | ---------------------------------------------- | | Evaluating transcript content | Audio metrics can't reliably assess word choice | Use standard LLM Judge for text content | | Vague audio criteria | "Good voice" is subjective and inconsistent | Define specific qualities: pace, clarity, tone | | Missing speaker specification | Unclear whose voice to evaluate | Always specify: assistant, user, or both | | Combining unrelated qualities | "Clear AND professional AND empathetic" is too broad | Create separate metrics for each quality | --- ## Transcript Scope **Purpose**: Focus metric evaluation on specific portions of a conversation rather than the entire transcript. Transcript Scope allows you to filter which messages the LLM evaluates, reducing noise and improving accuracy for targeted assessments. This feature is available for all LLM Judge metrics (Binary, Numerical, Categorical) and Audio LLM Judge metrics. 
### When to Use Transcript Scope

| Use Case | Filter Configuration |
|----------|---------------------|
| Evaluate only agent responses | Role filter: `agent` |
| Check the closing of a conversation | Range filter: Last 3 turns |
| Assess user sentiment only | Role filter: `user` |
| Focus on recent context | Range filter: Last N turns |

### Configuration Options

**Transcript Scope Toggle**:

- **Full** (default) - Evaluate the entire transcript
- **Custom** - Apply filters to focus on specific messages

[Image: Transcript Scope UI]

**Available Filters**:

**Role Filter**

Limit evaluation to messages from specific speakers:

- **Agent** - Only evaluate assistant/agent messages
- **User** - Only evaluate user/customer messages
- **Both** - Evaluate messages from both roles

This is useful when you want to assess agent behavior without user input affecting the evaluation, or vice versa.

**Range Filter**

Limit evaluation to a specific portion of the conversation:

- **Last N turns** - Evaluate only the final N message exchanges
- **First N turns** - Evaluate only the opening N message exchanges

This is useful for evaluating specific phases of a conversation, such as greetings, closings, or resolution attempts.

### Transcript Scope for Audio Metrics

When using Transcript Scope with Audio LLM Judge metrics, the system automatically:

1. Filters the transcript to the selected messages
2. Uses message timestamps to extract the corresponding audio segments
3. Merges adjacent audio segments (within 0.5 seconds) to avoid artifacts
4. Sends only the filtered audio to the LLM for evaluation

This enables focused audio evaluations while reducing processing time and token costs.
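The filter-and-merge steps above can be sketched in a few lines of Python. This is only an illustration, not Coval's implementation: the `Message` type, the `scoped_audio_spans` function, and the choice to treat a gap of exactly 0.5 seconds as "adjacent" are all assumptions, and only the Role filter is modeled.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical transcript message with audio timestamps in seconds."""
    role: str
    start: float
    end: float


def scoped_audio_spans(messages, role, max_gap=0.5):
    """Return merged (start, end) audio spans for one speaker role.

    Spans whose gap is at most `max_gap` seconds are merged into one,
    mirroring the "merge adjacent segments" step described above.
    """
    spans = sorted((m.start, m.end) for m in messages if m.role == role)
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


transcript = [
    Message("agent", 0.0, 2.0),
    Message("agent", 2.3, 4.0),
    Message("user", 4.5, 6.0),
    Message("agent", 6.2, 7.5),
]
print(scoped_audio_spans(transcript, role="agent"))
# [(0.0, 4.0), (6.2, 7.5)]
```

The first two agent segments are 0.3 seconds apart and collapse into one span, while the user's turn leaves a gap too large to merge across.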
**Example**: To evaluate only the agent's speech quality in the last 3 turns: - Enable **Custom** transcript scope - Add a **Role filter** for `agent` - Add a **Range filter** for `Last 3 turns` The metric will only analyze the agent's audio from the final 3 exchanges, ignoring user speech and earlier portions of the call. ### Benefits - **More accurate evaluations** - Remove noise from irrelevant messages - **Lower costs** - Process less content per evaluation - **Faster execution** - Smaller context means quicker LLM responses - **Targeted insights** - Focus on the exact conversation segments that matter > **Tip:** Combine multiple filters for precise control. For example, use both a Role filter (agent only) and a Range filter (last 5 turns) to evaluate just the agent's closing performance. --- ## Composite Evaluation **Purpose**: Evaluates a transcript against custom criteria and returns an aggregated score. It assesses each criterion and reports how many passed. **When to use**: Use Composite Evaluation when you need to check whether a conversation meets several requirements at once. ### Use cases - Did the agent greet the customer, verify their identity, and offer a resolution? - Did the response cover all required talking points? - Did the conversation follow each step of a compliance checklist? ### **Implementation** **Criterion Source** - Choose where your criteria come from: - **From Test Case** - Pulls criteria automatically from each test case's Expected Behaviors field. This is useful when different test cases have different criteria. - **Static Criteria** - Define a fixed list of criteria directly on the metric. Every transcript is evaluated against the same set. **Custom Evaluation Prompt** (optional) - Provide additional instructions to guide how each criterion is evaluated. This lets you tailor the evaluation context without editing the criteria. 
**Additional Options**: - **Knowledge Base** - Enable to give the evaluator access to your knowledge base for more informed assessments. - **LLM Model** - Select which model performs the evaluation. - **Transcript Scope** - Limit evaluation to specific portions of the transcript. See [Transcript Scope](#transcript-scope) for configuration details. **Results**: Each run produces: - An overall score count and percentage of how many passed criteria. - A breakdown showing which criteria passed or failed with reasoning. - A summary explaining the overall evaluation. ### Understanding Result Types Each criterion is evaluated independently and returns one of three results: | Result | Meaning | |--------|---------| | **MET** | Clear evidence in the transcript that the criterion was satisfied | | **NOT_MET** | Evidence that contradicts or fails to satisfy the criterion | | **UNKNOWN** | Insufficient information to determine | > **Warning:** Getting **UNKNOWN** usually means your criterion is too vague or your evaluation prompt lacks context. The evaluator cannot find sufficient evidence in the transcript to make a determination. ### Writing Effective Custom Evaluation Prompts The **Custom Evaluation Prompt** field controls how the evaluator interprets each criterion. A well-written prompt provides context that helps the evaluator understand your domain and make accurate determinations. **Default Behavior**: Without a custom prompt, the evaluator uses semantic matching to determine if each criterion was met. This works well for straightforward criteria but may return UNKNOWN for domain-specific expectations. **When to Use a Custom Prompt**: - Your criteria reference domain-specific terminology - You need the evaluator to understand your agent's role or capabilities - You want to define what counts as "meeting" a criterion in your context **Custom Prompt Examples** **Healthcare Scheduling Agent:** ``` You are evaluating a healthcare scheduling assistant. 
The agent helps patients book, reschedule, and cancel appointments. It has access to provider availability and patient records. When evaluating criteria: - "Confirms appointment" means the agent stated the date, time, and provider name - "Verifies patient identity" means the agent asked for date of birth or member ID - "Offers alternatives" means the agent suggested at least one other available time slot ``` **Banking Support Agent:** ``` You are evaluating a banking support assistant. The agent handles account inquiries, transaction disputes, and card services. When evaluating criteria: - Account verification requires confirming at least 2 identity factors - "Explains fees" means stating the specific dollar amount and when it applies - Security disclosures must mention fraud protection and reporting procedures ``` **Prompt Structure Guidelines** 1. **State the agent's role** - What does the agent do? What information does it have access to? 2. **Define ambiguous terms** - What does "confirms" or "explains" mean in your context? 3. **Set evaluation standards** - What level of detail counts as meeting a criterion? **Poor prompt:** ``` Evaluate if the agent did a good job. ``` **Effective prompt:** ``` You are evaluating a restaurant reservation assistant. The agent books tables, manages waitlists, and answers questions about menu and hours. A criterion is MET when the agent provides the specific information requested. Partial or vague responses should be marked NOT_MET. If the conversation does not address the topic at all, mark as UNKNOWN. ``` ### Writing Effective Criteria The most common cause of inaccurate results is vague criteria. The evaluator uses semantic understanding, so equivalent meanings count as matches. However, it cannot infer intent from ambiguous statements. 
**The Specificity Formula** Good criteria follow this pattern: **[Actor] + [Specific Action] + [Specific Information/Outcome]** **Vague vs Specific Examples** | Scenario | Vague (Likely UNKNOWN) | Specific (Reliable) | |----------|------------------------|---------------------| | Appointment booking | "Agent schedules the appointment" | "Agent confirms the appointment date, time, and provider name" | | Account inquiry | "Agent explains the fees" | "Agent states the monthly fee amount and when it is charged" | | Password reset | "Agent helps with password" | "Agent sends a password reset link to the registered email address" | | Escalation | "Agent offers to escalate" | "Agent offers to transfer to a specialist when unable to resolve the issue" | **Why Vague Criteria Fail** Consider the criterion: "Agent explains the account options" This fails because: - "Account options" could mean account types, features, fees, or upgrades - The evaluator cannot determine which aspect you intended - Even if the agent discussed accounts, there's no way to verify the specific expectation was met **Rewritten**: "Agent explains the difference between checking and savings accounts, including minimum balance requirements" Now the evaluator can look for specific information about account types and balance requirements. **Balancing Specificity for Shared Test Sets** When sharing criteria between voice and chat test sets: 1. **Focus on WHAT should happen, not HOW** - Avoid: "Agent says 'I understand your concern'" - Use: "Agent acknowledges the customer's concern before proceeding" 2. **Use outcome-based criteria** - Avoid: "Agent reads the cancellation policy" - Use: "Agent confirms the customer understands the cancellation deadline" 3. **Avoid modality-specific language** - Avoid: "Agent clicks the submit button" - Use: "Agent completes the reservation request" ### Using Agent Evaluation Context Adding your agent's system prompt or context significantly improves evaluation accuracy. 
The evaluator performs better when it understands what your agent is supposed to do. Navigate to **Agent Settings > Evaluation Context** and add: - What the agent does and what information it has access to - Key policies or procedures it should follow - How it should handle common scenarios **Example Agent Context:** ``` This is a healthcare scheduling assistant that helps patients with: - Booking new appointments with available providers - Rescheduling existing appointments (requires 24-hour notice) - Canceling appointments - Answering questions about office locations and hours The agent should always: - Verify patient identity before making changes - Confirm appointment details before finalizing - Offer alternative times when the requested slot is unavailable ``` ### Troubleshooting UNKNOWN Results If you're getting UNKNOWN results: 1. **Improve your custom prompt** - Add domain context and define what "meeting" a criterion means in your use case 2. **Check criterion specificity** - Is the criterion concrete enough to verify against the transcript? 3. **Add agent evaluation context** - Does the evaluator understand what the agent is supposed to do? 4. **Review the transcript** - Is the expected information actually present in the conversation? 5. **Split compound criteria** - Break "Agent explains X and confirms Y" into two separate criteria --- ## Tool Call Metrics **Purpose**: Evaluate whether AI agent tool calls (functions) were executed correctly ### Prompt Structure Template ``` Given the conversation transcript, [SPECIFIC TOOL CALL EVALUATION QUESTION]? 
Return YES if: • [Tool call execution criterion 1] • [Tool call execution criterion 2] • [Tool call execution criterion 3] Return NO if: • [Tool call failure condition 1] • [Tool call failure condition 2] • [Edge case for incorrect usage] [CLARIFICATIONS FOR TOOL CALL CONTEXT] ``` ### Example 1: Function Call Accuracy ``` Given the conversation transcript, did the assistant correctly execute the search function with the appropriate parameters? Return YES if: • The search function was called when the user requested information lookup • All required parameters (query, filters) were properly populated • The function call syntax and format were correct • The assistant used the search results appropriately in their response Return NO if: • The search function was called unnecessarily or at wrong times • Required parameters were missing or incorrectly formatted • The function call failed due to syntax errors • The assistant ignored or misused the function results Note: Focus on the technical execution of the tool call, not the quality of the response content. ``` ### Example 2: API Integration Validation ``` Given the conversation transcript, did the assistant properly use the customer lookup API when handling account inquiries? Return YES if: • The API was called only when customer account information was needed • Customer identifier (email, phone, or account number) was correctly passed as parameter • The assistant handled API response data appropriately • Proper error handling was demonstrated if API call failed Return NO if: • The API was called without sufficient customer identification • Wrong parameters were passed to the lookup function • The assistant proceeded without waiting for API response • API errors were not handled gracefully Note: Evaluate the technical integration, not the customer service quality. ``` --- ## API State Matcher **Purpose**: Evaluate the assistant by validating real-world system outcomes via an external API. 
### Implementation

- Add the URL of the API endpoint.
- Select `GET` for simple lookups or `POST` if the API requires a body.
- Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable.

  **Template Patterns**: `{{expected_output.balance}}`, `{ "status": "success" }`, `completed`
- Match path (optional): A dot-notation path used to extract a specific field from the API response.
- Timeout (optional): Maximum wait time for the API response before marking the metric as failed.
- Headers (optional): Custom HTTP headers sent with the request.

### Use Cases

- Verify the agent produced the correct structured output.
- Validate mocked API responses in simulations.
- Check tool-call results in real services.

### How it works

- An HTTP request is sent to the specified API endpoint.
- The response body is inspected (optionally at a specific JSON path).
- The extracted value is compared against your Expected Body.
- Returns 1 if the response body matches the expected value, otherwise returns 0.

---

## Match Expected Simulation Wrapper

**Purpose**: Evaluates an assistant by comparing data captured during the simulation against an expected value.

> **Info:** Instead of calling an external API like [API State Matcher](concepts/metrics/prompting#api-state-matcher), this metric inspects simulation wrapper observations (pre- or post-simulation)
> and verifies that the recorded response matches expectations.

### Implementation

- Select observations (pre- or post-simulation).
- Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable.

  **Template Patterns**: `{{expected_output.balance}}`, `{ "status": "success" }`, `completed`

### Use cases

- Verify the agent produced the correct structured output.
- Validate mocked API responses in simulations.
- Test tool-call results without affecting real services.
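Both this metric and API State Matcher come down to the same comparison: take a body, optionally extract one field with a dot-notation path, and compare it to the Expected Body. The Python sketch below illustrates that idea under stated assumptions: `extract` and `state_matches` are hypothetical names, only nested objects (not arrays) are traversed, and treating an unresolvable path as a non-match is a guess rather than documented behavior.

```python
def extract(body, match_path=None):
    """Follow a dot-notation path such as "account.balance" through nested dicts."""
    value = body
    if match_path:
        for key in match_path.split("."):
            value = value[key]
    return value


def state_matches(body, expected, match_path=None):
    """Return 1 if the (optionally extracted) value equals `expected`, else 0."""
    try:
        actual = extract(body, match_path)
    except (KeyError, TypeError):
        # Assumption: a path that cannot be resolved counts as a mismatch.
        return 0
    return 1 if actual == expected else 0


response = {"account": {"balance": 120.5, "status": "active"}}

print(state_matches(response, 120.5, "account.balance"))    # 1
print(state_matches(response, "closed", "account.status"))  # 0
print(state_matches({"status": "success"}, {"status": "success"}))  # 1
```

Without a match path the whole body is compared, which corresponds to supplying a full JSON object as the Expected Body.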
### How it works

- The metric reads the selected wrapper observation (for example, an API pre-simulation or post-simulation payload).
- It extracts a value using a match path and compares the result to the Expected Body.
- It returns 1 if the extracted value matches the expected value; otherwise it returns 0.

---

## Metadata Field Metric

**Purpose**: Reports the value of a run's metadata field, retrieved from the custom metadata using a specified key. The value may be a number, text, or boolean.

### Implementation

1. Select the metadata field type: **string**, **float**, or **boolean**.
2. Input the metadata field key.

> **Warning:** This metric works only if you send metadata as part of the transcripts you evaluate with Coval.
> It takes the specified metadata field's value and outputs it as the metric result.

### Use Cases

- Track custom business metrics (e.g. customer satisfaction scores, call type).
- Monitor agent performance indicators passed through metadata.
- Extract conversation context data for analysis.
- Aggregate custom KPIs from your application.
- Track boolean flags (e.g. escalation occurred, customer authenticated, issue resolved).

### How It Works

- The metric returns the exact value stored in the specified metadata field.
- Values are aggregated automatically across multiple conversations.
- The value is extracted directly from the field, with no LLM processing required.
- Numeric, text, and boolean metadata values are supported.
- Boolean values are output as float (0.0 for false, 1.0 for true) for proper metric aggregation.

---

## Transcript Regex Match Metrics

**Purpose**: Pattern detection for exact phrase matching, compliance validation, and format verification.

### Implementation

Configure the **Regex Pattern** field (required) and any of the optional fields listed under Configuration Fields. No text prompt is required for this metric type.
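As a rough mental model, the metric can be pictured as a filter followed by a regex search. The Python sketch below is an assumption-laden illustration, not Coval's code: `regex_metric` is a hypothetical function, messages are modeled as `(role, text)` pairs, and the parameters simply mirror the fields documented in the Configuration Fields table of this section.

```python
import re


def regex_metric(messages, pattern, role=None, match_mode="presence",
                 position="any", case_insensitive=False):
    """Return 1 or 0 for a regex check over (role, text) transcript messages."""
    # Role filter: keep every message when no role is given.
    texts = [text for r, text in messages if role is None or r == role]

    # Position filter: narrow to the first or last remaining message.
    if position == "first":
        texts = texts[:1]
    elif position == "last":
        texts = texts[-1:]

    flags = re.IGNORECASE if case_insensitive else 0
    found = any(re.search(pattern, text, flags) for text in texts)

    # Absence mode inverts the result: 1 means the pattern never appeared.
    if match_mode == "absence":
        return 0 if found else 1
    return 1 if found else 0


transcript = [
    ("AGENT", "Hello, this call may be recorded."),
    ("PERSONA", "Hi, my number is 555-123-4567."),
    ("AGENT", "Thank you for calling, goodbye!"),
]

print(regex_metric(transcript, r"this call may be recorded",
                   role="AGENT", position="first", case_insensitive=True))  # 1
print(regex_metric(transcript, r"\b(guarantee|promise|definitely)\b",
                   role="AGENT", match_mode="absence", case_insensitive=True))  # 1
print(regex_metric(transcript, r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
                   role="PERSONA"))  # 1
```

Because the check is plain pattern matching, the same inputs always produce the same result, which makes patterns easy to test locally before configuring them.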
### Configuration Fields | Field | Required | Default | Description | |-------|----------|---------|-------------| | **Regex Pattern** | Yes | — | Regular expression pattern to match against the transcript | | **Role** | No | All messages | Filter by speaker role: `AGENT`, `PERSONA`, `TOOL`, `SYSTEM`, or `MUSIC` | | **Match Mode** | No | `presence` | `presence` returns 1 if pattern is found; `absence` returns 1 if pattern is NOT found | | **Position** | No | `any` | `any` checks all messages, `first` checks only the first message, `last` checks only the last message (of the filtered role) | | **Case Insensitive** | No | `false` | When enabled, pattern matching ignores case | ### Pattern Design Guidelines - Use word boundaries **(`\b`)** for exact word matching. - Enable **Case Insensitive** matching instead of using inline `(?i)` flags for clarity. - Use **Position** filtering instead of complex anchoring when you only care about the first or last message. - Use **Absence** mode for compliance rules ("agent must not say X") instead of trying to negate patterns in regex. Test patterns thoroughly before deployment. ### Use Case Examples #### Example 1: Greeting Detection **Goal**: Detect if the agent uses a proper greeting phrase. **Regex Pattern**: `\b(hello|hi|good morning|good afternoon|good evening)\b` **Role**: `AGENT` **Case Insensitive**: Enabled **Returns**: 1 if greeting found, 0 if no greeting detected. #### Example 2: Required Disclosure in First Message **Goal**: Verify the agent states a required disclosure at the start of the conversation. **Regex Pattern**: `this call may be recorded` **Role**: `AGENT` **Position**: `first` **Case Insensitive**: Enabled **Returns**: 1 if disclosure is in the first agent message, 0 if missing. #### Example 3: Prohibited Language (Compliance) **Goal**: Ensure the agent never makes unauthorized promises. 
**Regex Pattern**: `\b(guarantee|promise|definitely)\b|\b100%`

**Role**: `AGENT`

**Match Mode**: `absence`

**Case Insensitive**: Enabled

**Returns**: 1 if the agent did NOT use prohibited language (pass), 0 if prohibited language was found (fail).

#### Example 4: Closing Statement in Last Message

**Goal**: Verify the agent ends the conversation with a proper closing.

**Regex Pattern**: `(goodbye|have a (great|nice|lovely) day|thank you for calling)`

**Role**: `AGENT`

**Position**: `last`

**Case Insensitive**: Enabled

**Returns**: 1 if closing statement found in last agent message, 0 if missing.

#### Example 5: Phone Number Format Validation

**Goal**: Detect when the user provides a phone number in standard US format.

**Regex Pattern**: `\b\d{3}[-.]?\d{3}[-.]?\d{4}\b`

**Role**: `PERSONA`

**Returns**: 1 if valid format detected, 0 if invalid or missing.

### How It Works

1. The metric filters transcript messages by **Role** (if specified). If no role is set, all messages are checked.
2. The **Position** filter is applied: `first` keeps only the first message of the filtered role, `last` keeps only the last.
3. The **Regex Pattern** is matched against the filtered messages, with **Case Insensitive** applied if enabled.
4. The **Match Mode** determines the result:
   - `presence`: returns 1 if the pattern was found, 0 if not.
   - `absence`: returns 1 if the pattern was NOT found, 0 if it was.
5. The check is direct pattern matching with no LLM involved, so it is fast and deterministic.

---

### Words Per Message (Threshold)

**Purpose**: Validates that all agent messages meet a configurable word count requirement.

**What it measures**: Whether every agent message satisfies a word count condition, for example "all messages must have fewer than 50 words" or "all messages must have at least 5 words."
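A minimal Python sketch of such a check follows. It is illustrative only: the `words_per_message` function, the operator strings, and counting words by splitting on whitespace are assumptions, since the document does not specify how words are tokenized or how the operator is expressed.

```python
import operator

# Hypothetical operator names mapped to comparison functions.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def words_per_message(agent_messages, op, threshold):
    """Return ("YES" | "NO", failures) for a word-count condition.

    `failures` lists (message_index, word_count) for each violating message.
    """
    compare = OPERATORS[op]
    failures = []
    for index, text in enumerate(agent_messages):
        count = len(text.split())  # assumption: whitespace-delimited words
        if not compare(count, threshold):
            failures.append((index, count))
    return ("YES" if not failures else "NO"), failures


messages = [
    "Sure, I can help with that.",
    "",
    "Your appointment is confirmed for Monday at 3 PM.",
]

print(words_per_message(messages, "<", 50))  # ('YES', [])
print(words_per_message(messages, ">=", 5))  # ('NO', [(1, 0)])
```

The empty second message passes the "fewer than 50 words" rule but fails "at least 5 words", which is the kind of empty response this metric is meant to surface.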
**When to use**: - Enforcing response length guidelines (e.g., keeping answers concise) - Detecting unexpectedly short or empty responses - Validating that the agent doesn't produce overly verbose replies **How it works**: Counts words in each agent message and checks whether all messages satisfy the configured operator and threshold. Returns YES only if every message passes; NO if any message fails. **How to interpret**: - **YES** = all agent messages meet the word count condition. - **NO** = at least one message violated the condition. The detail view identifies which messages failed and their word counts. --- ## Customized Audio Metrics ### Custom Pause Analysis **Purpose**: Measures how frequently the agent pauses mid-speech and how long those pauses are. **What it measures**: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. **When to use**: - Identifying unnatural or excessive hesitations in agent speech - Detecting processing delays that manifest as in-speech pauses - Evaluating speech fluency across different configurations **How it works**: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded. **How to interpret**: - Lower values indicate more fluent speech. - The detail view shows each pause with its timestamp and duration. - Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts. ### Volume Variance **Purpose**: Measures how consistently the agent maintains volume throughout the conversation. **What it measures**: Standard deviation of audio volume (in dB) across agent speech — lower values indicate more consistent volume. 
**When to use**:

- Identifying erratic loudness changes in agent speech
- Ensuring consistent audio quality across a call
- Comparing voice model configurations for volume stability

**How it works**: Divides agent speech into fixed-length intervals and measures the volume of each. Intervals are flagged as too loud or too soft based on absolute dBFS thresholds. The primary score is the standard deviation across all intervals.

Three presets control how sensitive the loud/soft flagging is:

| Preset | Loud threshold | Soft threshold |
|--------|---------------|----------------|
| `strict` | above -3 dBFS | below -30 dBFS |
| `normal` (default) | above -6 dBFS | below -35 dBFS |
| `lenient` | above -9 dBFS | below -40 dBFS |

You can also override individual settings with `loud_threshold_db`, `soft_threshold_db`, or `interval_seconds`.

**How to interpret**:

- Lower standard deviation = more consistent volume.
- The detail view shows only the problematic intervals (too loud or too soft) with their timestamps and dB values.

### Abrupt Pitch Changes

**Purpose**: Detects sudden, jittery transitions in pitch that can make speech sound unnatural.

**What it measures**: Distinct segments where pitch changes abruptly between frames, reported as events per minute.

**When to use**:

- Detecting unnatural speech characteristics in synthesized voices
- Identifying voice models with unstable or jittery pitch
- Comparing voice configurations for smoothness

**How it works**: Compares pitch values frame-by-frame, flags frames where the change exceeds a threshold, and groups consecutive flagged frames into segments.

**Configuration** (via metric metadata):

| Parameter | Default | Description |
|-----------|---------|-------------|
| `significant_changes_threshold_hz` | `200.0` | Minimum pitch change in Hz to consider a transition abrupt |

**How to interpret**:

- Lower values indicate smoother, more natural pitch transitions.
- Higher values suggest jittery or unstable pitch.
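The frame-by-frame comparison and grouping described for Abrupt Pitch Changes can be sketched as below. This is a minimal illustration under stated assumptions: the input is one pitch value in Hz per voiced frame, a jump equal to the threshold counts as abrupt, and the function names are hypothetical rather than Coval's own.

```python
def abrupt_pitch_segments(pitch_hz, threshold_hz=200.0):
    """Group consecutive frames whose pitch jump from the previous frame
    is at least threshold_hz. Returns inclusive (start, end) frame indices."""
    segments = []
    start = None
    for i in range(1, len(pitch_hz)):
        jump = abs(pitch_hz[i] - pitch_hz[i - 1])
        if jump >= threshold_hz:
            if start is None:
                start = i  # first flagged frame of a new segment
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:  # segment still open at the end of the clip
        segments.append((start, len(pitch_hz) - 1))
    return segments


def events_per_minute(segments, total_frames, frame_seconds=0.01):
    """Normalise the segment count by clip duration in minutes."""
    minutes = total_frames * frame_seconds / 60.0
    return len(segments) / minutes if minutes else 0.0
```

Grouping matters because a single glitch usually spans several frames; counting segments rather than frames keeps one glitch from being reported as many events.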
### Volume/Pitch Misalignment

**Purpose**: Detects moments where pitch and volume move in opposite directions, which can indicate unnatural prosody in synthesized speech.

**What it measures**: Frames where the pitch is rising while volume is falling (or vice versa), scored by severity relative to the clip's own baseline.

**When to use**: Identifying unnatural-sounding speech output, for example a voice that gets louder while its pitch drops unexpectedly, or vice versa. Useful for:

- Evaluating text-to-speech engine quality
- Detecting prosody issues that may sound "off" to listeners
- Comparing voice model configurations

**How it works**: Analyzes frame-by-frame pitch and volume changes across the audio. Frames where the two signals diverge in opposite directions are flagged. Each event receives a severity score based on how unusual the divergence is relative to the rest of the clip (using z-scored magnitudes), making the metric robust across different speakers and recording conditions.

**Configuration** (via metric metadata):

| Parameter | Default | Description |
|-----------|---------|-------------|
| `min_volume_change_for_pitch_misalignment` | `7` | Minimum intensity change (dB) required to flag a misalignment event |

**How to interpret**: Severity scores are **relative to the clip**, not absolute. A higher score means both pitch and volume were moving unusually for this speaker in this recording.

- **Low severity (~0–1)**: Both signals are near their mean change magnitude, so nothing is unusual relative to the speaker's baseline.
- **Medium severity (~1–2)**: One or both signals are about 1 standard deviation above their clip mean.
- **High severity (~2–6+)**: Both signals are 2+ standard deviations above their clip mean, which marks a genuinely unusual frame.

Because severity is z-score based, values are comparable across different speakers and recording conditions.
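To make the z-scored severity idea concrete, here is a small sketch. It assumes per-frame pitch and volume deltas are already available, and it combines the two z-scores by averaging them and flooring at zero; that combining rule, like the function name, is an assumption made for illustration, since the exact formula is not documented here.

```python
import statistics


def misalignment_events(pitch_delta, volume_delta, min_volume_change_db=7.0):
    """Return (frame_index, severity) for frames where pitch and volume
    move in opposite directions and the volume change is large enough."""
    pitch_mag = [abs(d) for d in pitch_delta]
    volume_mag = [abs(d) for d in volume_delta]
    p_mean, p_std = statistics.fmean(pitch_mag), statistics.pstdev(pitch_mag)
    v_mean, v_std = statistics.fmean(volume_mag), statistics.pstdev(volume_mag)

    events = []
    for i, (dp, dv) in enumerate(zip(pitch_delta, volume_delta)):
        opposite = dp * dv < 0  # one rising, the other falling
        if not opposite or abs(dv) < min_volume_change_db:
            continue
        z_pitch = (abs(dp) - p_mean) / p_std if p_std else 0.0
        z_volume = (abs(dv) - v_mean) / v_std if v_std else 0.0
        # Assumed combining rule: mean of the two z-scores, floored at 0.
        events.append((i, max(0.0, (z_pitch + z_volume) / 2)))
    return events
```

Because each magnitude is standardised against the clip's own mean and standard deviation, a naturally animated speaker needs a larger divergence than a flat one to reach the same severity.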
### Non-Expressive Pauses **Purpose**: Identifies pauses in speech that lack preparatory pitch movement, which can make the agent sound flat or monotone. **What it measures**: Pauses above a minimum duration where pitch shows little variation in the frames immediately before the pause, reported as events per minute. **When to use**: - Evaluating whether a voice sounds expressive and natural - Detecting monotone delivery in synthesized speech - Comparing voice configurations for expressiveness **How it works**: Detects pauses above a minimum duration threshold, then examines the pitch trajectory in the frames immediately preceding each pause. Pauses with minimal pitch variation beforehand are flagged as non-expressive. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `min_pause_duration_seconds` | `0.6` | Minimum silence duration (s) to qualify as a pause | | `pre_pause_window` | `5` | Number of 10ms frames to inspect before each pause for pitch movement | **How to interpret**: - Lower values indicate more expressive delivery — pitch varies naturally before pauses. - Higher values suggest a flat or robotic cadence where pauses arrive without natural pitch cues. ### Vocal Fry **Purpose**: Detects vocal fry — a low, creaky speech quality, typically occurring at the end of phrases. **What it measures**: Total time spent in vocal fry (in seconds), with additional detail on percentage of affected speech and longest continuous fry segment. **When to use**: - Evaluating whether a voice has creaky or rough-sounding artifacts - Monitoring vocal quality across different voice configurations - Identifying voices where fry affects listener experience **How it works**: Identifies frames with simultaneously low pitch, high acoustic roughness, and irregular vocal cord vibration. Consecutive flagged frames are grouped into fry segments. 
**Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `sample_rate_seconds` | `0.01` | Analysis frame rate in seconds | | `pitch_floor` | `60` | Minimum pitch frequency (Hz) for detection | | `pitch_ceiling` | `400` | Maximum pitch frequency (Hz) for detection | | `low_pitch_threshold_multiplier` | `0.6` | Fraction of speaker's median pitch below which a frame is considered low-pitched | | `jitter_threshold_multiplier` | `2.0` | Multiple of baseline jitter above which a frame is flagged | | `harmonics_to_noise_ratio_threshold_offset_db` | `-10.0` | dB offset below baseline HNR that marks a frame as noisy | | `harmonics_to_noise_ratio_minimum_pitch` | `60` | Minimum pitch for HNR calculation (Hz) | | `harmonics_to_noise_ratio_silence_threshold` | `0.1` | Amplitude threshold below which frames are treated as silent | | `harmonics_to_noise_ratio_periods_per_window` | `1.0` | Analysis window size in pitch periods for HNR | | `baseline_calculation_multiplier` | `0.8` | Fraction of median pitch used to define the "clear voice" baseline for HNR and jitter | | `min_fry_segment_seconds` | `0.05` | Minimum duration (s) for a fry segment to be counted | **How to interpret**: - Total time in vocal fry (seconds). Lower is better. - Occasional brief fry is common in natural speech; sustained or frequent fry may reduce perceived quality. ### Spectrogram Pitch Analysis **Purpose**: Evaluates whether audio contains natural upper-frequency content, which is a key indicator of voice naturalness. Synthetic or bandwidth-limited audio often lacks energy in higher frequency ranges. **What it measures**: The fraction of upper-frequency spectrogram bins that have energy above a noise floor, averaged across analysis windows. Returns **1.0 (pass)** or **0.0 (fail)** based on whether the average fill ratio meets the naturalness threshold. 
**When to use**: - Detecting bandwidth-limited or muffled synthesized speech - Comparing voice model configurations for spectral richness - Identifying voices that lack harmonic upper-frequency energy **How it works**: Splits the audio into fixed-length windows and computes a frequency spectrum for each. The fraction of bins in the upper frequency region that exceed the noise floor is measured per window. If the average fill ratio across all windows meets the naturalness threshold, the metric passes. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `naturalness_threshold` | `0.10` | Minimum average fill ratio (0.0–1.0) to pass | | `upper_region_percentage` | `0.25` | Fraction of the frequency range treated as the upper region | | `noise_floor_db` | `-15.0` | dB level above which a bin counts as filled | | `segment_length_seconds` | `2.0` | Duration of each analysis window | **How to interpret**: - **1.0** = pass — average upper-frequency fill ratio meets the naturalness threshold. - **0.0** = fail — audio lacks sufficient upper-frequency energy. - The detail view shows the fill ratio per window across the recording timeline. --- ## Using Trace Context in LLM Judge Metrics **Purpose**: Give an LLM Judge or Composite Evaluation metric visibility into what your agent actually did — not just what it said — by including OpenTelemetry span data alongside the transcript. When **Include Traces** is enabled on a custom transcript scope, the judge automatically receives a `TRACE CONTEXT:` block appended to its prompt. This block summarizes the OTel spans from the conversation: span names, timing windows, and key attributes like tool call names and function arguments. For **Composite Evaluation** metrics (Expected Behaviors), the same `TRACE CONTEXT:` block is appended to every per-criterion prompt — each criterion is evaluated against both the transcript and the trace spans inside the configured scope. 
### Walkthrough [Video: Loom Video](https://www.loom.com/embed/17d8a2dcb55e46b49cde11c515acc658) ### When to Enable Include Traces Trace context is most valuable when the behavior you want to evaluate isn't visible in the transcript alone: | Use Case | Why Traces Help | |----------|----------------| | Verify the agent used the right tools in the right order | Tool call spans show what functions were invoked and with what arguments | | Catch hallucinations — agent claimed to do something it didn't | Trace spans show whether the action actually occurred | | Evaluate retrieval quality | Retrieval spans show what data was fetched before the agent responded | | Assess error handling | Error spans reveal failures the agent may have silently recovered from | ### How to Enable 1. Open or create a supported metric — LLM Judge (Binary, Numerical, Categorical, or Audio) or Composite Evaluation. 2. Set **Transcript Scope** to **Custom**. 3. In the custom scope configuration panel, toggle **Include Traces** on. The trace context is appended automatically — no changes to your judge prompt are required, though you can reference it explicitly for better results. ### Requirements - Your agent must emit OpenTelemetry traces to Coval. See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup. - The simulation must have produced trace data. If no trace data is available, the toggle has no effect and the prompt is sent without a trace context block. ### Writing Prompts That Leverage Trace Context When writing prompts for metrics with trace context enabled, reference the trace data explicitly. The judge sees a `TRACE CONTEXT:` block appended after the transcript — you can instruct it to reason about both sources. #### Example: Verify Tool Usage ``` Given the transcript and trace context, did the assistant call the `lookup_account` function before providing account balance information? 
Return YES if: • The TRACE CONTEXT shows a tool call to `lookup_account` (or equivalent) occurring before the agent stated the balance • The transcript confirms the agent provided balance details Return NO if: • The agent mentioned account balance information but no `lookup_account` tool call appears in the TRACE CONTEXT • The tool call appears AFTER the agent has already stated the balance (out of order) • The TRACE CONTEXT shows a failed or missing tool call for this operation Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone. ``` #### Example: Catch Hallucination ``` Given the transcript and trace context, did the assistant accurately report what actions it took? Return YES if: • All actions the assistant claims to have performed appear in the TRACE CONTEXT as actual tool or function calls Return NO if: • The assistant stated it performed an action (e.g., "I've updated your address") but no corresponding tool call appears in the TRACE CONTEXT • The TRACE CONTEXT shows an error or missing call for an action the assistant claimed was successful Note: Minor phrasing differences between the transcript and trace data are acceptable — evaluate intent. ``` > **Tip:** Add "Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone" to your prompt. This makes the metric degrade gracefully on simulations where traces weren't captured. --- ## Utilizing Attributes You can embed dynamic values from agents, test cases, and simulations into your metric prompts using template variables. This allows you to create context-aware metrics that adapt to specific agent configurations or test case requirements. For comprehensive documentation on using attributes, including nested paths, array indexing, dynamic keys, and complete examples, see [Attributes](/concepts/attributes/overview). --- ## Advanced Prompting Techniques ### 1. **Chain of Thought for Complex Evaluations** ``` Before making your final determination, consider: 1. 
What was the user's primary goal? 2. What actions did the assistant take? 3. What was the final outcome? 4. Did the outcome match the user's goal? Based on this analysis, did the assistant successfully resolve the user's issue? ``` ### 2. **Few-Shot Examples for Edge Cases** ``` Examples of what constitutes resolution: • User: "That fixed it, thanks!" → YES • User: "I'll try that and call back if needed" → YES • User: "This is too complicated, forget it" → NO • User hangs up without confirmation → NO Given the transcript, did the assistant successfully resolve the user's issue? ``` ### 3. **Hierarchical Decision Making** ``` First, determine if the assistant attempted to address the user's concern: - If no attempt was made → Return NO - If attempt was made → Continue to step 2 Second, evaluate if the attempt was successful: - If user confirmed satisfaction → Return YES - If user remained unsatisfied → Return NO - If outcome unclear → Return NO (err on conservative side) ``` --- ## Using Agent Attributes and Test Case Attributes You can make your metric prompts more dynamic and context-aware by referencing agent attributes and test case attributes. This allows you to create metrics that evaluate agent performance against specific agent configurations or test case requirements. ### Agent Attributes Agent attributes are custom properties you define for each [agent configuration](/concepts/agents/overview#attributes). **How to use agent attributes in metric prompts:** Insert `{{agent.attribute_name}}` anywhere in your metric prompt. The system will automatically replace this placeholder with the actual attribute value from the agent being evaluated. **Example 1: Business Hours Verification** ``` Given the transcript, did the assistant provide the correct opening hours? 
The correct opening hours are: {{agent.opening_hours}} Return YES if: • The assistant stated the opening hours as {{agent.opening_hours}} • The assistant provided opening hours equivalent to the correct value (e.g., "9 AM to 5 PM" matches "9:00am-5:00pm") Return NO if: • The assistant provided different opening hours than {{agent.opening_hours}} • The assistant claimed not to know the opening hours • The assistant provided incorrect or conflicting information ``` ### Test Case Attributes For a test case with attributes like: ```json { "source": "LAX", "destination": "SFO", "ticket_class": "business" } ``` You could create a metric prompt: ``` Given the transcript, did the assistant correctly process the flight booking request? The booking details are: - Source: {{test_case.source}} - Destination: {{test_case.destination}} - Ticket Class: {{test_case.ticket_class}} Return YES if: • The assistant confirmed all three details correctly (source, destination, and ticket class) • The assistant used the exact values: {{test_case.source}}, {{test_case.destination}}, and {{test_case.ticket_class}} Return NO if: • Any of the three details were incorrect or missing • The assistant confused source and destination • The assistant used a different ticket class than {{test_case.ticket_class}} ``` ### Combining Agent and Test Case Attributes You can use both agent attributes and test case attributes in the same metric prompt, for example referencing `{{agent.opening_hours}}` alongside `{{test_case.destination}}` within one set of YES/NO criteria. --- ## Knowledge Base Metrics Coval allows you to connect a knowledge base (KB) to your agent and create LLM Judge metrics that use your knowledge base as context. This enables you to track accuracy on specific articles, knowledge bases, or different flows mentioned in your documentation. > **Info:** Knowledge bases are configured on the agent, not the metric. See [Knowledge Base](/concepts/agents/knowledge-base) for how to add entries and supported source types.
**Use cases for KB metrics:**

- Verify agents answer questions using approved knowledge base content.
- Track accuracy across different documentation sources.
- Ensure compliance with specific information in FAQs, policies, or procedures.
- Monitor whether agents provide consistent responses based on authoritative sources.

> **Tip:** KB metrics are particularly valuable for customer service agents, healthcare bots, or any application where accuracy against documented information is critical.

### Setting Up Your Knowledge Base

**Step 1: Navigate to Agent Configuration**

1. Go to your **Agent** setup page
2. Select the agent you want to connect to a knowledge base
3. Scroll down to the **Knowledge Base** section

**Step 2: Add Knowledge Base Entries**

Coval supports multiple knowledge base formats:

1. Click "Add Knowledge Base Entry"
2. Select your file type
3. Upload your file (Coval will automatically parse it)
4. Add a descriptive name (e.g., "Hotel FAQ", "Product Documentation")
5. Optionally add tags for organization
6. Click "Upload"

![image.png](/images/image.png)

All uploaded entries will appear in your knowledge base list, associated with the selected agent.

### Creating Knowledge Base Metrics

**Step 1: Create a New Metric**

1. Navigate to the **Metrics** section
2. Click "Create New Metric"
3. Select **Binary LLM Judge** as the metric type
4. Name your metric (e.g., "FAQ Knowledge Base Accuracy")

**Step 2: Write Your LLM Judge Prompt**

Structure your prompt to evaluate whether the agent used knowledge base information correctly:

**Example Prompt Structure:**

```
Given the transcript, did the assistant answer the user's initial question accurately using information from the Hotel FAQ knowledge base?

Return YES if:
- The assistant provided specific FAQ details that are factually correct (exact addresses, dollar amounts, precise policies, named amenities)
- Core facts match the FAQ even if paraphrased (e.g., "4:00pm check-in" can be stated as "check-in at 4 PM")
- The response directly addresses the user's initial question with accurate FAQ information

Return NO if:
- The assistant provided information that contradicts the FAQ (e.g., claiming there is a pool when FAQ states there is no pool)
- The assistant gave generic responses without specific FAQ details
- The assistant fabricated information not contained in the FAQ
- The assistant claims lack of information when the FAQ contains the answer
- The initial question remains unanswered despite FAQ coverage
- The assistant provided factually incorrect information, even if detailed and specific

Return Unknown if:
- The user's question is not covered in the FAQ

**Critical: Prioritize factual accuracy over response detail. A detailed but incorrect answer must return NO.**
```

> **Tip:** **Best Practice:** Be specific about what constitutes accurate vs. inaccurate responses based on your knowledge base. Include edge cases where the KB might not have complete information.

**Step 3: Enable Knowledge Base Context**

**Critical step:** At the bottom of the metric configuration:

1. Locate the **Knowledge Base** toggle (initially disabled)
2. **Enable** the Knowledge Base option
3. The system will automatically include your knowledge base as context when evaluating

> **Warning:** **Critical**: If you don't **enable the Knowledge Base toggle**, the metric will evaluate without KB context and may produce inaccurate results.

**Step 4: Save Your Metric**

1. Review your prompt and settings
2. Click "Create Metric"
3. Your KB metric is now ready to use in simulations and conversations

### Using Knowledge Base Metrics in Evaluations

**In Simulations**

1.
Create or select a test set with scenarios that should use KB information 2. Launch a simulation (or use a template) 3. Select your KB accuracy metric in the metrics list 4. Run the simulation **In Conversations** 1. Set your KB metric as a **Default Metric** to run on all incoming transcripts 2. Create **Metric Rules** to apply KB metrics conditionally 3. Monitor results in real-time to catch KB accuracy issues in production ## Best Practices for Knowledge Base Metrics ### Writing Effective Prompts **Do:** - Be explicit about what information should come from the KB. - Define clear conditions for YES and NO responses. - Account for situations where the KB doesn't have complete information. - Consider partial accuracy vs. complete inaccuracy. **Don't:** - Make assumptions about what the LLM knows without KB context. - Create overly complex evaluation criteria. ### Knowledge Base Organization **Recommended structure:** - Use clear, descriptive names for each KB entry. - Add tags to categorize different types of information. - Keep individual KB files focused on specific topics. - Update KB entries regularly to reflect current information. ## Metric Validation and Testing ### 1. **Metric Improvement Process** - Use Coval's "Improve Metric" feature with test transcripts. - Iterate on prompts to reduce variance. - Test edge cases and ambiguous scenarios. - Aim for \>90% consistency across similar evaluations. ### 2. **Common Issues and Solutions** | Issue | Solution | | -------------------- | -------------------------------------------------- | | Inconsistent scoring | Add more specific criteria and examples | | Edge case failures | Include explicit handling for boundary conditions | | LLM hallucination | Use more structured prompts with clear constraints | | Low correlation | Ensure metric measures what you intend to measure | ### 3. **Performance Optimization** - Keep prompts under 2,000 characters when possible. - Use regex metrics for simple pattern detection. 
- Combine related evaluations into single metrics when logical. - Test with diverse conversation types and lengths. ## Best Practices Summary For Creating Metric Prompts - Use specific, measurable criteria. - Provide clear positive and negative examples. - Test extensively with real conversation data. - Maintain consistent terminology and structure. - Include edge case handling. This systematic approach to metric creation will ensure reliable, actionable insights from your Coval evaluations. --- ## Custom Trace Metrics Source: https://docs.coval.ai/concepts/metrics/custom-trace-metrics Extract numerical values from OpenTelemetry spans to measure custom latency, performance, and behavior signals. ## Walkthrough [Video: Loom Video](https://www.loom.com/embed/54f0c0062ea045ceb65d8feecc9cbd92) ## Overview Custom Trace Metrics let you extract a specific numerical value from your agent's OpenTelemetry spans and aggregate it across all turns in a simulation. Use Custom Trace Metrics when you have a signal already captured in your traces — latency measurements, confidence scores, token counts, retry attempts — that you want to track and trend across runs. ## Prerequisites Your agent must be instrumented with OpenTelemetry and sending spans to Coval. See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup instructions. If traces are not present for a simulation, the metric will report an error at execution time. ## Configuration When creating a Custom Trace Metric, configure three fields: | Field | Description | |-------|-------------| | **Span Name** | The name of the OTel span to query (e.g. `llm`, `tts`, `stt`, `llm_tool_call`, or any custom span name you emit). | | **Metric Attribute** | The span attribute to extract the value from (e.g. `retrieval_latency_ms`, `confidence_score`, or another custom numeric attribute key). | | **Aggregation Method** | How to aggregate the extracted values across all matching spans in the simulation. 
| ### Aggregation Methods | Method | Description | |--------|-------------| | **Average** | Mean value across all matching spans. Best for typical-case latency or scores. | | **Median** | Median value across all matching spans. More robust to outliers than average. | | **p90** | 90th-percentile value. Best for understanding worst-case performance at scale. | | **p95** | 95th-percentile value. Useful for tail latency on larger samples. | | **p99** | 99th-percentile value. Useful for rare but severe latency spikes. | | **Max** | Maximum value observed across all matching spans. Useful for worst-case detection. | | **Min** | Minimum value observed across all matching spans. | | **Sum** | Total value across all matching spans. Useful for token counts, cost-like counters, and accumulated work. | | **Count** | Number of matching spans. Useful for tool calls, retries, fallbacks, handoffs, and critical events. | | **Error Rate** | Percentage of matching spans with an error status. | | **Success Rate** | Percentage of matching spans with a successful status. | For `count`, `error_rate`, and `success_rate`, the metric can aggregate matching spans directly. For numeric aggregations such as `average`, `p95`, or `sum`, choose a numeric span attribute. ### Span Names Any span name your agent emits can be queried. The following well-known span names map to Coval's built-in trace components: | Span Name | Component | |-----------|-----------| | `llm` | Language model invocations | | `tts` | Speech synthesis | | `stt` | Speech recognition | | `llm_tool_call` | Individual tool/function calls | | `turn` | A single conversation turn | Custom span names (e.g. `document_retrieval`, `database_lookup`) work as well — use whatever names your agent emits. ## How to Create **Step: Open the Metrics page** Navigate to the **Metrics** section in the Coval dashboard. **Step: Click Create Metric** Select **Custom Trace Metrics** from the metric type group. 
**Step: Configure the metric** Fill in **Span Name**, **Metric Attribute**, and **Aggregation Method** for your use case. **Step: Name and save** Give the metric a descriptive name and save. It is now available to add to any run. ## Use Cases ### Custom Latency Tracking Extract average document retrieval latency from your custom retrieval spans: | Field | Value | |-------|-------| | Span Name | `document_retrieval` | | Metric Attribute | `retrieval_latency_ms` | | Aggregation Method | Average | This gives you the average retrieval latency across all turns in the simulation. Compare it across runs to catch regressions after changes to your index, embeddings, or chunking strategy. ### p90 External API Latency Track tail latency for an external service your agent depends on: | Field | Value | |-------|-------| | Span Name | `weather_api` | | Metric Attribute | `duration_ms` | | Aggregation Method | p90 | Use p90 instead of average when you care about tail performance instead of typical performance, especially for services that can occasionally spike. ### Tool Call Duration Monitoring If your agent emits custom spans for specific tool calls with a duration attribute: | Field | Value | |-------|-------| | Span Name | `database_lookup` | | Metric Attribute | `duration_ms` | | Aggregation Method | Average | ### Confidence Score Extraction If your agent records a confidence score on each language model span: | Field | Value | |-------|-------| | Span Name | `llm` | | Metric Attribute | `confidence_score` | | Aggregation Method | Average | > **Tip:** Custom Trace Metrics complement built-in trace metrics like **LLM Time to First Byte** and **TTS Time to First Byte**. Use the built-in metrics for standard pipeline components and Custom Trace Metrics for signals specific to your agent's instrumentation. > **Tip:** Want an AI-assisted setup? 
Use [Tracing Skills](/concepts/simulations/traces/tracing-skills) to have your coding agent inspect real traces, recommend 3-6 useful metrics, and create only metrics backed by span data that exists. --- ## Metric Chaining Source: https://docs.coval.ai/concepts/metrics/MetricChaining Combine metrics into custom logic flows for more efficient and accurate evaluations Metric Chaining allows you to create conditional metric flows where a follow-up metric runs only when specific criteria are met by a trigger metric. This approach helps you avoid cramming multiple evaluation checks into a single metric while ensuring accuracy and efficiency in your evaluations. Instead of creating one complex metric that tries to handle multiple scenarios, you can break down your evaluation logic into separate, focused metrics that run conditionally based on the results of previous metrics. ## Benefits of Metric Chaining - **Improved Accuracy**: Each metric focuses on a specific aspect of the conversation - **Efficiency**: Only run metrics that are relevant to the specific conversation flow - **Clarity**: Separate concerns make metrics easier to understand and maintain - **Flexibility**: Create complex evaluation logic without overwhelming single metrics ## How Metric Chaining Works 1. **Trigger Metric**: The primary metric that runs first and determines whether additional metrics should execute 2. **Follow-up Metric**: The secondary metric that runs conditionally based on the trigger metric's result 3. **Criteria**: The condition that determines when the follow-up metric should run (e.g., "equal to 0", "greater than 0") ## TL;DR Walkthrough: [Video: Loom Video](https://www.loom.com/embed/743bcac22b124fc09fe7c49e5429ad87?sid=871e5013-0f7d-4102-8d7b-2755cb684424) ## Example Use Case: Appointment Setter Agent Consider an appointment setter agent with two distinct evaluation needs: 1. **Repeat Caller Handling**: Check if the agent correctly identifies repeat callers 2. 
**Patient Information Collection**: Check if the agent collects necessary information (first name, last name, phone number)

Without metric chaining, you might be tempted to create one large metric covering both scenarios. With metric chaining, you can:

- Use "Repeat Caller Handling" as your trigger metric
- If it returns "No" (the agent couldn't identify the caller as a repeat caller), then run "Patient Information Collection"
- If it returns "Yes" (the agent identified a repeat caller), skip the information collection check

## Setting Up Metric Chaining

### Prerequisites

Before creating a metric chain, ensure you have:

1. Created your trigger metric
2. Created your follow-up metric
3. Tested both metrics individually

### Creating a Metric Chain

1. Navigate to **Metric Chains** in your dashboard
2. Click **"Add a Metric Chain"**
3. Configure the chain:
   - **Status**: Set as Active or Inactive
   - **Trigger Metric**: Select your primary metric
   - **Follow-up Metric**: Select the metric to run conditionally
   - **Criteria**: Define when the follow-up metric should run
     - **Equal to 0**: Run follow-up when trigger returns "No"
     - **Greater than 0**: Run follow-up when trigger returns "Yes"
     - **Other conditions**: As needed for your use case
4. **Save** your metric chain

### Applying Metric Chains in Evaluations

When launching an evaluation with metric chains:

1. In your evaluation setup, select **only the trigger metric**
2. The follow-up metric will automatically run based on your chain conditions
3. Do not manually select the follow-up metric; the chain handles this

> **Warning:** Only select the trigger metric when launching evaluations. The chained metrics will run automatically based on your configured conditions.

## Example Results

### Scenario 1: Trigger Metric Returns "Yes"

- **Trigger**: "Repeat Caller Handling" returns **Yes**
- **Result**: Agent successfully identified repeat caller
- **Chain Action**: Follow-up metric does NOT run
- **Transcript Example**: "Perfect, John Doe, I see you..."
### Scenario 2: Trigger Metric Returns "No" - **Trigger**: "Repeat Caller Handling" returns **No** - **Result**: Agent could not identify repeat caller - **Chain Action**: "Patient Information Collection" runs automatically - **Transcript Example**: "I can't find your information in our system, I'm sorry..." ## Best Practices - **Start Simple**: Begin with two-metric chains before creating more complex flows - **Test Individually**: Ensure each metric works correctly on its own before chaining - **Clear Logic**: Make sure your chain conditions align with your evaluation goals - **Document Chains**: Keep track of your metric chain logic for team collaboration ## Advanced Usage Metric chaining can be extended for more complex scenarios: - **Multi-step Chains**: Chain multiple metrics in sequence - **Different Conditions**: Use various threshold conditions for triggering - **Business Logic**: Implement complex business rules through chained evaluations > **Info:** **Need more complex metric chaining scenarios?** Contact our team to discuss advanced metric chain configurations for your specific use case. # Metric Chaining vs Workflow Verification: When to Use Each > Understanding the differences and choosing the right evaluation approach for your use case Both Metric Chaining and Workflow Verification help you evaluate conditional logic in your agent conversations, but they serve different purposes and work in distinct ways. This guide helps you choose the right approach for your specific evaluation needs. 
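
To make the chaining side of this comparison concrete, the trigger / criteria / follow-up rule can be sketched in a few lines of Python. This is an illustrative sketch only, not Coval's implementation: `MetricChain`, `CRITERIA`, and `follow_up_metrics` are hypothetical names, and it assumes a binary trigger metric is scored 1 for "Yes" and 0 for "No", in line with the "Equal to 0" and "Greater than 0" criteria described for chain setup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The two criteria named in the chain setup; the dictionary keys are
# hypothetical identifiers chosen for this sketch.
CRITERIA: Dict[str, Callable[[float], bool]] = {
    "equal_to_0": lambda value: value == 0,     # trigger returned "No"
    "greater_than_0": lambda value: value > 0,  # trigger returned "Yes"
}


@dataclass(frozen=True)
class MetricChain:
    trigger_metric: str
    follow_up_metric: str
    criteria: str
    active: bool = True


def follow_up_metrics(chain: MetricChain, trigger_value: float) -> List[str]:
    """Return the follow-up metrics to run for a given trigger result."""
    if not chain.active:
        return []
    if CRITERIA[chain.criteria](trigger_value):
        return [chain.follow_up_metric]
    return []


chain = MetricChain(
    trigger_metric="Repeat Caller Handling",
    follow_up_metric="Patient Information Collection",
    criteria="equal_to_0",
)
print(follow_up_metrics(chain, 0))  # trigger said "No": follow-up runs
print(follow_up_metrics(chain, 1))  # trigger said "Yes": follow-up is skipped
```

Only the trigger result is needed to decide what runs next, which is why only the trigger metric has to be selected at launch.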
## Overview Comparison

| Feature | Metric Chaining | Workflow Verification |
| :------------------- | :---------------------------------- | :--------------------------------------- |
| **Purpose** | Custom conditional evaluation logic | Pre-defined workflow compliance checking |
| **Setup Complexity** | Moderate (create multiple metrics) | Simple (uses existing agent workflow) |
| **Flexibility** | High - any conditional logic | Limited to predefined workflows |
| **Granularity** | Separate results for each condition | Single workflow compliance score |
| **Efficiency** | Runs only relevant metrics | Evaluates entire workflow path |

## When to Use Metric Chaining

Choose **Metric Chaining** when you need:

### **Custom Conditional Logic**

- Complex "if-then" scenarios that don't follow a linear workflow
- Multiple branching conditions based on conversation context
- Business rules that vary based on user characteristics or responses

### **Granular Insights**

- Separate scores for each evaluation step
- Detailed breakdown of where conversations succeed or fail
- Ability to analyze specific conditional branches independently

### **Efficiency Optimization**

- Avoid running irrelevant evaluations
- Save computation costs on large-scale monitoring
- Focus evaluation resources on applicable scenarios

### **Example Use Cases**

- **New vs.
Returning Users**: "If user is new → check info collection, if returning → check account verification"
- **Product-Specific Flows**: "If insurance inquiry → check coverage questions, if claims → check claim validation"
- **Escalation Scenarios**: "If technical issue → check troubleshooting steps, if billing → check payment verification"

## When to Use Workflow Verification

Choose **Workflow Verification** when you have:

### **Pre-Defined Linear Workflows**

- Clear, sequential steps your agent should follow
- Workflows already configured during agent creation
- Standard operating procedures that rarely change

### **Overall Compliance Checking**

- Need to verify agents follow established processes
- Simple pass/fail evaluation for entire workflow
- Regulatory or compliance requirements

### **Quick Setup Requirements**

- Want immediate evaluation without creating custom metrics
- Have straightforward, documented agent workflows
- Need basic workflow adherence monitoring

### **Example Use Cases**

- **Customer Service Flow**: "Greeting → Issue Identification → Resolution → Closure"
- **Sales Process**: "Qualification → Needs Assessment → Presentation → Close"
- **Support Tickets**: "Intake → Categorization → Assignment → Resolution"

## Detailed Example: Appointment Scheduling Agent

Let's compare how each approach handles an appointment scheduling scenario:

### **Scenario**: Agent should collect different information based on appointment type

**Requirements:**

- New patient appointments: Collect name, phone, insurance
- Follow-up appointments: Verify existing info, confirm time
- Emergency appointments: Prioritize urgency, collect minimal info

### **Metric Chaining Approach**

```
Trigger Metric: "Appointment Type Identification"
├── If "New Patient" → Run "New Patient Info Collection"
├── If "Follow-up" → Run "Existing Patient Verification"
└── If "Emergency" → Run "Emergency Prioritization Check"
```

**Benefits:**

- Each appointment type gets targeted evaluation
- Separate success rates for different flows
- No wasted evaluations on irrelevant scenarios

**Results Example:**

- Appointment Type ID: 95% success
- New Patient Info: 87% success (only for new patients)
- Follow-up Verification: 92% success (only for follow-ups)

### **Workflow Verification Approach**

```
Predefined Workflow:
1. Identify appointment type
2. Collect appropriate information
3. Schedule appointment
4. Confirm details
```

**Benefits:**

- Simple setup using existing agent workflow
- Single compliance score for entire process
- Easy to understand pass/fail results

**Results Example:**

- Overall Workflow Compliance: 89% success

## Implementation Guidance

### **Start With Workflow Verification If:**

- Your agent has well-defined, linear workflows
- You need quick evaluation setup
- Simple compliance checking meets your needs
- Your team prefers straightforward metrics

---

## Human Reviews

Source: https://docs.coval.ai/concepts/metrics/human-review/human-review

Learn how to perform human reviews of agent conversations and metrics in Coval

## Overview

The human review workflow in Coval allows you to manually review conversations or runs to ensure quality and identify areas for improvement. This guide will walk you through the process of conducting reviews and providing actionable feedback.

> **Info:** **Pro Tip:** Regular human reviews are essential for maintaining high-quality AI interactions and identifying areas for improvement in your agent's performance.

[Video: Loom Video](https://www.loom.com/embed/460ff2faae254749af88bb32fc5c6c53)

## Getting Started with Human Review

**Step: Select a Run or Conversation**

![Runs Page](/concepts/metrics/human-review/images/runs_page.png)

1. Navigate to either the Simulations or Conversations pages
2. Choose the specific run or conversation you want to review
3.
Use keyboard shortcuts to navigate: - `j` / `k`, `w` / `s`, or `up` / `down` to move through rows - `Enter` to open the selected run or conversation - once you are in a result view, `h` / `l`, `a` / `d`, or `left` / `right` move between neighboring conversations **Step: Review the Content** ![Review Interface](/concepts/metrics/human-review/images/review_run.png) - Compare and review the agent's performance against the metrics - Determine the correct value based on your assessment - Update the metric if the automated value is incorrect - Add notes to the run to provide feedback - Notes can be dragged and positioned anywhere in the review interface **Step: Track Reviewed Content** ![Human Eval Page](/concepts/metrics/human-review/images/human_eval_page.png) - Reviewed or partially reviewed content automatically appears in the Human Eval page - View all your reviewed runs from both simulations and conversations ## Supported Metric Types Not all metrics support human review — only those with a defined annotation mechanism can be labeled in the review interface. Metrics fall into four categories based on how reviewers interact with them. ### Direct Value Metrics Reviewers provide a single value for the entire conversation using buttons, a number input, or a dropdown. #### Binary (Pass/Fail) Reviewers select **Yes**, **No**, or **N/A** using on-screen buttons or keyboard shortcuts. - Applies to: binary LLM judge metrics, audio binary judge, agent repeats itself #### Numerical Reviewers enter a number within a configured min/max range. - Applies to: numerical LLM judge, audio numerical judge #### Categorical Reviewers select from a configured list of categories using a dropdown. - Applies to: categorical LLM judge, audio categorical judge #### Transcript Sentiment Analysis Reviewers select a sentiment label (e.g. Rude, Polite, Encouraging, Professional) using category buttons. 
#### Composite Evaluation Reviewers assess each criterion individually using MET / NOT_MET / UNKNOWN toggles. --- ### Audio Region Metrics Reviewers mark or edit regions on an audio waveform timeline. These metrics require an audio recording to be present on the conversation. Includes: interruption rate, latency, abrupt pitch changes, volume/pitch misalignment, non-expressive pauses, vocal fry, music detection, time to first audio, volume variance, custom pause analysis, agent needs reprompting. --- ### Per-Segment Labeling Reviewers assign a label to each speaking segment in the conversation. - **Audio sentiment** — label each segment as Neutral, Angry, Happy, or Sad --- ### Per-Message Review Reviewers provide a value for each individual message in the transcript. - **Words per message** — count of words per assistant message ## Next Steps After reviewing runs, you can: 1. **Improve Your Agent** - Use the feedback to update prompts and capabilities - Run new simulations to test improvements 2. **Refine Your Metrics** - Test metric changes in simulations before deploying - Use create metrics to update or test new metrics 3. **Assign More Reviews** - Delegate runs to team members for additional review - Track review progress in the Human Eval page > **Info:** **Continuous Improvement:** Use these insights to iteratively enhance both your agent and metrics, creating a feedback loop that drives better performance. For the full keyboard model, see the [Keyboard Navigation guide](/guides/keyboard-navigation). --- ## Templates Source: https://docs.coval.ai/concepts/templates/overview Launch evaluations quickly with pre-saved configurations Templates let you save evaluation configurations—including agent, test set, persona, and metrics—so you can launch simulations consistently with one click. You can also schedule recurring evaluations from any template. ## Creating a Template Navigate to **Templates** in the sidebar, then click **New Template**. 
### Configuration Steps The template creation form walks you through each component: **1. Select Agent** Choose the voice or chat agent you want to test. Your agent connection settings (phone number, websocket URL, etc.) are preserved from your agent configuration. **2. Select Persona(s)** Choose how the simulated user should behave. You can select multiple personas—each persona will create a separate run, letting you compare performance across different user types. **3. Select Test Set** Pick the test cases that define the conversation scenarios. These determine what the simulated user will say and do during the evaluation. **4. Set Iterations** Define how many times each test case runs. With 2 test cases and 3 iterations, you'll get 6 total conversations. **5. Set Concurrency** Control how many simulations run in parallel. Higher concurrency speeds up evaluation but may hit rate limits on your agent infrastructure. **6. Select Metrics** Choose which metrics to evaluate. These can be built-in metrics (latency, interruptions) or custom metrics you've created. **7. Select Mutations (Optional)** If you've set up [agent mutations](/concepts/agents/overview#agent-mutations), select which variants to test. Each mutation creates a separate run comparing the base agent against the mutated version. **8. Save Template** Click **Create Template** to save. Your template now appears in the templates list. ## Launching from a Template From the Templates list: 1. Find your template and click **Run Now** 2. Review the pre-filled configuration 3. Click **Launch Evaluation** to start immediately Or from the **Launch Evaluation** page: 1. Select **Use Template** 2. Choose your saved template 3. Customize any settings for this specific run 4. Launch ## Scheduling Recurring Evaluations Templates can power scheduled, recurring evaluations—useful for continuous monitoring and regression detection. ### Creating a Scheduled Run 1. 
From the Templates list, click **Schedule** on your template 2. Configure the schedule: - **Name**: Identify this scheduled job - **Frequency**: Hourly, daily, or weekly - **Start/End dates**: Optional window for the schedule 3. Review the template configuration that will be used 4. Click **Create Schedule** The scheduled run inherits all template settings—agent, personas, test set, metrics, and mutations. Each time the schedule triggers, it launches a new evaluation with those exact parameters. ### Managing Schedules View all scheduled runs in the **Scheduled** tab: - **Active**: Schedules currently running on their cadence - **Paused**: Temporarily disabled schedules - **Completed**: Schedules that reached their end date Click any schedule to see its run history, success rate, and trend metrics over time. ## Best Practices **Template Organization** - Create templates for each major workflow you test regularly - Name templates descriptively (e.g., "Disputes - Angry Customer Persona") - Use folders or naming conventions to group related templates **Scheduled Runs** - Start with daily schedules for active development - Use hourly only for high-traffic production monitoring - Set end dates for temporary testing periods **Mutation Testing** - Create templates with mutations to validate prompt changes - Compare base vs. mutated results before deploying changes ## Deprecated Features > **Warning:** The legacy "Scheduled Evaluations" feature has been removed. All recurring evaluations now use Templates with Scheduled Runs, which provides: > - Better visibility into configuration > - Consistent parameter inheritance > - Centralized management in the Templates section --- ## Simulations Source: https://docs.coval.ai/concepts/simulations/overview Simulate agent-user conversations and evaluate the results. When you launch a run, you trigger a simulation and subsequent evaluation of that simulation. 
Coval supports different simulation approaches:

- **Text-based**: For chat agents using text inputs and outputs
- **Voice-based**: For voice agents with audio inputs and outputs

[Video: Loom Video](https://www.loom.com/embed/b47a9ab6f7a04c0baff2b8817882554f)

## **Setting Up an Evaluation**

1. Click "Launch Evaluation"
2. Select a template or configure manually:
   - Choose a test set
   - Select an agent to test
   - Select a persona
   - Choose metrics to track
   - Set simulation parameters
   - _(Optional)_ Add tags to label this run

_(See "Templates" for more information)_

## **Tagging Runs**

You can add up to 20 tags to a run at launch time. Tags are useful for organizing and filtering runs, for example by environment, release version, or test type.

**From the UI:** A "Tags" card appears in the launch panel. Type a tag name and click **+** (or press Enter) to add it. Click the **×** on any tag chip to remove it.

**Via the API:** Pass tags in the `metadata.tags` field of the launch request:

```json
{
  "agent_id": "...",
  "persona_id": "...",
  "test_set_id": "...",
  "metadata": {
    "tags": ["regression", "v2.1", "nightly"]
  }
}
```

Constraints: max 20 tags per run, each tag max 200 characters.

After launch, you can filter simulations by tag using the `tag=` filter expression (e.g., `tag="regression"`).

## **Scheduling Recurring Evaluations**

1. Enable the "Schedule Recurring" option
2. Set frequency (hourly, daily, weekly)
3. Configure start and end dates if applicable
4. Set alert thresholds for specific metrics (in "Alerts")

> **Info:** **Benefits of Recurring Evaluations:**
> - Continuous monitoring of your agent's performance
> - Early detection of regressions or issues
> - Ability to set alerts when specific metrics underperform
> - Historical performance tracking for trend analysis

# **Analyzing Evaluation Results**

A simulation is a conversation between Coval's simulated user and your voice or chat agent. Test sets and templates define the conditions under which your agent is tested, and metrics define the success or failure criteria for your tests.

[Video: Loom Video](https://www.loom.com/embed/2dd159a3a35f470ab67eb6f56e27321f?sid=b770fa47-f18f-423e-8ce0-3833e03be93a)

## Runs

A Run is an evaluation. A Run can consist of multiple conversations (e.g., if the test set consists of multiple scenarios/transcripts).

On each run, you will see the following set of actions:

- [Resimulate](#resimulating-individual-simulations): Re-run one or more simulations in place if something looks off, or to confirm the performance of a specific metric
- Rerun metrics: If an LLM Judge metric doesn't perform as expected and you need to adjust it, go back to the run and rerun that specific metric
- Compare: Compare a run with any other run that was performed on the same test set
- Human Review: Provide feedback on the run results and send it to "Manual Review" for team members to collaborate on iterations
- Share: Share an internal or public link to your run results, a great way to use simulations as part of your sales process!

![Docs Runresults Pn](/images/docs-runresults.png)

Clicking on one call of this run will open your metric results in detail, allowing you to check your results in depth, detect where in the transcript your issues arise, and see detailed explanations for LLM Judge metrics. If [OpenTelemetry traces](/concepts/simulations/traces/opentelemetry) are available for the simulation, an **OTel Traces** card appears in the metric grid showing span count and linking to the trace viewer.

![Docs Runresults2 Pn](/images/docs-runresults2.png)

## Resimulating Individual Simulations

If a single simulation looks off (a flaky agent response, a metric that didn't fire correctly, or results from before you updated your agent configuration), you can rerun it in place without launching a whole new run.

**How to resimulate from the run results table:**

1.
Open the run from your Runs list. 2. In the results table, use the checkboxes on the left to select one or more simulations. 3. Click the **Resimulate (N)** button in the bulk action bar. 4. Confirm in the dialog. The selected simulations are queued immediately and you'll see a toast confirming how many were accepted. **What resimulation does:** - Reruns each selected simulation against the **latest** agent, persona, test set, test case, metric, and agent mutation configuration. - Overwrites the existing simulation output (transcript, audio, tool calls) and metric results in place. - Keeps the original simulation and run records — the `simulation_id` and `run_id` are preserved so any dashboards, shares, or links pointing at them continue to work. - Runs asynchronously — the simulation's status will move from `IN_QUEUE` → `IN_PROGRESS` → back to a terminal status. Refresh the page to see progress. > **Note:** Because resimulation **overwrites** existing results, the original output for the selected simulations is lost. If you want to keep the old results for comparison, launch a new run instead. **When resimulation is rejected:** - The simulation is currently running or queued (wait for it to finish first). - The underlying agent, persona, test set, test case, or metric has been deleted since the original run. - The run belongs to a conversation upload (conversations can't be resimulated). In any of these cases, you'll see a toast with a specific reason and the affected simulations are skipped. Other selected simulations still queue normally. ## Overview The Overview tab consists of all individual conversations. It helps you get an overview of your agent's performance by creating your own summary graphs and see aggregated performance over time. ## Human Review Use Coval's Human-in-the-loop review capabilities to label runs for review. ## Deterministic Simulation Modes By default, the persona generates responses dynamically using an LLM. 
For cases where you need repeatable, deterministic persona behavior, Coval offers two additional test case input types: - **Audio Upload**: Upload a pre-recorded audio file (persona's side of the conversation) that plays back exactly as recorded instead of generating persona speech. The audio is automatically transcribed so persona turns still appear in the transcript. After playback completes, the simulation waits a 30-second grace period for the agent to finish responding, then ends the call. You can optionally attach a ground truth transcript to each test case to enable the [STT Word Error Rate (Audio Upload)](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) metric, which measures your agent's speech recognition accuracy against the known-correct transcript. See [Test Sets — Audio Upload](/concepts/test-sets/overview#3-audio-upload) for setup details. - **Scripted Turns**: Define an ordered list of exact lines for the persona to deliver turn by turn. The persona still uses the configured voice and background sounds, but speaks the scripted text instead of LLM-generated responses. A built-in divergence detector monitors agent responses and can end the simulation early if the agent goes off-track. See [Test Sets — Script](/concepts/test-sets/overview#4-script) for setup details. ## Image Attachments For WebSocket voice agents, Coval can also simulate image-sharing workflows by attaching one image to a test case and letting the persona send it during the conversation when the agent asks for visual context. This is useful for scenarios like submitting a receipt, showing a damaged item, or sending an ID photo after the agent requests it. **Requirements:** - The test case must include an image attachment. - The test set must be attached to a WebSocket voice agent. - The simulation must run against a WebSocket voice agent. - The agent must be configured with a compatible `send_media_template`. 
See [Test Sets: Image Attachment](/concepts/test-sets/overview#5-image-attachment) for test-case setup and [WebSocket](/concepts/agents/connections/websocket#media-send-template) for payload configuration.

## Simulation Time Limits

Each simulated conversation has a maximum duration:

| Limit | Duration |
|---|---|
| **Default timeout** | 10 minutes |
| **Maximum timeout** | 15 minutes |

A simulation ends when the conversation reaches a natural conclusion, the test objective is met, or the timeout is reached, whichever comes first.

> **Info:** If your agent requires longer conversations, contact [support@coval.dev](mailto:support@coval.dev) to discuss your use case. The hard maximum per simulation is 15 minutes.

---

## **Best Practices for your Evaluations:**

> **Tip:** **Testing Strategy:**
> - Start with core functionality test cases
> - Expand to edge cases and failure scenarios
> - Include regression tests for fixed issues
> - Test across different user personas and scenarios

> **Tip:** **Continuous Improvement:**
> - Regularly update test sets based on production data
> - Refine metrics as your understanding of agent performance evolves

---

## Multi-Run Analysis

Source: https://docs.coval.ai/concepts/simulations/multi-run-analysis

Compare and analyze multiple evaluation runs side-by-side in a single report.

Multi-Run Analysis lets you bring multiple runs together into a single view so you can spot regressions, compare agent variants, and track metric trends across evaluations. Instead of flipping between individual run pages, you see all the data in one place, with color-coded grouping, aggregated statistics, and a shareable URL.
## When to use it

Multi-Run Analysis is useful when you want to:

- **Compare agent versions**: run the same test set against different agent builds and see which performs better across every metric
- **Evaluate persona impact**: test the same agent against multiple personas and understand how user behavior affects outcomes

## Accessing Reports

Navigate to **Reports** in the left sidebar.

## Creating a Report

There are two ways to start a new report.

### From the Runs list

1. Go to **Runs** in the sidebar.
2. Enable select mode and check the runs you want to analyze (up to 50 at a time).
3. Click **Multi-Run Report**. This opens the report builder pre-loaded with those run IDs in the URL (`?run_ids=...`).

Alternatively, while the runs list has filters applied, click **Multi-Run Report with filters** to open a report that dynamically loads up to 50 runs matching the current filter state. The filter parameters are encoded in the URL, so the same link will resolve to the same runs when shared.

### From the Reports page

Click **New Report** from the Reports page, then add run IDs manually or navigate there via the runs list flow above.

## Compare By

The **Compare by** dropdown (top-right of the report) is the core analytical tool. It segments the data by a dimension you choose, then color-codes each segment so patterns are immediately visible in both the metric cards and the results table.

| Option | What it segments by |
|---|---|
| **None** | No grouping; all rows shown together |
| **Agent** | Groups rows by the agent that ran the simulation |
| **Persona** | Groups rows by the persona used |
| **Test case** | Groups rows by the specific test case input |
| **Metadata** | Groups rows by a custom metadata key you specify |

## View Modes

Once a Compare By dimension is selected, you can switch between two view modes:

### Row view

Each simulation output appears as an individual row, color-coded by its Compare By group. This is the default. Use it when you want to inspect individual conversations or find outliers within a group.

### Grouped view

Rows are collapsed into one row per group. Each group row shows aggregated metric scores for all simulations in that group. Use this when you want a high-level comparison across groups without the noise of individual results.

The grouped view toolbar lets you toggle between five aggregation modes:

| Mode | What it shows |
|---|---|
| **Average** | Mean score across all simulations in the group |
| **Median** | Middle value; less sensitive to outliers than average |
| **P95** | 95th percentile; useful for understanding worst-case performance |
| **Min** | Lowest score in the group |
| **Max** | Highest score in the group |

Click a group row in the grouped view to expand it and see the individual simulation rows within that group.

## Filtering by Metric

Clicking a metric card in the left pane filters the results table to show only the column for that metric, making it easier to focus on one score at a time. Click **All Metrics** in the breadcrumb to return to the full table.

## Saving a Report

An unsaved report (opened from the runs list) shows a **Save Report** button in the header. Click it to save the current set of run IDs and view configuration (Compare By setting, view mode, and color overrides). After saving, the report gets a permanent ID and appears in the Reports list.

On a saved report, the Save button is only active when the view configuration has changed from what was last saved. Click it to persist your latest configuration changes.

To rename a saved report, click the pencil icon next to the report title and type a new name. Press Enter or click away to save.

## Sharing a Report

Saved reports can be published for external sharing.

1. Open a saved report.
2. Click the **Share** button in the header.
3. Click **Publish shareable link**. This publishes all runs in the report and generates a public URL at `/shared/reports/{report_id}`.
4.
Copy the link from the popover and share it. Anyone with the link can view the report without a Coval account. Published reports show a **Public** badge in the Reports list. To revoke access, open the Share popover and click **Unpublish all**. > **Info:** Reports copied from the Reports list actions menu show a warning if the report is still private — the link won't be accessible until the report is published. ## Deleting a Report From the Reports list, open the actions menu (three-dot icon) on any report row and select **Delete**. You'll be asked to confirm before the report is permanently removed. Deleting a report does not delete the underlying runs. --- ## OpenTelemetry Traces Source: https://docs.coval.ai/concepts/simulations/traces/opentelemetry Send traces from your agent to Coval using the OpenTelemetry SDK. > **Warning:** **Beta Feature** — Tracing with OpenTelemetry is currently in beta and under active development. Functionality and APIs may change as we continue to improve the experience. You can send traces from your agent to Coval using the [OpenTelemetry](https://opentelemetry.io/) SDK. This lets you capture detailed span data — such as tool calls, LLM invocations, and other operations — and export it directly to Coval for analysis alongside your simulation or conversation results. Tracing works for both **simulations** (where Coval calls your agent) and **conversations** (where you submit post-hoc call data). The setup differs only in how you identify the call — everything else (instrumentation, span naming, viewing) is the same. 
> **Tip:** **New to tracing?** If you're using Pipecat, LiveKit, or Vapi, the [Coval Wizard (Beta)](/concepts/simulations/traces/wizard) can instrument your agent automatically with one command: `npx @coval/wizard`

> **Tip:** **Need a reviewable setup?** Use [Coval Tracing Skills](/concepts/simulations/traces/tracing-skills) to have your AI coding agent inspect your repo, propose a plan, add tracing, and validate one real Coval trace with a diff you can review.

> **Tip:** **Already using Langfuse?** Skip instrumenting a second SDK: connect your Langfuse account once and Coval imports traces automatically for each simulation. See [Import Traces from Langfuse](/concepts/simulations/traces/langfuse-import).

> **Tip:** **Already using Arize Phoenix?** Connect your Phoenix project once in Settings and Coval pulls spans after each simulation. Phoenix is OTel-native, so the integration is a thin fetch. See [Import Traces from Arize Phoenix](/concepts/simulations/traces/arize-import).

## **Prerequisites**

- A Coval account with an API key ([manage your keys](https://app.coval.dev/settings))
- A simulation output ID (for simulations) or a conversation ID (for conversations)
- Python 3.8+ with the OpenTelemetry SDK installed

Install the required packages:

```bash
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
```

## **Configuration**

Configure the OpenTelemetry tracer provider to export spans to Coval's trace ingestion endpoint:

```python
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

# Configure tracer
resource = Resource.create({SERVICE_NAME: "my-agent"})
provider = trace_sdk.TracerProvider(resource=resource)

exporter = OTLPSpanExporter(
    endpoint="https://api.coval.dev/v1/traces",
    headers={
        "X-API-Key": "<YOUR_COVAL_API_KEY>",          # placeholder: your Coval API key
        "X-Simulation-Id": "<SIMULATION_OUTPUT_ID>",  # placeholder: per-call simulation output ID
    },
    timeout=30,
)
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("my-agent")
```

| Parameter | Description |
|---|---|
| `endpoint` | Coval's OTLP trace ingestion URL: `https://api.coval.dev/v1/traces` |
| `X-API-Key` | Your Coval API key |
| `X-Simulation-Id` | The **simulation output ID** for the individual call being traced. This is per-simulation-call, not the run ID. |
| `timeout` | Export timeout in seconds. Must be set to `30` (see note below) |
| `SERVICE_NAME` | A name identifying your agent service |

> **Note:** The `timeout` parameter must be set to **30 seconds** to ensure spans are exported reliably. We are working on reducing this requirement in a future update.

## **Getting the Simulation Output ID**

The `X-Simulation-Id` header must be set to the **simulation output ID** for the specific call you're tracing. The simulation output ID is a per-call identifier, distinct from the run ID. Here's how to obtain it at runtime.

### Inbound voice agents

When Coval places an inbound call, it passes the simulation output ID as a SIP header: `X-Coval-Simulation-Id`. Read this header when the call arrives and use it to configure your OTLP exporter.

```python
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Example: reading the simulation output ID from a SIP header
# In your call.initiated webhook handler (Telnyx example), where `event`
# is the parsed webhook payload. next() raises StopIteration if the header
# is absent, for example on a call that Coval did not place.
simulation_id = next(
    h["value"]
    for h in event["payload"]["sip_headers"]
    if h["name"] == "X-Coval-Simulation-Id"
)

exporter = OTLPSpanExporter(
    endpoint="https://api.coval.dev/v1/traces",
    headers={
        "X-API-Key": "<YOUR_COVAL_API_KEY>",  # placeholder: your Coval API key
        "X-Simulation-Id": simulation_id,
    },
    timeout=30,
)
```

See the [Inbound Voice guide](/guides/simulations/inbound-voice) for provider-specific instructions on reading SIP headers (Twilio SIP trunking, Telnyx, etc.).

> **Note:** **Twilio Programmable Voice (PSTN)**: Standard Twilio phone numbers route over the public telephone network, which strips SIP headers.
Use the `pre_call_webhook_url` agent config instead: Coval will POST the simulation ID to your agent before dialing. See the [Twilio ConversationRelay guide](/guides/simulations/twilio-conversationrelay). ### Outbound voice agents Coval's outbound trigger POST can include the simulation output ID in the request payload. Add `simulation_output_id` to your `trigger_call_payload` configuration in your template, then read it when your webhook receives the trigger and use it to configure the exporter. > **Tip:** You can also find simulation output IDs in the Coval dashboard under any run's results, or via the Coval API. ## **Tracing for Conversations** For [conversations](/concepts/conversations/overview) (post-hoc call evaluation), there is no Coval-initiated call, so there is no simulation output ID available at call time. Instead, you use a **conversation ID** to associate traces with a conversation. The conversation ID is only available *after* the call ends and you submit the transcript to Coval — which means you can't configure the OTLP exporter up front. The solution is to buffer spans in memory during the call, then flush them once you have the ID. **Step: Buffer spans during the call** Use `InMemorySpanExporter` (included in `opentelemetry-sdk`) to hold spans locally during the call instead of exporting them in real time. 
```python from opentelemetry.sdk import trace as trace_sdk from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter from opentelemetry.sdk.resources import SERVICE_NAME, Resource resource = Resource.create({SERVICE_NAME: "my-agent"}) in_memory_exporter = InMemorySpanExporter() provider = trace_sdk.TracerProvider(resource=resource) provider.add_span_processor(SimpleSpanProcessor(in_memory_exporter)) tracer = provider.get_tracer("my-agent") # Instrument your agent as normal — spans accumulate in memory with tracer.start_as_current_span("llm") as span: span.set_attribute("metrics.ttfb", 0.42) response = call_llm() ``` **Step: Submit the conversation after the call ends** Post the transcript (and optionally audio) to `POST /v1/conversations:submit`. The response contains the `conversation_id` you need for trace export. ```python import requests response = requests.post( "https://api.coval.dev/v1/conversations:submit", headers={ "x-api-key": "", "Content-Type": "application/json", }, json={ "transcript": [ {"role": "user", "content": "Hello", "start_time": 0.0, "end_time": 1.2}, {"role": "assistant", "content": "Hi! How can I help?", "start_time": 1.5, "end_time": 3.0}, ], }, ) conversation_id = response.json()["conversation"]["conversation_id"] ``` See [`POST /v1/conversations:submit`](/api-reference/v1/conversations/conversations/submit-conversation-for-evaluation) for the full request schema including optional audio, metadata, and metrics fields. > **Tip:** If your recording URL isn't available at call end (common with Twilio Programmable Voice in multi-replica deployments), submit the transcript now to get a `conversation_id` for trace correlation, then attach the audio later with `PATCH /v1/conversations/{conversation_id}`. Text-only metrics fire after submit, audio metrics fire after PATCH. 
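
The submit snippet above indexes straight into the response body, so a failed or malformed response surfaces as a bare `KeyError`. A slightly more defensive variant is sketched below. It assumes only the response shape shown in this guide (`{"conversation": {"conversation_id": ...}}`); `extract_conversation_id` is a hypothetical helper name, not part of any Coval SDK.

```python
from typing import Any, Mapping


def extract_conversation_id(payload: Mapping[str, Any]) -> str:
    """Return conversation.conversation_id or raise a descriptive ValueError.

    Assumes the submit response shape shown in this guide.
    """
    conversation = payload.get("conversation")
    if not isinstance(conversation, Mapping):
        raise ValueError("submit response has no 'conversation' object")
    conversation_id = conversation.get("conversation_id")
    if not isinstance(conversation_id, str) or not conversation_id:
        raise ValueError("submit response has no 'conversation_id'")
    return conversation_id
```

In practice you would call `response.raise_for_status()` first and then `extract_conversation_id(response.json())`. Keeping the buffered spans in memory until this succeeds means a failed submit does not lose the trace.
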

**Step: Export the buffered spans**

Create an OTLP exporter with `X-Conversation-Id` and flush the buffered spans to Coval.

```python
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(
    endpoint="https://api.coval.dev/v1/traces",
    headers={
        "X-API-Key": "",
        "X-Conversation-Id": conversation_id,
    },
    timeout=30,
)

finished_spans = in_memory_exporter.get_finished_spans()
if finished_spans:
    otlp_exporter.export(list(finished_spans))
```

| Parameter | Description |
|---|---|
| `X-Conversation-Id` | The `conversation_id` returned by `POST /v1/conversations:submit`. Use this **instead of** `X-Simulation-Id`. |

> **Tip:** Traces can be sent immediately after submitting a conversation; no delay is needed.

### Full conversation tracing example

```python
import requests
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

COVAL_API_KEY = ""

# --- Call setup: buffer spans in memory ---
resource = Resource.create({SERVICE_NAME: "my-agent"})
in_memory_exporter = InMemorySpanExporter()
provider = trace_sdk.TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(in_memory_exporter))
tracer = provider.get_tracer("my-agent")

# --- During the call: instrument as normal ---
# call_llm() and synthesize_speech() stand in for your agent's own functions.
with tracer.start_as_current_span("llm") as span:
    span.set_attribute("metrics.ttfb", 0.42)
    response = call_llm()

with tracer.start_as_current_span("tts") as span:
    span.set_attribute("metrics.ttfb", 0.18)
    audio = synthesize_speech(response)

# --- After the call ends: submit transcript, then export spans ---
# `transcript` is the list of turns collected during the call,
# in the same shape as in the submit step above.
submit_response = requests.post(
    "https://api.coval.dev/v1/conversations:submit",
    headers={"x-api-key": COVAL_API_KEY, "Content-Type": "application/json"},
    json={"transcript": transcript},
)
conversation_id = submit_response.json()["conversation"]["conversation_id"]

otlp_exporter = OTLPSpanExporter(
    endpoint="https://api.coval.dev/v1/traces",
    headers={"X-API-Key": COVAL_API_KEY, "X-Conversation-Id": conversation_id},
    timeout=30,
)
finished_spans = in_memory_exporter.get_finished_spans()
if finished_spans:
    otlp_exporter.export(list(finished_spans))
```

### Uploading Traces via the Dashboard

You can also upload traces directly from the Coval dashboard without using the SDK. On the **Conversations** page, click **Upload to Conversations** and:

1. Add your audio file or transcript as usual
2. In the **Traces (Optional)** section, select your OTLP traces JSON file (must contain a `resourceSpans` array)
3. Click **Upload**. The conversation and traces are submitted together

This is useful for testing, debugging, or uploading historical traces that were captured separately.

## **Payload Limits & Batching**

A single export request to `/v1/traces` has a size limit. Large buffered exports, most commonly the end-of-call flush in the conversation flow above, can exceed it and fail with `413 Request Entity Too Large`.

Keep each export request under roughly **3–4 MB**. Treat this as a practical target, not a fixed contract: stay comfortably below it rather than tuning to an exact boundary.

### Splitting spans across requests

You can split one call's spans across multiple export requests. Every request carrying the same `X-Conversation-Id` (or `X-Simulation-Id`) is merged server-side into a single trace, reconstructed from each span's parent/child relationships. There is no ordering requirement between requests.

The simplest way to stay under the limit is `BatchSpanProcessor` with a bounded batch size, which chunks exports for you: ```python from opentelemetry.sdk.trace.export import BatchSpanProcessor provider.add_span_processor( BatchSpanProcessor(otlp_exporter, max_export_batch_size=512) ) ``` Lower `max_export_batch_size` if your spans carry large attributes such as full transcripts or prompts. > **Warning:** **Retry only the failed batch.** Spans are stored append-only with no de-duplication. If an export fails, resend only that batch — re-sending batches that already succeeded will duplicate spans in the trace view and double-count trace-based metrics. > **Note:** Spans can arrive before `POST /v1/conversations:submit` has finished registering the conversation. They are still attributed correctly and reconcile automatically — no special handling needed on your side. ## **Instrumenting Your Agent** Once the tracer is configured, wrap operations in spans to capture trace data: ```python # Use tracer in agent code with tracer.start_as_current_span("llm_tool_call") as span: span.set_attribute("function.name", "search_database") span.set_attribute("tool_call_id", "call_123") result = call_tool() ``` You can nest spans to capture the full call hierarchy of your agent — for example, a parent span for the overall request and child spans for individual tool calls or LLM invocations. > **Info:** **Shutdown** — Call `provider.shutdown()` when your agent exits. With `SimpleSpanProcessor`, spans are exported synchronously as each span ends (not buffered), so they are already in Coval before shutdown is called. Shutdown is still good practice for clean resource teardown. ```python # Call on agent exit for clean resource teardown. provider.shutdown() ``` ## **Span Naming Conventions** Coval's trace viewer applies semantic colors and labels to well-known span names. Using these names gives a richer experience in the UI and enables built-in trace metrics. 

| Span Name | Use For | Required Attributes | Optional / Recommended Attributes | Accepted Compatibility Aliases |
|-----------|---------|---------------------|-----------------------------------|--------------------------------|
| `llm` | LLM invocations | - | `metrics.ttfb` (seconds), `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `llm.finish_reason` (`stop`, `tool_calls`, `length`, `content_filter`) | - |
| `tts` | Text-to-Speech | - | `metrics.ttfb` (seconds) | - |
| `stt` | Speech-to-Text | `transcript` when using [STT Word Error Rate](/concepts/metrics/built-in-metrics#stt-word-error-rate) or the [Audio Upload variant](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) | `metrics.ttfb` (seconds), `stt.confidence` (ASR confidence 0.0-1.0) | `stt.transcription` is accepted by STT WER for older integrations, but new integrations should emit `transcript` |
| `stt.provider.` | Per-provider STT attempt (child of `stt`) | - | `stt.providerName`, `stt.confidence`, `metrics.ttfb` | - |
| `vad` | Voice Activity Detection | - | - | - |
| `llm_tool_call` | Individual tool/function calls | - | `function.name`, `tool_call_id`, `function.arguments` | Span name `tool_call`; attributes `tool.name`, `tool.call_id`, `tool.arguments` |
| `turn` | A single conversation turn | - | - | - |
| `conversation` | Full conversation | - | - | - |
| `pipeline` | Processing pipeline | - | - | - |
| `transport` | Audio/network transport | - | - | - |

Any span name works: spans with names not listed above will still appear in the UI with auto-assigned colors. Use `service.name` in your `Resource` to group spans by service.

> **Info:** For complete working implementations, see the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents) on GitHub: Vapi, Pipecat, and LiveKit agents that emit the full span schema.

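
The naming conventions above can be checked locally before any spans are sent. The sketch below is a hypothetical pre-flight helper, not something Coval ships: `check_span` and `WELL_KNOWN_SPAN_NAMES` are names invented here, and the rules only mirror what the table states (well-known names, the `transcript` requirement for STT Word Error Rate, and `metrics.ttfb` being a number of seconds).

```python
from typing import Any, List, Mapping

# Span names from the table above, plus the accepted `tool_call` alias.
WELL_KNOWN_SPAN_NAMES = {
    "llm", "tts", "stt", "vad", "llm_tool_call", "tool_call",
    "turn", "conversation", "pipeline", "transport",
}


def check_span(name: str, attributes: Mapping[str, Any]) -> List[str]:
    """Return human-readable notes about a span; an empty list means nothing to flag."""
    notes: List[str] = []
    if name not in WELL_KNOWN_SPAN_NAMES and not name.startswith("stt.provider."):
        notes.append(f"'{name}' is not a well-known span name (it still appears, with an auto-assigned color)")
    if name == "stt" and "transcript" not in attributes and "stt.transcription" not in attributes:
        notes.append("'stt' span has no 'transcript' attribute, which STT Word Error Rate needs")
    ttfb = attributes.get("metrics.ttfb")
    if ttfb is not None and (isinstance(ttfb, bool) or not isinstance(ttfb, (int, float))):
        notes.append("'metrics.ttfb' should be a number of seconds")
    return notes
```

Running this over the name and attributes of each buffered span in a test harness gives early feedback on spans that would render without semantic colors or would be unusable for built-in trace metrics.
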
## **Instrumenting STT Spans** To use the [STT Word Error Rate](/concepts/metrics/built-in-metrics#stt-word-error-rate) metric (or its [Audio Upload](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) variant), your agent must emit `stt` spans with a `transcript` attribute containing the transcribed text. This is what allows Coval to compare your agent's STT output against a reference transcript. Coval also accepts the older `stt.transcription` alias for compatibility, but `transcript` is the canonical attribute for new integrations. We also recommend attaching `stt.confidence` when your STT provider exposes a per-utterance confidence score. Here is an example using the [Pipecat](https://github.com/pipecat-ai/pipecat) framework: ```python from opentelemetry import trace as otel_trace from pipecat.services.deepgram.stt import DeepgramSTTService from pipecat.utils.tracing.service_decorators import traced_stt def _read_path(value, *path): current = value for segment in path: if current is None: return None if isinstance(segment, int): if isinstance(current, (list, tuple)) and 0 <= segment < len(current): current = current[segment] else: return None continue if isinstance(current, dict): current = current.get(segment) else: current = getattr(current, segment, None) return current def extract_stt_confidence(result): confidence = _read_path(result, "channel", "alternatives", 0, "confidence") if confidence is None: return None normalized = float(confidence) if 0.0 <= normalized <= 1.0: return round(normalized, 4) return None class CovalDeepgramSTTService(DeepgramSTTService): """Adds stt.confidence to Pipecat's built-in traced `stt` spans.""" def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self._current_stt_confidence = None async def _on_message(self, message): is_final = bool(getattr(message, "is_final", False)) self._current_stt_confidence = extract_stt_confidence(message) if is_final else None try: await 
super()._on_message(message) finally: if is_final: self._current_stt_confidence = None @traced_stt async def _handle_transcription(self, transcript, is_final, language=None): if is_final and self._current_stt_confidence is not None: span = otel_trace.get_current_span() if span.is_recording(): span.set_attribute("stt.confidence", self._current_stt_confidence) ``` Instantiate the subclass in your pipeline. With `PipelineTask(..., enable_tracing=True)`, Pipecat still emits the standard `stt` span, and the subclass adds `stt.confidence` onto that same span: ```python stt = CovalDeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY")) pipeline = Pipeline([ transport.input(), stt, context_aggregator.user(), llm, tts, transport.output(), ]) ``` For non-Pipecat agents, emit equivalent spans wherever your STT returns final transcriptions: ```python from opentelemetry import trace as otel_trace tracer = otel_trace.get_tracer("my-stt-instrumentation") with tracer.start_as_current_span("stt") as span: span.set_attribute("transcript", transcription_text) if confidence_score is not None: span.set_attribute("stt.confidence", confidence_score) ``` The span must be named `"stt"` and include the `transcript` attribute with the transcribed text. `stt.confidence` is optional, but when present it should be a 0.0-1.0 score for the final utterance. ## **Instrumenting LLM Spans** Include `llm.finish_reason` on `llm` spans so you can tell why the model stopped generating. This is especially useful when debugging responses that were silently cut off because `llm.finish_reason=length`. 

Here is a Pipecat example that enriches the built-in traced `llm` span:

```python
from opentelemetry import trace as otel_trace
from pipecat.services.openai.llm import OpenAILLMService


def _read_path(value, *path):
    current = value
    for segment in path:
        if current is None:
            return None
        if isinstance(segment, int):
            if isinstance(current, (list, tuple)) and 0 <= segment < len(current):
                current = current[segment]
            else:
                return None
            continue
        if isinstance(current, dict):
            current = current.get(segment)
        else:
            current = getattr(current, segment, None)
    return current


class _FinishReasonTrackingStream:
    def __init__(self, stream):
        self._stream = stream
        self._iter = stream.__aiter__()

    def __aiter__(self):
        return self

    async def __anext__(self):
        chunk = await self._iter.__anext__()
        finish_reason = _read_path(chunk, "choices", 0, "finish_reason")
        if finish_reason is not None:
            span = otel_trace.get_current_span()
            if span.is_recording():
                span.set_attribute("llm.finish_reason", str(finish_reason))
        return chunk

    async def aclose(self):
        if hasattr(self._iter, "aclose"):
            await self._iter.aclose()
        elif hasattr(self._stream, "aclose"):
            await self._stream.aclose()

    async def close(self):
        if hasattr(self._stream, "close"):
            await self._stream.close()
        else:
            await self.aclose()


class CovalOpenAILLMService(OpenAILLMService):
    """Adds llm.finish_reason to Pipecat's built-in traced `llm` spans."""

    async def get_chat_completions(self, params_from_context):
        stream = await super().get_chat_completions(params_from_context)
        return _FinishReasonTrackingStream(stream)
```

For non-Pipecat agents, set the attribute directly on your `llm` span after the provider response finishes:

```python
with tracer.start_as_current_span("llm") as span:
    # OpenAI Chat Completions shape: finish_reason lives on each choice.
    response = client.chat.completions.create(...)
    finish_reason = response.choices[0].finish_reason
    if finish_reason:
        span.set_attribute("llm.finish_reason", finish_reason)
```

Common values include `stop`, `length`, `tool_calls`, and `content_filter`.


## **Provider Fallback Spans**

Many voice agents use a provider fallback chain for STT, for example Deepgram → Google → Azure. Without per-provider spans, a single `stt` span only shows the final result; there is no visibility into which provider served the call, how long each attempt took, or why a fallback triggered.

The convention is to create one `stt.provider.` child span per provider attempt (the prefix followed by the provider name, as in the examples below), nested inside the parent `stt` span:

```
stt                          ← parent span: final result
└── stt.provider.deepgram    ← attempt 1 (succeeded)
```

Or for a fallback:

```
stt                          ← parent span: final result
├── stt.provider.deepgram    ← attempt 1 (failed, span status = ERROR)
└── stt.provider.google      ← attempt 2 (succeeded)
```

### Span attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `stt.providerName` | string | Provider name, e.g. `"deepgram"`, `"google"`, `"azure"` |
| `stt.confidence` | float | ASR confidence score from this provider (0.0–1.0) |
| `metrics.ttfb` | float | Time to first byte for this provider attempt (seconds) |

### Code example

```python
import time

from opentelemetry import trace as otel_trace

tracer = otel_trace.get_tracer("my-stt-instrumentation")


def transcribe_with_fallback(audio):
    # deepgram_stt and google_stt stand in for your own provider wrappers.
    providers = [("deepgram", deepgram_stt), ("google", google_stt)]
    final_transcript = None

    with tracer.start_as_current_span("stt") as stt_span:
        for provider_name, stt_fn in providers:
            attempt_start = time.time()
            with tracer.start_as_current_span(f"stt.provider.{provider_name}") as provider_span:
                provider_span.set_attribute("stt.providerName", provider_name)
                try:
                    result = stt_fn(audio)
                    ttfb = time.time() - attempt_start
                    provider_span.set_attribute("metrics.ttfb", round(ttfb, 4))
                    confidence = getattr(result, "confidence", None)
                    if confidence is not None:
                        provider_span.set_attribute("stt.confidence", confidence)
                    final_transcript = result.transcript
                    break  # success: stop trying fallbacks
                except Exception as e:
                    # Mark the failed attempt, then fall through to the next provider.
                    provider_span.set_status(otel_trace.StatusCode.ERROR, str(e))

        if final_transcript:
            stt_span.set_attribute("transcript", final_transcript)

    return final_transcript
```

## **Viewing Traces in Coval**

After a simulation completes or conversation traces are received, an **OTel Traces** card automatically appears in the metric grid on the result page when trace data is available. The card shows the total span count and a **View Traces** button that navigates directly to the trace viewer.

To view traces: open a run or conversation result, click into a result, and click the OTel Traces card. You can also navigate directly via URL:

```
https://app.coval.dev//runs//results//traces
```

Traces appear within a few seconds of the simulation completing or the conversation being submitted.

### Trace viewer features

The trace viewer has two visualization modes you can switch between using the toggle in the header:

**Waterfall view**: Shows spans as horizontal bars on a timeline, nested by parent-child relationships. Use the collapse/expand controls to focus on specific parts of the call hierarchy. You can filter by span type using the color-coded legend pills in the header.

**Flame graph view**: Shows all spans stacked by depth, giving a bird's-eye view of where time is spent. Interactions include:

- **Scroll** to pan the timeline left/right
- **Ctrl/Cmd + scroll** to zoom in and out
- **Drag-select** a region to zoom into that time range
- **Double-click** a span to zoom to fit that span's duration
- **Press F** to reset the view to fit the full trace
- A **mini-map** above the flame graph shows the full trace with your current viewport highlighted; drag it to pan quickly

In both views, clicking any span opens a **detail panel** on the right showing the span's attributes, timing, status, and parent chain. When no span is selected, the detail panel shows a **trace summary** with total spans, duration, span type breakdown with time percentages, slowest spans, and any error spans.

## **Transition Hotspots** Transition Hotspots give you a run-level view of how conversations flow through your agent's states — and where they fail. Rather than inspecting individual simulations one by one, you can see the full distribution of state-to-state transitions across an entire run at a glance. ### Walkthrough [Video: Loom Video](https://www.loom.com/embed/22d81a41276340f4b7fb42609dc455bc) ### Accessing Transition Hotspots The **Hotspots** tab appears on the run results page when at least one simulation in the run has OTel trace data. Navigate to a run, then click the **Hotspots** tab. If the tab is not visible, the run does not contain any traced simulations. You can also access it directly via the `?view=hotspots` query parameter on the run results URL. ### Reading the Heatmap The Hotspots view displays a heatmap matrix where: - **Rows** represent the origin state of a transition (the "from" state) - **Columns** represent the destination state (the "to" state) - **Each cell** represents a pair of states — for example, "greeting → account_lookup" Toggle between two views using the buttons in the header: | View | Description | |------|-------------| | **Counts** | Each cell shows how many times that state-to-state transition occurred across all simulations in the run | | **Failure Rate** | Each cell shows the percentage of simulations that failed when hitting that transition | Darker cells indicate higher counts or higher failure rates, depending on the active view. ### Drilling Down Click any cell in the heatmap to open a detail panel showing: - The total count and failure count for that transition - **Exemplar simulations** — individual simulations that passed through that state transition, with direct links to review them Use exemplars to understand why a particular transition has a high failure rate: open a failing simulation and inspect the transcript and trace together. 
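
To make the two heatmap views concrete, here is a small sketch of how counts and failure rates per transition could be computed. It is an illustration of the definitions above, not Coval's implementation: it assumes each simulation has already been reduced to an ordered list of state names plus a pass/fail flag, and `transition_stats` and the `resolution` / `escalation` state names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Transition = Tuple[str, str]


def transition_stats(
    simulations: Iterable[Tuple[List[str], bool]],
) -> Dict[Transition, Dict[str, float]]:
    """Aggregate (from_state, to_state) counts and failure rates.

    Each simulation is a pair: (ordered state names, failed flag).
    """
    counts: Counter = Counter()       # total occurrences of each transition
    sims_hit: Counter = Counter()     # simulations that passed through it
    sims_failed: Counter = Counter()  # how many of those simulations failed
    for states, failed in simulations:
        pairs = list(zip(states, states[1:]))
        counts.update(pairs)
        for pair in set(pairs):
            sims_hit[pair] += 1
            if failed:
                sims_failed[pair] += 1
    return {
        pair: {
            "count": counts[pair],
            "failure_rate": sims_failed[pair] / sims_hit[pair],
        }
        for pair in counts
    }


runs = [
    (["greeting", "account_lookup", "resolution"], False),
    (["greeting", "account_lookup", "escalation"], True),
]
stats = transition_stats(runs)
# stats[("greeting", "account_lookup")] -> {"count": 2, "failure_rate": 0.5}
```

Here `failure_rate` is a fraction between 0 and 1; multiply by 100 for the percentage shown in the **Failure Rate** view. Reading it alongside `count` shows whether a high rate rests on many simulations or only a handful.
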
### Top Hotspots Sidebar The **Top Hotspots** sidebar ranks state transitions by failure count, making it easy to find the most impactful problems without scanning the full matrix. The top-ranked transitions are the ones where the most simulations failed. ### Span Filters Use the **span type filters** to include or exclude specific span types from the transition analysis. Wrapper spans — such as `conversation`, `pipeline`, `transport`, and `session:*` spans — are automatically collapsed and filtered by default, so the heatmap focuses on the meaningful transitions within your agent's processing logic. > **Tip:** Start with the **Failure Rate** view to find which transitions are most problematic, then switch to **Counts** to understand the volume. A transition with a 100% failure rate but only 1 occurrence is less concerning than one with a 30% failure rate across 50 simulations. ## **Full Example** ```python from opentelemetry.sdk import trace as trace_sdk from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.sdk.resources import SERVICE_NAME, Resource # Configure tracer resource = Resource.create({SERVICE_NAME: "my-agent"}) provider = trace_sdk.TracerProvider(resource=resource) exporter = OTLPSpanExporter( endpoint="https://api.coval.dev/v1/traces", headers={ "X-API-Key": "", "X-Simulation-Id": "", }, timeout=30, ) provider.add_span_processor(SimpleSpanProcessor(exporter)) tracer = provider.get_tracer("my-agent") # Use tracer in agent code with tracer.start_as_current_span("llm_tool_call") as span: span.set_attribute("function.name", "search_database") span.set_attribute("tool_call_id", "call_123") result = call_tool() # Call on agent exit for clean resource teardown. provider.shutdown() ``` ## Using Span Attributes in Custom Metrics Any numeric span attribute your agent emits can be measured using a **Custom Trace Metric** (`METRIC_CUSTOM_TRACE`). 
This lets you track latency, token counts, or any other numeric value from your traces without writing custom evaluation code. To create a custom trace metric, specify: - **Span Name** — the `span_name` of the spans to aggregate (e.g. `llm`, `tts`, or any custom span you create) - **Metric Attribute** — the span attribute key containing the numeric value (e.g. `metrics.ttfb`, `token_count`) - **Aggregation Method** — how to aggregate across turns: `average`, `median`, `p90`, `p95`, `p99`, `max`, `min`, `sum`, `count`, `error_rate`, or `success_rate` See [Create Metric](/api-reference/v1/metrics/metrics/create-metric) for the full API reference. --- ## Import Traces from Langfuse Source: https://docs.coval.ai/concepts/simulations/traces/langfuse-import Connect your Langfuse account and Coval will automatically import traces into the viewer, metrics, and heatmap — no code changes required. If your agent already exports traces to [Langfuse](https://langfuse.com/), Coval can pull those traces into the trace viewer, run trace-based metrics (timings, LLM judges, custom metrics), and populate the transition heatmap — without re-instrumenting your agent. Connect your Langfuse account once in settings, and Coval handles the rest automatically after each simulation. ## **Walkthrough** [Video: Langfuse Integration Walkthrough](https://www.loom.com/embed/362d0ac342b84f75bd3ac148bdf5ec19) ## **How it works** 1. Your agent sends traces to Langfuse as it does today. 2. When a Coval simulation finishes, Coval fetches the traces that fall inside the simulation's time window from Langfuse's Public API. 3. Langfuse observations are mapped to OpenTelemetry spans and written to the same ClickHouse-backed trace store that native OTLP ingestion uses. 4. The [trace viewer](/concepts/simulations/traces/opentelemetry), trace metrics, and transition heatmap work against the imported spans exactly as they do for native OTLP traces. 
Imported spans are tagged with `service.name = langfuse` so they are easy to distinguish in the trace viewer. ## **Prerequisites** - A Coval account ([sign up](https://app.coval.dev)) - A [Langfuse](https://langfuse.com) project with at least one completed trace - Langfuse **Public Key** and **Secret Key** from your Langfuse project settings ## **Connect Langfuse** 1. Open **Settings → Integrations** in Coval. 2. Expand the **Langfuse Integration** panel. 3. Paste your **Public Key** and **Secret Key**. 4. Set **Host** if you self-host Langfuse. Leave it on the default `https://us.cloud.langfuse.com` for Langfuse Cloud. 5. Save. > **Note:** Coval stores the secret key server-side and never returns it to the browser. To rotate the key, use the **Replace key** button in the credentials card. ## **Correlation** To tie traces back to the right Coval simulation, Coval first tries to match Langfuse trace `metadata` on any of: - `simulation_output_id` - `session_id` - `coval_simulation_output_id` If your agent already sets a session ID equal to the Coval simulation output ID, import is precise. Otherwise, Coval falls back to the simulation's time window (`simulation.start_time` → `simulation.end_time`) and imports every trace in that window. For agents that handle more than one conversation at a time, set one of the metadata keys above to the simulation output ID so imports stay precise. ```python from langfuse import Langfuse langfuse = Langfuse() with langfuse.start_as_current_span( name="turn", metadata={"simulation_output_id": simulation_output_id}, ) as span: ... ``` ## **Verify traces landed** After a simulation finishes, open the result in Coval and click **View Traces**. Imported spans appear with `service.name = langfuse` and the original Langfuse attributes preserved under `langfuse.*` keys. OTel GenAI semantic-convention attributes (`gen_ai.request.model`, `gen_ai.usage.*`) are emitted for LLM generations so trace-based metrics work out-of-the-box. 
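
To sanity-check the correlation rules described above before running a simulation, you can test your own trace metadata against the documented keys. This is a sketch of the documented matching order only, not Coval's import code; `correlation_match` and `CORRELATION_KEYS` are hypothetical names.

```python
from typing import Any, Mapping, Optional

# Metadata keys listed in the Correlation section above.
CORRELATION_KEYS = ("simulation_output_id", "session_id", "coval_simulation_output_id")


def correlation_match(
    metadata: Optional[Mapping[str, Any]], simulation_output_id: str
) -> bool:
    """Return True if any documented metadata key equals the simulation output ID."""
    if not metadata:
        return False
    return any(
        str(metadata[key]) == simulation_output_id
        for key in CORRELATION_KEYS
        if key in metadata
    )
```

If this returns `False` for the traces your agent emits, the import relies on the time-window fallback, which is only precise when the agent handles one conversation at a time.
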
## **Limits** - Import runs once per simulation, synchronously, with a 30-second budget. - Up to 500 traces per simulation time window are imported (100/page × 5 pages). - If a simulation already has native OTLP traces, the import is skipped to avoid duplicate spans. ## **Troubleshooting** | Symptom | Likely cause | |---|---| | No spans in the viewer, correct time window | Check the **Langfuse Integration** card in Settings. If the **Configured** chip is missing, re-save the credentials. | | Spans appear but don't match the simulation | Set `metadata.simulation_output_id` on your traces (see Correlation above). | | `401 Unauthorized` in logs | Keys were rotated in Langfuse. Click **Replace key** in Settings. | ## **See also** - [OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry) — push traces directly to Coval. - [Coval Wizard (Beta)](/concepts/simulations/traces/wizard) — auto-instrument Pipecat/LiveKit/Vapi agents. --- ## Import Traces from Arize Phoenix Source: https://docs.coval.ai/concepts/simulations/traces/arize-import Connect your Arize Phoenix project and Coval will automatically import traces into the viewer, metrics, and heatmap — no code changes required. If your agent already exports traces to [Arize Phoenix](https://docs.arize.com/phoenix), Coval can pull those spans into the trace viewer, run trace-based metrics (timings, LLM judges, custom metrics), and populate the transition heatmap — without re-instrumenting your agent. Connect your Phoenix project once in settings, and Coval handles the rest automatically after each simulation. > **Note:** **Arize AX cloud support is on the roadmap.** AX's `/v2/spans` REST endpoint indexes 15–30 minutes after OTLP ingest, which doesn't fit Coval's per-simulation 90-second import window. AX support will land alongside a delayed-retry worker. For now, Coval imports from Phoenix (which exposes spans the moment they land). ## **How it works** 1. Your agent sends spans to Phoenix as it does today. 2. 
When a Coval simulation finishes, Coval fetches the spans that fall inside the simulation's time window via `GET /v1/projects/{project}/spans/otlpv1`. 3. Spans are normalized to OpenTelemetry shape and written to the same ClickHouse-backed trace store that native OTLP ingestion uses. 4. The [trace viewer](/concepts/simulations/traces/opentelemetry), trace metrics, and transition heatmap work against the imported spans exactly as they do for native OTLP traces. Imported spans are tagged with `service.name = arize` so they are easy to distinguish in the trace viewer. ## **Prerequisites** - A Coval account ([sign up](https://app.coval.dev)) - A self-hosted [Phoenix](https://docs.arize.com/phoenix) deployment reachable from Coval's workers, or a Phoenix Cloud account, with a project name you write spans to ## **Connect Phoenix** 1. Open **Settings → Integrations** in Coval. 2. Expand the **Arize Integration** panel. 3. Fill in the required fields and save. | Field | Required | Notes | |---|---|---| | Phoenix Host | Yes | e.g. `https://app.phoenix.arize.com` for Phoenix Cloud, or your self-hosted base URL. | | Project Name | Yes | The Phoenix project name your agent writes to. | | API Key | Optional | Required for Phoenix Cloud, ignored for unauthenticated self-hosted Phoenix. | > **Note:** Coval stores the API key server-side and never returns it to the browser. To rotate the key, use the **Replace key** button in the credentials card. ## **Correlation** To tie spans back to the right Coval simulation, Coval matches span attributes on any of: - `simulation_output_id` - `session_id` - `session.id` - `coval_simulation_output_id` Coval also parses the `metadata` span attribute as JSON when present and matches the same keys inside it. If your agent already sets a session ID equal to the Coval simulation output ID, import is precise — otherwise, set one of the keys above so imports stay correct under concurrent calls. 
Both the Arize OpenInference SDKs and the OpenTelemetry SDK make setting a correlation key straightforward: set the attribute on your root span.

```python
from opentelemetry import trace as otel_trace

tracer = otel_trace.get_tracer("my-agent")

# simulation_output_id is the Coval simulation output ID your agent received for this call.
with tracer.start_as_current_span("conversation") as span:
    span.set_attribute("simulation_output_id", simulation_output_id)
    ...
```

For OpenInference instrumentation, the equivalent is:

```python
from openinference.semconv.trace import SpanAttributes

# `span` is the root span of the conversation.
span.set_attribute(SpanAttributes.SESSION_ID, simulation_output_id)
```

> **Note:** Coval fails closed when a correlation hint is set but no spans match: it returns an empty result rather than importing every span in the time window. This avoids cross-contamination between concurrent simulations on the same project.

## **Verify spans landed**

After a simulation finishes, open the result in Coval and click **View Traces**. Imported spans appear with `service.name = arize` and the original Phoenix / OpenInference attributes preserved. LLM spans are normalized to `span_name = "llm"` (matching Coval's native metric queries) regardless of the original Phoenix span name.

Coval also aliases the common Arize / OpenInference token-count attributes to OTel GenAI semantic conventions:

| Arize / OpenInference | OTel GenAI alias |
|---|---|
| `llm.token_count.prompt` | `gen_ai.usage.input_tokens` |
| `llm.token_count.completion` | `gen_ai.usage.output_tokens` |
| `input.value` | `input` |
| `output.value` | `output` |

So token usage and LLM-judge metrics work without any extra mapping.

## **Limits**

- Import runs once per simulation, synchronously, with a 90-second budget that includes a brief retry-on-empty so spans flushed slightly after the simulation ends are still picked up.
- Up to 5,000 spans per simulation are imported (1,000/page × 5 pages).
- If a simulation already has native OTLP traces, the import is skipped to avoid duplicate spans.
- Self-hosted Phoenix must be reachable from Coval's workers.
  If your Phoenix instance lives behind a private network, expose it via a public ingress.

## **Troubleshooting**

| Symptom | Likely cause |
|---|---|
| No spans in the viewer, correct time window | Check the **Arize Integration** card in Settings. If the **Configured** chip is missing, re-save the credentials. Confirm the **Project Name** matches what your agent writes to. |
| Spans appear in Phoenix but not in Coval | Set `simulation_output_id` (or `session_id` / `session.id`) on your root span (see Correlation above). |
| `401 Unauthorized` in logs | API key was rotated. Click **Replace key** in Settings. |
| `Connection refused` to a self-hosted Phoenix | Phoenix isn't publicly reachable from Coval's workers. Expose Phoenix on a routable URL. |

## **See also**

- [OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry): push traces directly to Coval.
- [Import Traces from Langfuse](/concepts/simulations/traces/langfuse-import): same flow, Langfuse source.
- [Coval Wizard (Beta)](/concepts/simulations/traces/wizard): auto-instrument Pipecat/LiveKit/Vapi agents.

---

## Import Traces from LangSmith

Source: https://docs.coval.ai/concepts/simulations/traces/langsmith-import

Connect your LangSmith project and Coval will automatically import traces into the viewer, metrics, and heatmap, with no code changes required.

If your agent already exports traces to [LangSmith](https://docs.langchain.com/langsmith), Coval can pull those runs into the trace viewer, run trace-based metrics (timings, LLM judges, custom metrics), and populate the transition heatmap without re-instrumenting your agent. Connect your LangSmith project once in settings, and Coval handles the rest automatically after each simulation.

## **How it works**

1. Your agent sends runs to LangSmith as it does today (LangChain SDK auto-instrumentation, or any OpenTelemetry exporter pointed at `https://api.smith.langchain.com/otel/v1/traces`).
2. 
   When a Coval simulation finishes, Coval resolves your LangSmith project name to its UUID via `GET /sessions`, then queries `POST /runs/query` with a `start_time` window and a `has(metadata, ...)` filter on `simulation_output_id`.
3. For every trace that had at least one direct metadata match, Coval issues a follow-up `POST /runs/query` filtered by `trace_id` to pull in the rest of the runs in that trace, so tagging only the root run still imports its descendants. Untagged traces in the window are never fetched.
4. Each LangSmith Run is normalized to an OpenTelemetry span and written to the same ClickHouse-backed trace store that native OTLP ingestion uses.
5. The [trace viewer](/concepts/simulations/traces/opentelemetry), trace metrics, and transition heatmap work against the imported spans exactly as they do for native OTLP traces.

Imported spans are tagged with `service.name = langsmith` so they are easy to distinguish in the trace viewer.

## **Prerequisites**

- A Coval account ([sign up](https://app.coval.dev))
- A LangSmith account with a project your agent writes runs to
- A LangSmith API key (Settings → API Keys in the LangSmith UI, `lsv2_pt_...`)

## **Connect LangSmith**

1. Open **Settings → Integrations** in Coval.
2. Expand the **LangSmith Integration** panel.
3. Fill in the required fields and save.

| Field | Required | Notes |
|---|---|---|
| API Key | Yes | LangSmith API key (`lsv2_pt_...`). Workspace-scoped is fine. |
| Project Name | Yes | The LangSmith project name (a.k.a. session name) your agent writes to. |
| Host | Yes | Defaults to `https://api.smith.langchain.com`. Use `https://eu.api.smith.langchain.com` for EU GCP or `https://aws.api.smith.langchain.com` for AWS US. |

> **Note:** Coval stores the API key server-side and never returns it to the browser. To rotate the key, use the **Replace key** button in the credentials card.
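
The two-step selection described under How it works (match tagged runs first, then pull in the rest of each matched trace) can be sketched in memory as follows. This is an illustration under stated assumptions, not Coval's code: it operates on a plain list of run dicts with simplified, hypothetical `trace_id` and `metadata` fields instead of calling the LangSmith API, and the function names are invented for the example. The metadata keys are the ones this page lists under Correlation.

```python
# Metadata keys named in this page's Correlation section.
CORRELATION_KEYS = ("simulation_output_id", "session_id", "coval_simulation_output_id")


def matching_trace_ids(runs, simulation_output_id):
    """Step 2: IDs of traces that contain at least one run tagged with the simulation ID."""
    trace_ids = set()
    for run in runs:
        metadata = run.get("metadata") or {}
        if any(
            key in metadata and metadata[key] == simulation_output_id
            for key in CORRELATION_KEYS
        ):
            trace_ids.add(run["trace_id"])
    return trace_ids


def runs_to_import(runs, simulation_output_id):
    """Step 3: every run in a matched trace, tagged or not. Untagged traces are skipped."""
    wanted = matching_trace_ids(runs, simulation_output_id)
    return [run for run in runs if run["trace_id"] in wanted]
```

With this shape, tagging only the root run of a trace is enough for its untagged child runs to be selected, while a trace with no tagged run contributes nothing.
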
## **Correlation**

To tie runs back to the right Coval simulation, set `simulation_output_id` (or `session_id` / `coval_simulation_output_id`) in the run's `metadata`. Coval's filter expression matches on metadata key/value pairs server-side, then verifies the match client-side as a fail-closed safety net.

LangChain SDK:

```python
from langsmith import traceable

@traceable(metadata={"simulation_output_id": simulation_output_id})
def handle_call(payload):
    ...
```

Or set metadata at trace start:

```python
from langsmith.run_helpers import trace

with trace(
    "conversation",
    metadata={"simulation_output_id": simulation_output_id},
) as run_tree:
    ...
```

OpenTelemetry SDK exporting to LangSmith's `/otel/v1/traces`:

```python
from opentelemetry import trace as otel_trace

tracer = otel_trace.get_tracer("my-agent")

with tracer.start_as_current_span("conversation") as span:
    # LangSmith's OTel ingestion only maps attributes prefixed with
    # `langsmith.metadata.` into the run `metadata` field that
    # `/runs/query` filters on.
    span.set_attribute("langsmith.metadata.simulation_output_id", simulation_output_id)
    ...
```

When the root run carries the metadata, Coval also pulls in its child runs in the same trace, so you don't need to tag every span individually.

> **Note:** Coval fails closed when a correlation hint is set but no runs match: it returns an empty result rather than importing every run in the time window. This avoids cross-contamination between concurrent simulations on the same project.

## **Verify runs landed**

After a simulation finishes, open the result in Coval and click **View Traces**. Imported spans appear with `service.name = langsmith` and the original LangSmith run attributes preserved. LLM runs (`run_type = "llm"`) are normalized to `span_name = "llm"` (matching Coval's native metric queries) regardless of the original LangSmith run name.

Coval also extracts token counts from the various places LangSmith stores them (`outputs.usage_metadata`, `outputs.llm_output.token_usage`, `extra.metadata.usage`) and exposes them as OTel GenAI semantic conventions:

| LangSmith field | OTel GenAI alias |
|---|---|
| `outputs.usage_metadata.input_tokens` (or `prompt_tokens`) | `gen_ai.usage.input_tokens` |
| `outputs.usage_metadata.output_tokens` (or `completion_tokens`) | `gen_ai.usage.output_tokens` |
| `outputs.usage_metadata.total_tokens` | `gen_ai.usage.total_tokens` |
| `extra.metadata.ls_model_name` (or `extra.invocation_params.model`) | `gen_ai.request.model` |
| `inputs` | `input` |
| `outputs` | `output` |

So token usage and LLM-judge metrics work without any extra mapping.

## **Limits**

- Import runs once per simulation, synchronously, with a 90-second budget that includes a brief retry-on-empty so runs flushed slightly after the simulation ends are still picked up.
- Up to 500 runs per simulation are imported (100/page × 5 pages).
- If a simulation already has native OTLP traces, the import is skipped to avoid duplicate spans.
- LangSmith's `/runs/query` endpoint is available on all plans. Bulk export to S3 (Plus/Enterprise only) is not used.

## **Troubleshooting**

| Symptom | Likely cause |
|---|---|
| No spans in the viewer, correct time window | Check the **LangSmith Integration** card in Settings. If the **Configured** chip is missing, re-save the credentials. Confirm the **Project Name** matches the project your agent writes to. |
| Spans appear in LangSmith but not in Coval | Set `metadata.simulation_output_id` on the root run (see Correlation above). |
| `401 Unauthorized` in logs | API key was rotated or revoked. Click **Replace key** in Settings. |
| `404` on `/sessions` | The configured Host is wrong for your account region. Try `https://eu.api.smith.langchain.com` (EU) or `https://aws.api.smith.langchain.com` (AWS US). |

## **See also**

- [OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry): push traces directly to Coval.
- [Import Traces from Langfuse](/concepts/simulations/traces/langfuse-import): same flow, Langfuse source.
- [Import Traces from Arize Phoenix](/concepts/simulations/traces/arize-import): same flow, Arize/Phoenix source.
- [Coval Wizard (Beta)](/concepts/simulations/traces/wizard): auto-instrument Pipecat/LiveKit/Vapi agents.

---

## Tracing Wizard

Source: https://docs.coval.ai/concepts/simulations/traces/wizard

Automatically add Coval OTel tracing to your Python voice agent with one command.

> **Warning:** The Coval Wizard is in beta and under active development, with more features coming soon! It uses an LLM to analyze and modify your code, so results may vary. Always review the proposed diff carefully before applying changes.

The Coval Wizard ([`@coval/wizard`](https://www.npmjs.com/package/@coval/wizard)) reads your Python agent code, figures out exactly where to inject tracing, and writes the changes for you, including a diff preview, file backup, and connectivity validation. It works for [Pipecat](https://pipecat.ai), [LiveKit Agents](https://docs.livekit.io/agents/), [Vapi](https://vapi.ai), and generic Python agents.

> **Tip:** If your team cannot run a one-command code modifier, use [Coval Tracing Skills](/concepts/simulations/traces/tracing-skills) instead. The skills give your AI coding agent a reviewable, prompt-driven setup flow for tracing, trace optimization, trace metrics, and debugging.

## **Quick Start**

Run this from your agent's project directory:

```bash
npx @coval/wizard
```

The wizard will prompt you for your Coval API key if `COVAL_API_KEY` is not already set in your environment.

```bash
# With API key pre-set
COVAL_API_KEY=your-key npx @coval/wizard
```

## **What It Does**

1. **Detects your project.** Scans the directory for a Python project manifest (`pyproject.toml`, `requirements.txt`, `Pipfile`, or `setup.py`) and identifies your framework (Pipecat, LiveKit, Vapi, or generic Python).
2. **Analyzes your code.** Sends your entry point file to an LLM along with framework-specific injection rules to determine the minimal changes needed.
3. **Shows a diff.** Displays a colored diff of the proposed changes to your entry point and asks for confirmation before writing anything.
4. **Writes the files.** Creates `coval_tracing.py` (a self-contained OpenTelemetry module) and modifies your entry point. Your original entry point is backed up to `.bak`.
5. **Validates connectivity.** Sends a test span to `api.coval.dev` to confirm your API key is working and spans can reach Coval.

If you don't have an API key yet, go to **Settings** in the Coval platform and click **Create Key**.
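
To make the project-detection step more concrete, here is a rough Python sketch of how manifest and framework detection could work. It is not the wizard's actual code (the wizard is an npm package). The function names and the simple substring check are assumptions for illustration. The manifest filenames come from the detection step above, and the `pipecat-ai` and `livekit-agents` dependency names come from the Supported Frameworks table. Vapi detection, which relies on webhook patterns in `.py` files, is left out because those patterns are not specified here.

```python
from pathlib import Path

# Manifest filenames named in the detection step.
MANIFESTS = ("pyproject.toml", "requirements.txt", "Pipfile", "setup.py")

# (framework label, dependency name) pairs; the labels are hypothetical.
FRAMEWORK_MARKERS = (
    ("pipecat", "pipecat-ai"),
    ("livekit", "livekit-agents"),
)


def find_manifests(project_dir):
    """Return the manifest files that exist directly under project_dir."""
    root = Path(project_dir)
    return [root / name for name in MANIFESTS if (root / name).is_file()]


def detect_framework(project_dir):
    """Return a framework label, "generic" for other Python projects, or None if no manifest."""
    manifests = find_manifests(project_dir)
    if not manifests:
        return None  # no manifest: not treated as a Python project
    text = "\n".join(
        path.read_text(encoding="utf-8", errors="ignore").lower() for path in manifests
    )
    for framework, dependency in FRAMEWORK_MARKERS:
        if dependency in text:
            return framework
    return "generic"
```

A substring check like this is deliberately naive. A real detector would more likely parse each manifest format so that a comment or an unrelated package name cannot trigger a false match.
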

## **Supported Frameworks**

| Framework | Detection | What gets injected |
|-----------|-----------|-------------------|
| **Pipecat** | `pipecat-ai` in dependencies | `setup_coval_tracing()` before pipeline construction; simulation ID from `args.body` SIP headers; `enable_metrics=True, enable_tracing=True` on `PipelineTask` |
| **LiveKit Agents** | `livekit-agents` in dependencies | `setup_coval_tracing()` before `AgentSession`; simulation ID from SIP participant attributes; `instrument_session(session)` after `await session.start()` |
| **Vapi** | Vapi webhook patterns in `.py` files | `setup_coval_tracing()` at module level; simulation ID extracted from `assistantOverrides.variableValues["coval-simulation-id"]` in webhook handler |
| **Generic Python** | Any Python project | `setup_coval_tracing()` at module level; `# TODO` comment marking where to call `set_simulation_id()` |

## **What It Sets Up**

The wizard creates `coval_tracing.py`, a self-contained module you import in your agent. It provides three public functions:

```python
from coval_tracing import setup_coval_tracing, set_simulation_id, instrument_session

# Call once at startup (or at the start of each call session)
setup_coval_tracing(service_name="my-agent")

# Call when the Coval simulation ID arrives (e.g. from a SIP header)
set_simulation_id(simulation_id)

# LiveKit only: hooks session events to emit STT, LLM, and tool call spans
instrument_session(session)
```

Spans emitted before `set_simulation_id()` is called are buffered and flushed automatically once the ID arrives, so no spans are lost even if tracing initializes before the call connects.

The module also includes convenience helpers (`create_llm_span`, `create_stt_span`, `create_tts_span`, `create_tool_call_span`) for manually wrapping operations in spans if needed.

## **What It Doesn't Set Up**

> **Warning:** The wizard installs the tracing infrastructure but does **not** add advanced span attributes automatically, at least not *yet*.
> For richer observability, you will need to add these manually after running the wizard:
>
> - `stt.confidence`: ASR confidence score per utterance (0.0–1.0). Requires hooking into your STT provider's result to extract the confidence value.
> - `llm.finish_reason`: Why the LLM stopped generating (`stop`, `tool_calls`, `length`). Requires observing the LLM response before emitting the span.
> - `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`: LLM token counts. Requires extracting token usage from your LLM response.
> - `stt.provider.` sub-spans: Per-provider attempt spans for fallback chains. Requires wrapping each provider call individually.
>
> See the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents) for complete reference implementations that include all of these attributes.

## **Environment Variables**

| Variable | Required | Description |
|----------|----------|-------------|
| `COVAL_API_KEY` | Yes | Your Coval API key. Prompted interactively if not set. |
| `WIZARD_LLM_KEY` | No | LLM API key for direct use. |
| `WIZARD_LLM_PROVIDER` | No | Optional provider identifier for direct use. |
| `WIZARD_LLM_MODEL` | No | Optional model override for direct use. |

## **Limitations**

- **Python only**: The wizard requires a Python project manifest (`pyproject.toml`, `requirements.txt`, `Pipfile`, or `setup.py`) in the target directory.
- **Single entry point**: Only one file is analyzed and modified. The entry point must be named `agent.py`, `main.py`, `bot.py`, `app.py`, or `server.py`, or be the sole `.py` file in the project root. Entry points larger than 50 KB are not supported.
- **LLM-generated code**: The wizard uses a language model to determine what to inject. Results are generally accurate for standard Pipecat, LiveKit, and Vapi patterns, but unusual project structures may require manual corrections. Always review the diff before confirming.
- **Generic Python wizard support is minimal**: For projects that aren't one of the three supported frameworks, the wizard adds `setup_coval_tracing()` and a `# TODO` comment. You will need to call `set_simulation_id()` manually and instrument spans yourself.
- **Validation is connectivity-only**: The validation step confirms that your API key can reach `api.coval.dev`. It does not verify that spans are correctly wired to your agent's call lifecycle.
- **No multi-file analysis**: The wizard reads your entry point and dependency manifest only. It does not analyze helper modules, shared utilities, or subpackages.

## **Next Steps**

After running the wizard:

1. Deploy your updated agent and run a simulation in Coval.
2. Open the result and look for the **OTel Traces** card. Traces appear within a few seconds of the simulation completing.
3. To add richer span attributes (`stt.confidence`, `llm.finish_reason`, provider sub-spans), see the [OpenTelemetry guide](/concepts/simulations/traces/opentelemetry) and the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents).

---

## Tracing Skills

Source: https://docs.coval.ai/concepts/simulations/traces/tracing-skills

Use Coval external skills to instrument, validate, optimize, and debug agent traces with a reviewable AI-assisted workflow.

Coval tracing skills are promptable instructions for your AI coding agent. They help your agent inspect your repository, add OpenTelemetry export to Coval, verify one real trace, improve span quality, create trace-based metrics, and debug missing or sparse traces.

Use tracing skills when you want a reviewable workflow instead of a one-command code modification. The skills do not change your code by themselves. Your coding agent reads the skill instructions, proposes a plan, edits your repo only when you approve that workflow, and leaves you with a diff you can review.
> **Tip:** If you want a one-command Python setup for a supported voice agent, use the [Coval Wizard](/concepts/simulations/traces/wizard). If your team needs a slower, reviewable process or has a custom agent shape, use the tracing skills. ## **Copy this for your coding agent** Open your agent repository in your coding agent of choice, then copy and paste this prompt. ```text I want you to configure this agent to send traces to Coval using Coval's external tracing skills. Use this skills repository: https://github.com/coval-ai/coval-external-skills If your environment supports Agent Skills, install or load the repo with: npx skills add coval-ai/coval-external-skills Then use the tracing skills in `skills/traces/`: 1. Start with `setup-tracing` to add Coval OpenTelemetry export and launch one real Coval validation run. 2. While that run is pending, use the waiting time to improve trace coverage and prepare trace metrics instead of sitting idle. 3. After spans appear, use `optimize-trace-observability` to confirm and refine span coverage and attributes. 4. Use `configure-trace-metrics` to recommend and create useful custom trace metrics from confirmed real span data. 5. Use `debug-traces` if traces are missing, sparse, duplicated, or attached to the wrong Coval result. Do not ask me to fill out a setup template first. Infer as much as you can from the repository, project files, existing Coval configuration, deployment files, logs, README/runbooks, and authenticated Coval CLI/API access if available. 
During read-only discovery, determine: - Coval agent type: SIP inbound voice, inbound phone over PSTN, outbound voice, WebSocket voice/chat, or conversation monitoring - Coval agent ID or name - How the agent runs locally - How the agent is deployed - How to trigger one test in Coval - Existing observability in this repo, such as OpenTelemetry, Langfuse, Arize, LangSmith, or none Rules: - Assume you do not have access to Coval internal backend, frontend, docs, wizard, research, or example source repositories. Do not ask me for Coval internal source code. Use this agent repository, public Coval docs, Coval CLI/API access, and fetched public OpenAPI specs only. - Make code changes only in this customer-owned agent repository. Coval-side changes must be limited to documented configuration through the Coval CLI, public API, or dashboard, and you should explain them before mutation. - Start with read-only analysis. Do not edit files until you summarize the current agent entry point, Coval connection path, correlation ID path, existing telemetry, and smallest additive implementation plan. - Ask clarifying questions only for details you cannot discover and that block safe implementation or validation. Keep questions concise and grouped. - Do not ask me to choose between SIP headers, pre-call webhooks, registration endpoints, trigger payloads, or WebSocket initialization when the repository and Coval agent configuration make one route clearly safer. Pick the route, explain why, and ask only for concrete missing permission such as exposing a webhook, updating Coval agent configuration, or provisioning SIP. - Do not print, store, or hard-code secrets. If you need credentials, ask me to provide them as environment variables. - If this is an inbound phone/PSTN agent, do not assume SIP headers are available. Use a pre-call or registration-webhook correlation path, or tell me if I need a SIP-capable route. 
- Keep changes additive and preserve existing telemetry providers when present. - Run relevant local checks and tell me exactly what to deploy. - Validate with one real Coval simulation or monitoring conversation launched through the Coval CLI/API. This is asynchronous: start the run, capture the run/result/conversation IDs, poll through CLI/API, and while waiting continue with safe trace enrichment and metric preparation. Create custom trace metrics while waiting only if existing or in-flight Coval traces already prove the span/attribute exists; otherwise stage the metric definitions and create them once the validation trace appears. - If traces are low-signal (for example most spans share one generic name such as `conversation`), treat setup as incomplete and use `optimize-trace-observability` to add role- and phase-specific spans and attributes before calling the workflow done. - After creating or updating trace metrics, launch or reuse a recent run with traces and confirm each metric reaches a terminal computed state. Do not stop at `IN QUEUE` or `IN PROGRESS`. - Report the files changed, commands run, correlation path used, Coval launch/polling commands or API endpoints, metrics created or staged, and the Coval simulation or conversation ID that proves tracing works. 
- Include direct URLs in this exact format: - display of all runs: `https://app.coval.dev//runs?sort=createdAt%3Adescending` - display of specific run: `https://app.coval.dev//runs/` - display of simulation within run: `https://app.coval.dev//runs//results/` - display of traces for that simulation: `https://app.coval.dev//runs//results//traces` ``` ## **Install the skills** From any directory where your coding agent can access installed skills: ```bash npx skills add coval-ai/coval-external-skills ``` If your environment does not allow `npx`, clone and review the repository directly: ```bash git clone https://github.com/coval-ai/coval-external-skills.git ``` Then point your coding agent at `coval-external-skills/skills/traces/` and ask it to use the skill you need. ## **Tracing skill inventory** | Skill | Use it when | |-------|-------------| | `setup-tracing` | Your agent is connected to Coval but does not send traces yet. | | `optimize-trace-observability` | Traces exist, but they are sparse or missing useful spans and attributes. | | `configure-trace-metrics` | You have traces and want metrics for latency, tool behavior, errors, or business events. | | `debug-traces` | Traces are missing, attached to the wrong result, duplicated, too large, or not useful in production. | ## **Before you start** Have these ready: - Your agent repository - A Coval API key with access to the organization that owns the agent - The Coval agent ID or the exact agent name - The Coval connection type for the agent - A command or runbook for one test simulation or monitoring conversation - A deployment path for the changed agent Do not paste API keys into prompts. 
Set them in your shell or secret manager instead: ```bash export COVAL_API_KEY="" ``` If you use the Coval CLI, authenticate before asking your coding agent to work: ```bash brew install coval-ai/tap/coval coval login coval whoami ``` ## **Choose the right correlation path** Coval traces must be tied to one simulation output or one submitted conversation. Tell your coding agent which path your agent uses. | Agent path | What the skill should wire | |------------|----------------------------| | SIP inbound voice | Read `X-Coval-Simulation-Id` from SIP headers or framework participant attributes, then export spans with `X-Simulation-Id`. | | Inbound phone over PSTN | Do not expect SIP headers. Use a pre-call or registration webhook correlation path, or provision a SIP-capable address. | | Outbound voice | Carry `simulation_output_id` through the outbound trigger payload, then export spans with `X-Simulation-Id`. | | WebSocket voice or chat | Carry the simulation output ID through the initial setup payload, then export spans with `X-Simulation-Id`. | | Conversation monitoring | Buffer spans during the conversation, submit the conversation, then export spans with `X-Conversation-Id`. | ## **Detailed setup prompt** Use this longer prompt when you only want the first setup step. ```text Use the Coval `setup-tracing` skill from `coval-ai/coval-external-skills`. Goal: configure this agent to send OpenTelemetry traces to Coval and prove one real trace reaches Coval. Do not ask me to fill out a setup template first. Infer as much as you can from the repository, project files, existing Coval configuration, deployment files, logs, README/runbooks, and authenticated Coval CLI/API access if available. Instructions: 1. Treat Coval internals as unavailable. Do not ask for or rely on Coval backend, frontend, docs, wizard, research, or example source repositories. Use this agent repository, public Coval docs, Coval CLI/API access, and fetched public OpenAPI specs only. 
Make code changes only in this customer-owned agent repository. 2. Start with read-only analysis. Do not edit files until you summarize: - the app entry point - the current Coval connection path - where the Coval simulation output ID or conversation ID can enter the process - whether existing telemetry should be reused - how the agent appears to run locally - how the agent appears to be deployed - how to trigger one Coval validation - the smallest additive implementation plan 3. Ask clarifying questions only for details you cannot discover and that block safe implementation or validation. Keep questions concise and grouped. 4. If this is an inbound phone/PSTN agent, do not assume SIP headers are available. Guide me through a pre-call or registration-webhook correlation path, or tell me if I need a SIP-capable route. 5. Do not ask me to choose between SIP headers, pre-call webhooks, registration endpoints, trigger payloads, or WebSocket initialization when the repository and Coval agent configuration make one route clearly safer. Pick the route, explain why, and ask only for concrete missing permission such as exposing a webhook, updating Coval agent configuration, or provisioning SIP. 6. Keep changes additive. Do not hard-code API keys or replace existing telemetry providers. 7. Add only the dependencies and helper files needed for tracing. 8. Run the relevant local checks for this repo. 9. Tell me exactly what to deploy, then launch one Coval validation through the Coval CLI/API. Capture the run ID, simulation output ID, conversation ID, dashboard URL, or polling endpoint returned by the launch. 10. Poll the validation through CLI/API with a bounded interval and timeout. While it is pending, continue with safe trace enrichment and metric preparation rather than waiting idle. 11. Create custom trace metrics during the wait only if historical traces or the in-flight validation already prove the span name and metric attribute exist. 
If this is the first trace for the agent, stage the metric definitions and create them immediately after the validation trace appears. 12. After validation, summarize the files changed, the correlation path used, commands run, Coval launch/polling commands or API endpoints, metrics created or staged, and the Coval simulation or conversation ID that proves the trace worked. 13. If spans are low-signal (for example mostly `conversation`), keep going with `optimize-trace-observability` and produce a second validation trace that demonstrates richer span coverage. 14. After creating or updating metrics, run or inspect a trace-backed simulation result and confirm terminal metric outputs. Report the metric output status and value, not only metric creation. 15. Output proof URLs for runs list, run, result, and result traces using this exact format: - `https://app.coval.dev//runs?sort=createdAt%3Adescending` - `https://app.coval.dev//runs/` - `https://app.coval.dev//runs//results/` - `https://app.coval.dev//runs//results//traces` ``` ## **After validation starts** Once the initial validation run has started, use the follow-up skills while it is pending. They should still verify against the finished Coval trace before declaring success. ### Improve trace quality ```text Use the Coval `optimize-trace-observability` skill. I have launched or already have a Coval validation trace for this agent. Improve the trace quality so it is useful for debugging and trace-based metrics. Find a recent traced simulation or conversation through the repo, Coval CLI/API, dashboard context I provide, or Trace Search if available. If the current validation run is still pending, poll it through the Coval CLI/API and use the waiting time for safe code-visible enrichment. If you cannot find or launch a validation run, ask me for the smallest missing detail before editing. Start by inspecting the current spans. Prefer enriching existing framework/provider spans over creating duplicates. 
Add or improve spans for conversation, turn, speech recognition, language model calls, speech synthesis, tool calls, transport, or pipeline steps where the code exposes them. Add safe attributes such as transcripts, time to first byte, finish reasons, token counts, tool names, bounded tool arguments, status, and errors. Do not record secrets, raw audio, or unbounded payloads. Run local checks, then tell me what changed and what new debugging questions the trace can answer. ``` ### Create trace metrics ```text Use the Coval `configure-trace-metrics` skill. I have traces for this agent, or an initial validation run is currently pending, and want a small set of high-signal metrics. Infer the agent, recent traced results, and likely use case from the repository, Coval CLI/API, existing run history, dashboards, docs, or naming. If a validation run is pending, poll it through the Coval CLI/API while preparing metric definitions. Inspect real span names and attributes first. Recommend 3-6 custom trace metrics using attributes that actually exist. Include the span name, metric attribute when needed, aggregation method, unit, and why each metric is useful. Create metrics while waiting only when historical traces or the in-flight validation already prove the span/attribute exists. If the initial validation is the first trace, stage the metric definitions and create them once the trace appears. Ensure at least two metrics are high-signal for agent success, health, or performance (for example interruption rate, tool failure rate, critical tool latency, or end-to-end conversation success proxy). If the business goal is ambiguous, ask me what outcomes I care about before creating metrics. ``` ### Debug missing or sparse traces ```text Use the Coval `debug-traces` skill. Expected traces are missing, sparse, duplicated, or attached to the wrong Coval result. 
Infer the agent type, likely Coval result, export path, and recent errors from the repository, Coval CLI/API, logs, dashboard context I provide, and existing tracing code. If you cannot identify the failing Coval run, simulation, conversation, or export response, ask me for the smallest missing detail needed to continue. Separate the failure boundary first: agent export, Coval ingestion, target ID correlation, organization/API key, UI lookup, payload size, duplicate retries, or sparse span coverage. Ask me to run commands locally if needed, but do not ask for raw API keys. ``` ## **What success looks like** A completed setup should leave you with: - A reviewed diff in your agent repository - No hard-coded secrets - One clear correlation path: `X-Simulation-Id` or `X-Conversation-Id` - A successful local check - A deployed agent that still handles calls or messages normally - A completed Coval simulation or conversation with a trace visible in the OTel Traces card or Trace Search - At least one useful span such as `conversation`, `turn`, `stt`, `llm`, `tts`, or `llm_tool_call` - Custom trace metrics that are created and computed from real spans in at least one completed run/result - Direct proof URLs for runs list, run, result, and trace viewer For production debugging, continue with `optimize-trace-observability` and `configure-trace-metrics` so traces answer the questions your team actually investigates. ## **See also** - [OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry) - [Coval Wizard](/concepts/simulations/traces/wizard) - [Custom Trace Metrics](/concepts/metrics/custom-trace-metrics) - [Trace Search](/concepts/conversations/trace-search) --- ## Dashboard Source: https://docs.coval.ai/concepts/dashboard/overview Monitor, analyze, and optimize your voice AI performance with real-time insights and drill-down capabilities. 
# Analytics Dashboard for Voice AI ![Coval's Dashboard](/images/dashboard/overview.png) Monitor, analyze, and optimize your voice AI performance with real-time insights and drill-down capabilities. Whether you're tracking simulation results, monitoring live system health, or analyzing conversation patterns, our dashboard provides the tools you need to make data-driven decisions. ## **Flexible Dashboard** - Multiple Configurable Dashboards: Organize and switch between dashboards instantly to monitor different aspects of your voice AI system - Intuitive Layout: Rearrange and resize widgets; the layout adapts automatically to any screen size, and your changes are saved automatically ## Widget Configuration with Live Preview [Image: Widget Configuration] - Choose from all available metrics on Coval's platform - See your changes applied immediately in the live preview pane - Filter widgets by specific agent IDs for focused monitoring - Add multiple aggregation dimensions (Agent + Persona combinations) ## **Widget Library** ### **Bar Charts** - Compare performance across categories with regular or stacked visualizations - Perfect for analyzing success rates, error distributions, or agent performance comparisons - Smart color-coding for binary metrics (Yes/No scenarios) ### **Line Charts** - Track trends over time with multi-series support - Built-in outlier detection to quickly spot anomalies - Ideal for monitoring response times, call volumes, or quality scores ### **Area Charts** - Stacked area charts for visualizing cumulative trends over time - Multi-series support with configurable fill opacity - **Percentage mode**: Toggle "Show as percentage" to normalize stacked series to 100% for share-of-whole analysis - **Drag to zoom**: Click and drag across any time range to zoom in; click the reset button to return to the full range - Custom X and Y axis labels supported ### **Pie / Donut Charts** - Donut-style chart for visualizing proportional breakdowns across metric output values or groups - 
Interactive slices — click any slice to drill down into the individual runs behind that segment - Auto-generated legend with smart grid layout for large numbers of categories - Color-coded by value or split dimension (agent, persona, etc.) ### **Statistic** Display a single aggregated metric value as a large, prominent number — ideal for KPI summaries and at-a-glance health checks. - Supports **avg, sum, count, min, max** aggregation modes - Shows supporting statistics: sample count (n=), standard deviation (σ), and min/max range - Optional **box plot** overlay shows the distribution of values alongside the headline number - **Grouped mode**: When split by agent, persona, or template, renders a grid of stat cards — one per group — with density-adaptive layout (normal / compact / dense) based on widget size ### **Top List** - Horizontal bar chart showing the top 10 values ranked by metric score - Ideal for surfacing best/worst performing agents, personas, or string metric outcomes - Supports click-through drill-down to the underlying runs - Color-coded using your metric's custom color map or target condition values ### **Histogram** - Distribution chart showing how frequently metric values fall into each value bucket - Useful for understanding spread and outliers in numeric metrics (e.g., latency, duration, scores) - Automatically calculates bin sizes based on the data range - Click any bin to drill down into the runs in that value range ### **Table** - Tabular view of metric data, sortable by column - Shows one row per time bucket or group dimension - Useful for detailed audits and exporting raw metric values alongside metadata ### **Text** - Display static text or notes anywhere on the dashboard - Use for section headers, instructions, or annotations alongside metric widgets - No data source required — purely presentational ### **Human Review Management** Monitor and manage your human review assignees directly from the dashboard. 
These widgets give you a centralized view of review progress and activity across all your projects. [Video: Review Management Walkthrough](https://www.loom.com/embed/9b96bd12624e4624b860e56f0f5aa375) ### **Threshold / Target Zone Visualization** Display your org-level metric thresholds directly on dashboard charts to quickly assess whether performance is within acceptable bounds. - **Enable it**: Toggle **"Show target zone"** in the widget configuration panel of a success rate bar chart or average line chart widget - **Line charts**: renders a shaded "fail zone" above or below the threshold line, making out-of-range periods immediately visible - **Bar charts**: renders a dashed reference line at the threshold value for at-a-glance comparison against each bar - The threshold value is pulled automatically from the custom threshold set in the metrics page ### **Question Monitoring** - Track unanswered questions and conversation gaps - Identify areas where your AI needs improvement - Monitor user satisfaction and engagement patterns ## **Filtering & Analysis** ### **Multi-Dimensional Aggregation** - **Agent-Level Analysis**: Compare performance across different AI agents - **Persona-Based Insights**: Analyze how different conversational personas perform - **Combined Analysis**: Mix and match agents and personas for comprehensive views - **Binary Metric Support**: Automatic Yes/No breakdowns for quality metrics ### **Intelligent Date Range Management** - **Per-Widget Control**: Set different time periods for each widget - **Global Overrides**: Apply date ranges across all widgets instantly - **Smart Presets**: Today, Yesterday, Last 7/30/90 days, Year-to-date - **Custom Ranges**: "Last N days/weeks/months" or specific date periods ### **Metadata Filtering** Filter your dashboard data using custom metadata fields attached to your simulations or live calls. 
This allows you to segment your analytics by any attributes you track—such as customer tier, campaign ID, region, or experiment variant. > **Note:** Metadata filters work with any key-value pairs you've included in your > simulation or conversation data. Values are matched exactly, and multiple > filters are combined with AND logic. **How it works:** - **Key Selection**: Choose from existing metadata keys or enter a custom key name - **Value Filtering**: Select from suggested values or type your own custom value - **Search Support**: Quickly find values by typing—results filter as you type - **Multiple Filters**: Add several metadata filters to narrow down your analysis **Common Use Cases:** - Compare performance across different customer segments - Analyze A/B test results by experiment variant - Filter by deployment environment (staging vs. production) - Segment by geographic region or language ### **Test Case Filtering** Filter widget data by specific test cases to isolate performance on a particular subset of scenarios. 
- Select one or more test cases from any test set connected to your simulations - Combine with agent, persona, and metadata filters for precise segmentation - Useful for regression tracking on a fixed set of canonical test inputs ### **Powerful Filtering Options** - Filter by agent types, conversation attributes, metadata fields, and test cases - Real-time filter application without page refresh - Save and reuse common filter combinations ## **Deep Dive Analysis** ### **Focus Mode** - Click any data point to enter full-screen analysis mode - 50/50 split view: chart on left, detailed run data on right - Seamless transition back to overview dashboard ### **Run Details Investigation** - Drill down from aggregate charts to individual conversation data - See exactly which calls contributed to each data point - Investigate outliers and anomalies at the source level - Full context for root cause analysis ![Run Details Investigation](/images/dashboard/run-details.png) ### **CSV Export** Export drill-down data directly from the run details panel. - **Internal links**: Generates links accessible to logged-in team members - **Shareable links**: Makes the linked simulations public so you can share them externally - Exports all visible columns including metric values, timestamps, and run metadata ### **Smart Data Bucketing** - Automatic time bucket optimization based on data density - Calendar-aware grouping (hours, days, weeks, months) - Timezone-aware calculations for accurate reporting ## **Alerts from Widgets** Create a monitor directly from any metric widget without leaving the dashboard. Click the bell icon on a widget and configure: - **Monitor name**: Label for the alert - **Threshold**: The value that triggers the alert (GT, GTE, LT, LTE, or EQ) - **Run types**: Apply the monitor to simulation runs, monitoring runs, or both This is a shortcut to the Monitors page — the widget's metric is pre-populated so you only need to set the threshold condition. 
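
The threshold conditions listed above can be read as ordinary comparisons. The following is a minimal Python sketch of that reading, not Coval's implementation. It assumes a monitor fires when `value <condition> threshold` holds; `threshold_breached` and the sample values are hypothetical names chosen for illustration.

```python
import operator

# Assumed mapping from the condition codes shown in the widget dialog
# (GT, GTE, LT, LTE, EQ) to standard comparison operators.
CONDITIONS = {
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
    "EQ": operator.eq,
}


def threshold_breached(value: float, condition: str, threshold: float) -> bool:
    """Return True when `value <condition> threshold` holds (assumed semantics)."""
    return CONDITIONS[condition](value, threshold)


# Hypothetical example: alert when a success rate drops below 0.8.
print(threshold_breached(0.72, "LT", 0.8))
```

Under this reading, `GT` and `GTE` differ only when the metric value equals the threshold exactly, which matters for metrics that often sit on a round number.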
--- > **Tip:** **Ready to transform your voice AI analytics?** Our dashboard platform > provides the insights and tools you need to monitor, analyze, and optimize > your voice AI performance with confidence. --- ## Human Review Source: https://docs.coval.ai/guides/improving-metrics-with-human-review Step-by-step guide to refine your metrics using human review ## Overview Coval's human review projects give you real feedback on the accuracy of your metrics. Create a project, label your simulations, and then use the metrics studio to improve metric performance. Note: Any conversation can be annotated in the results page, regardless of human review projects. > **Note:** Human review is supported for a subset of metric types. See [Supported Metric > Types](/concepts/metrics/human-review/human-review#supported-metric-types) for > the full list. ## Project Types When creating a human review project, you can choose between two modes using the **Collaborative** toggle. ### Collaborative Projects In **Collaborative** mode, all reviewers share a single queue and work toward a unified set of labels. **How it works:** - Each metric-simulation pair has one shared annotation — only one review score is recorded per pair - Reviewers can see each other's existing annotations as pre-fill when they open a conversation - A 10-second polling lock prevents two reviewers from annotating the same row at the same time — if a metric is locked, another reviewer is actively annotating it - An assignment is marked complete as soon as _any_ reviewer submits an annotation **Best for:** Building a ground-truth dataset, dividing labeling work across a team without duplication, or any scenario where a single authoritative label per conversation is the goal. ### Individual Projects (Default) In **Individual** mode, each reviewer has their own private queue and annotations. Reviewers cannot see each other's work. 
**Best for:** Measuring inter-annotator agreement, collecting multiple independent labels for the same conversation, or comparing perspectives across reviewers. > **Tip:** Use **Collaborative** mode when you want one ground-truth label per > conversation. Use **Individual** mode when you want to measure consistency or > collect multiple independent perspectives. ## Step-by-Step Workflow **Step: Create a human review project** Human review projects help you manage the assignees, simulations, and metrics you want to track accurately. ![Create a project](/images/human-review/project.png) 1. Navigate to the [projects tab of the **Human Review** page](https://app.coval.dev/appointmentdemo/review/projects) 2. Choose which metrics you would like to label 3. Assign labelers to the project > **Tip:** Set auto-add rules so that conversations meeting a certain condition (or all conversations) are added for review. **Step: Add Conversations** Add conversations to label. You can do this from the runs page, the conversations page, or a single simulation. ![Add runs from the runs page](/images/human-review/add-runs.png) **Step: Open your assignments** Navigate to the [Human Review page](https://app.coval.dev/review) in the Coval Dashboard. The **Assignments** tab shows all pending annotations assigned to you. Click on an assignment to open the review interface with the conversation transcript (and audio player for voice simulations) alongside the metrics to evaluate. **Step: Label conversations** Read transcripts, listen to the audio, and provide your ground-truth assessment for each metric: - **Binary metrics**: Select Yes, No, or N/A - **Numerical metrics**: Enter a value within the configured range - **Categorical metrics**: Choose from the dropdown - **Audio region metrics**: Mark or edit regions on the waveform timeline - **Composite metrics**: Toggle MET / NOT_MET / UNKNOWN for each criterion Optionally add notes to explain your labeling decision.
Notes can be positioned anywhere in the review interface and are visible to project collaborators. Use keyboard shortcuts `h` / `l`, `a` / `d`, or the left and right arrow keys to move between assignments. Use `b` or `Escape` to back out of the current review surface when needed. ![Review Interface](/images/human-review/labeling.gif) **Step: Check progress** Switch to the [Projects tab](https://app.coval.dev/review/projects) to see overall project completion, per-assignee progress, and annotation statistics. **Step: Improve your Metric** ![Metric Details](/images/human-review/testing.gif) 1. Navigate to your metric in the Metrics tab. 2. Check the agreement shown on your metric to decide whether it needs improvement. 3. Draft a new version of your metric in the prompt box. 4. Click "Test Metric" to open the testing panel. 5. View the results of your experiment. For the broader app-wide model, see the [Keyboard Navigation guide](/guides/keyboard-navigation). --- ## Conversations Source: https://docs.coval.ai/concepts/conversations/overview Conversations let you run evaluations on your live calls. ## Understanding Coval Conversations By pushing your post-call transcript to Coval (transcript only or including audio), you can run the same metrics on production calls that you run on simulations. The goal is to not only test your agent pre-production but also to observe and evaluate how your agent behaves in production. ### Audio File Requirements When uploading audio files to conversations: - **Stereo (recommended)**: Channel 0 (left) = Agent, Channel 1 (right) = User. Roles are assigned deterministically from channel position. - **Mono**: Also supported. Speaker roles are inferred from transcript content via an LLM, so mapping is typically accurate but less reliable than stereo channel-based mapping.
> **Info:** **Features specific to Conversations:** > > - **Default Metrics**: Define your set of default metrics to run on all incoming transcripts > - **Metric Rules**: Add metrics conditionally based on results or metadata keys > - **Add to Test Sets**: Convert production issues into regression tests > - **[OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry#tracing-for-conversations)**: Send trace data from your agent alongside conversation submissions for detailed performance analysis — via the API, the OpenTelemetry SDK, or directly from the Upload to Conversations dialog ## Rerunning Metrics on Historical Calls If you change a metric formulation or add a new metric, you can retroactively apply it to historical conversations without re-ingesting transcripts. **How to rerun metrics from the conversations table:** 1. Open the **Conversations** page. 2. Click **Select Rows** along the top menu bar under Conversations and use the checkboxes on the left to select one or more calls. 3. Click the **Rerun Metrics** button in the action bar. 4. In the modal, select the metrics you want to run, then confirm. You'll see a toast confirming the launch. [Video: Rerun Metrics on Historical Calls walkthrough](https://www.loom.com/embed/59674b1a5b68468380deb9d780a03814) **Limits:** - You can rerun metrics on up to **500 calls** per batch. If you select more than 500, the button shows a count and displays a toast asking you to reduce the selection. - Only calls with an existing evaluation output are eligible. Calls without one are automatically filtered out. > **Note:** Rerunning metrics does not re-ingest or re-simulate the call. It re-evaluates the existing transcript and outputs against the selected metrics. Depending on the number of calls and metrics selected, this may take a few moments to complete. 
--- ## Async audio attach Source: https://docs.coval.ai/concepts/conversations/async-audio-attach Submit transcripts immediately at call end and attach audio when the recording URL finalizes — designed for telephony stacks (Twilio, AWS Connect, Vonage) where the recording URL is asynchronously generated. ## Overview Most telephony stacks finalize the recording URL **after** the call has ended — typically 30 to 90 seconds later, well past the point where many agent containers have been recycled or scaled in. That makes it unreliable to submit transcript and audio together in a single, atomic post-call call. The solution is a two-call pattern. At call end, you `POST /v1/conversations:submit` with the transcript and metadata. You receive a `conversation_id` back synchronously. Once your telephony platform tells you the recording URL is finalized, you `PATCH /v1/conversations/{conversation_id}` to attach it. The benefit is that text-only metrics — anything derived from the transcript — start running the moment the conversation is submitted. Audio-derived metrics run as a second wave, once the recording is attached. ## The two-call flow **Step: Submit transcript at call end** `POST /v1/conversations:submit` accepts the transcript and returns a `conversation_id` synchronously. Text-only metrics begin running immediately. The conversation enters `IN_PROGRESS` status. 
```bash curl -X POST https://api.coval.dev/v1/conversations:submit \ -H "x-api-key: $COVAL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "transcript": [ {"role": "agent", "content": "Hello, how can I help you today?"}, {"role": "user", "content": "I need to check my order status."} ], "agent_id": "your-agent-id", "external_conversation_id": "CA1234567890abcdef", "occurred_at": "2026-05-13T18:42:00Z", "metadata": { "channel": "twilio-pstn", "from_number": "+14155550100" } }' ``` Response: ```json { "conversation_id": "conv_01HXYZ...", "status": "IN_PROGRESS", "has_audio": false } ``` Store the returned `conversation_id`. You will need it for the PATCH call. **Step: Attach the recording URL when ready** Once the recording URL finalizes (your telephony stack will deliver this via a webhook or callback), `PATCH /v1/conversations/{conversation_id}` with `audio_url`. Audio-derived metrics now run as a second wave. ```bash curl -X PATCH "https://api.coval.dev/v1/conversations/$CONVERSATION_ID" \ -H "x-api-key: $COVAL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "audio_url": "https://api.twilio.com/2010-04-01/Accounts/AC.../Recordings/RE....wav" }' ``` Response: ```json { "conversation_id": "conv_01HXYZ...", "status": "IN_PROGRESS", "has_audio": true } ``` See the [full PATCH conversation API reference](/api-reference/v1/conversations/conversations/attach-audio-to-a-conversation) for all supported fields. You can also attach inline base64 audio via the `audio` field rather than a remote URL. **Show full submit request body** ```json { "transcript": [ {"role": "agent", "content": "Hello, how can I help you today?"}, {"role": "user", "content": "I need to check my order status."}, {"role": "agent", "content": "I can help with that. 
Can I have your order number?"}, {"role": "user", "content": "It is 12345."} ], "agent_id": "your-agent-id", "external_conversation_id": "CA1234567890abcdef", "occurred_at": "2026-05-13T18:42:00Z", "metadata": { "channel": "twilio-pstn", "from_number": "+14155550100", "to_number": "+18005551212", "tenant_id": "acme-corp" } } ``` ## Webhook timing — two waves > **Note:** **Configure your webhook consumer to expect two waves.** > > - **Wave 1 (text-only metrics)** — fires within seconds of the `POST /v1/conversations:submit` returning. Carries everything derived from the transcript. > - **Wave 2 (audio-derived metrics)** — fires only after the `PATCH /v1/conversations/{conversation_id}` lands and the recording is processed. Carries metrics like STT WER, audio sentiment, and diarization-dependent scores. > > Dedupe by `external_conversation_id` if you store webhook payloads downstream — the same conversation will be referenced by both waves. ## Idempotency Audio can be attached **once** per conversation. The PATCH call is single-shot. - A second PATCH to the same `conversation_id` returns `409 ALREADY_EXISTS`. - A conversation submitted with audio inline via `POST /v1/conversations:submit` (passing `audio_url` or `audio` in the initial body) already has audio attached and cannot be PATCHed. - To re-run with a corrected recording, submit a new conversation with a new `external_conversation_id`. > **Warning:** If your recording-status webhook fires more than once for the same call — some providers retry on transient receiver errors — your handler must be idempotent against the 409. Treat the first 200 as success and ignore subsequent 409s. 
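
The warning above can be sketched as a small handler that treats a repeat PATCH as harmless. This is an illustrative Python sketch, not an official client: it uses the endpoint and `x-api-key` header shown earlier on this page, while `classify_attach_status`, `attach_audio`, and the returned labels are hypothetical names. How to handle a `"failed"` result (retry, alert, drop) is left to the caller.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://api.coval.dev/v1"


def classify_attach_status(status: int) -> str:
    """Map an HTTP status from the PATCH call to an outcome label."""
    if 200 <= status < 300:
        return "attached"
    if status == 409:
        # Audio was already attached by an earlier delivery; safe to ignore.
        return "already_attached"
    return "failed"


def attach_audio(conversation_id: str, audio_url: str, api_key: str) -> str:
    """Send the PATCH once and report the outcome without raising on a 409."""
    request = urllib.request.Request(
        f"{API_BASE}/conversations/{conversation_id}",
        data=json.dumps({"audio_url": audio_url}).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="PATCH",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return classify_attach_status(response.status)
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses, so the 409 arrives here.
        return classify_attach_status(error.code)
```

A recording-status webhook handler can then acknowledge the provider's callback for both `"attached"` and `"already_attached"`, so provider retries do not surface as errors.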
## When you don't need this > **Tip:** **Skip the PATCH step if either of the following applies:** > > - **You control the media path.** If your stack hands you the finalized recording URL or bytes at the moment the call ends — for example, you run your own media server and write the file before the agent disconnects — just call `POST /v1/conversations:submit` with `audio_url` (or `audio`) in the same payload. There is nothing to PATCH later. > - **You only need transcript-derived metrics.** Submit the transcript with `POST /v1/conversations:submit` and stop there. The conversation will be evaluated against any metric that does not require audio, and `has_audio` will remain `false`. ## See also - [Conversations overview](/concepts/conversations/overview) — the parent concept page - [Submit a conversation for evaluation](/api-reference/v1/conversations/conversations/submit-conversation-for-evaluation) — the `POST /v1/conversations:submit` API reference - [Attach audio to a conversation](/api-reference/v1/conversations/conversations/attach-audio-to-a-conversation) — the `PATCH /v1/conversations/{conversation_id}` API reference - [Observability guide](/guides/observability) — broader walkthrough of production-call evaluation - [Trace Search](/concepts/conversations/trace-search) — search across traced conversations once they are submitted --- ## Trace Search Source: https://docs.coval.ai/concepts/conversations/trace-search Search across all traced calls in your organization with natural language queries, structured filters, and Transition Hotspots analysis. [Video: Cross-Call Trace Search walkthrough](https://www.loom.com/embed/c1d437ec33ef4555b38c8b44602b348d) ## Overview Trace Search lets you query across all traced calls in your organization — both simulations and conversations — from a single page. 
Instead of inspecting traces one call at a time, you can search by span name, provider, status, duration, attributes, or plain English to find patterns across your entire call history. Access Trace Search from the **Trace Search** button on the Conversations page, or from the left sidebar under *Observability*. ## Natural Language Search Type a plain English query into the search bar and Trace Search will interpret it into structured filters automatically. For example: - `"slowest 10 LLM calls in the last 24 hours"` - `"errors from NormalAgentName agent in the last two weeks"` - `"TTS calls longer than 2 seconds"` The NL search supports sort keywords (`slowest`, `fastest`, `newest`, `oldest`), result limits (`top 10`, `five slowest`), and time expressions. ## Structured Filters You can also build precise queries using the structured filter controls: | Filter | Description | |--------|-------------| | **Time Range** | Preset options (1h, 3h, 12h, 24h, 3 days, 1 week, 2 weeks, 1 month) or custom date/time pickers | | **Span Name** | Filter by span type (e.g. `llm`, `tts`, `stt`, `llm_tool_call`) with autocomplete from your actual trace data | | **Provider** | Filter by service provider with autocomplete suggestions | | **Status** | Filter by span status: `ERROR`, `OK`, or `UNSET` | | **Duration** | Set minimum and/or maximum duration in milliseconds | | **Attributes** | Add one or more attribute filters with operators: Contains, Equals, Exists, `>`, `>=`, `<`, `<=` | | **Agent** | Scope results to a specific agent | | **Test Set** | Scope results to a specific test set | Multiple attribute filters stack with AND logic. Autocomplete suggestions are drawn from your organization's actual trace data in ClickHouse. 
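
To make the AND semantics concrete, here is a minimal Python sketch of how stacked attribute filters narrow a result set. It illustrates the logic only and is not how Trace Search is implemented. The attribute names (`provider`, `ttfb_ms`), the `span_matches` helper, and the string-based handling of `contains` are assumptions made for the example.

```python
import operator

# Assumed local equivalents of the operators listed in the table above.
OPERATORS = {
    "contains": lambda actual, expected: str(expected) in str(actual),
    "equals": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def span_matches(attributes: dict, filters: list) -> bool:
    """Return True only if every (key, operator, expected) filter passes."""
    for key, op, expected in filters:
        if key not in attributes:
            return False  # a missing attribute fails every operator, including exists
        if op == "exists":
            continue
        if not OPERATORS[op](attributes[key], expected):
            return False
    return True


# Hypothetical span attributes for illustration.
span = {"provider": "openai", "ttfb_ms": 850}
print(span_matches(span, [("provider", "equals", "openai"), ("ttfb_ms", ">", 500)]))
```

Because every filter must pass, adding a filter can only shrink the result set; if a search returns nothing, removing filters one at a time is a quick way to find the one that excludes everything.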
## Results Search results show as cards with: - Call source (Simulation or Conversation) and timestamp - Overall span status with color coding - Total span count and root span duration - Provider and service metadata - Direct link to the full trace viewer for each result An aggregate stats bar at the top shows the total number of matching calls and the error rate across results. ## Transition Hotspots The **Transition Hotspots** tab shows a failure matrix computed across your filtered search results. This is the same heatmap visualization available on individual runs (see [Transition Hotspots](/concepts/simulations/traces/opentelemetry#transition-hotspots)), but applied across all calls matching your current search filters. Use this to identify systemic failure patterns that span multiple runs or conversations — for example, a specific state transition that consistently fails across your production traffic. --- ## API Keys Source: https://docs.coval.ai/guides/api-keys Create and manage API keys to authenticate with the Coval API API keys are used to authenticate requests to the Coval REST API and CLI. You can create multiple keys per organization, each with its own permissions and lifecycle. ## Creating an API Key **Step: Open Settings** Navigate to **Settings** in the Coval dashboard sidebar, then select the **API Keys** tab. **Step: Click Create Key** Click the **Create Key** button in the top right corner. **Step: Configure your key** Fill in the following fields: | Field | Required | Description | |-------|----------|-------------| | **Name** | Yes | A descriptive name for the key (e.g., "CI/CD Pipeline", "Local Development") | | **Description** | No | Optional notes about the key's purpose | | **Key Type** | Yes | **Service** for automated systems, **User** for individual access | | **Permissions** | Yes | **Full Access** or scoped to specific resources | **Step: Save your key** Click **Create Key**. Your API key will be displayed once. 
> **Warning:** Copy the key immediately. You will not be able to view the full key again after closing the dialog. ## Using Your API Key Include the key in the `X-API-Key` header with every request: ```bash curl https://api.coval.dev/v1/agents \ -H "X-API-Key: your_api_key" ``` Or set it as an environment variable for the CLI: ```bash export COVAL_API_KEY=your_api_key coval agents list ``` ## Permissions By default, keys are created with **Full Access** to all resources. You can restrict a key to specific scopes using the permissions picker. Available permission scopes: | Resource | Scopes | |----------|--------| | Runs | `runs:read`, `runs:write` | | Agents | `agents:read`, `agents:write` | | Conversations | `conversations:read`, `conversations:submit`, `conversations:delete` | | Metrics | `metrics:read`, `metrics:write`, `metrics:delete` | | Test Sets | `test-sets:read`, `test-sets:write` | | Test Cases | `test-cases:read`, `test-cases:write` | | Personas | `personas:read`, `personas:write`, `personas:delete` | | Simulations | `simulations:read`, `simulations:write` | | Traces | `traces:read`, `traces:write` | | Dashboards | `dashboards:read`, `dashboards:write`, `dashboards:delete` | | Scheduled Runs | `scheduled-runs:read`, `scheduled-runs:write`, `scheduled-runs:delete` | | Run Templates | `run-templates:read`, `run-templates:write`, `run-templates:delete` | | API Keys | `api-keys:read`, `api-keys:write`, `api-keys:delete` | > **Tip:** Use the **preset buttons** to quickly configure common permission sets like **Read Only**, **Run Evaluations**, or **Upload Conversations**. ## Managing Key Status Each key has a lifecycle status that controls whether it can authenticate requests. | Status | Description | |--------|-------------| | **Active** | The key is working and can authenticate requests | | **Suspended** | Temporarily disabled. Can be reactivated | | **Revoked** | Permanently disabled. 
Cannot be reactivated | ### Suspending a Key To temporarily disable a key, click the **actions menu** (three dots) on the key row and select **Suspend**. Suspended keys can be reactivated at any time. ### Revoking a Key To permanently disable a key, select **Revoke** from the actions menu. You must provide a reason. Revoked keys cannot be reactivated. > **Warning:** Revoking a key is permanent. Any systems using the key will immediately lose access. ### Deleting a Key To remove a key entirely, select **Delete** from the actions menu. This permanently removes the key record from your organization. ## Filtering Keys Use the status tabs above the table to filter keys by their current status: - **All** — Show all keys - **Active** — Only active keys - **Suspended** — Only suspended keys - **Revoked** — Only revoked keys ## Best Practices Avoid full access keys in production. Scope each key to only the permissions it needs. Create new keys and revoke old ones periodically, especially for production systems. Name keys after their purpose (e.g., "GitHub Actions CI", "Staging Environment") so you can identify them later. Promptly revoke keys that are no longer in use to minimize your attack surface. ## Next Steps - [API Reference](/api-reference/v1/introduction): Explore the full API documentation - [CLI Installation](/cli/installation): Install the Coval CLI and authenticate with your key - [CLI API Keys Commands](/cli/api-keys): Manage API keys programmatically from the command line - [GitHub Actions](/getting-started/github-actions-tutorial): Use API keys in your CI/CD pipeline --- ## Keyboard Navigation Source: https://docs.coval.ai/guides/keyboard-navigation Use Coval efficiently with the keyboard across lists, detail views, and the command palette ## Overview Coval supports a shared keyboard navigation model across major list and detail surfaces. 
The fastest way to think about it is:

- Vertical keys move through rows in tables and sidebars
- Horizontal keys move between sibling detail items
- `Enter` opens the current row
- `Escape` backs out of the current surface
- `Cmd+K` opens the command palette

> **Tip:** Not every page supports every shortcut, but list pages and result/detail pages now follow the same general model.

## Global Shortcuts

### Command Palette

- `Cmd+K` opens the command palette
- Use it to jump to pages, run contextual page actions, and inspect utility commands such as debug information

### Search

- `/` focuses the primary search or filter input on supported pages

### Dismiss and Back

- `Escape` closes dialogs, clears the current keyboard selection, or backs out of the current surface
- `b` goes back within the current surface on pages that support a local “back to…” action

## Row Navigation

On table and list pages, the keyboard-active row uses the same highlight model as mouse hover.

Use any of these keys to move through rows:

- `j` / `k`
- `w` / `s`
- `ArrowUp` / `ArrowDown`

Then use:

- `Enter` to open the selected row

This applies to major list surfaces such as Runs, Conversations, Personas, Templates, Test Sets, Metrics, Targets, and similar secondary sidebars.

## Detail Navigation

On result and detail pages that belong to a sequence, use horizontal navigation to move between sibling items:

- `h` / `l`
- `a` / `d`
- `ArrowLeft` / `ArrowRight`

Examples include simulation result viewers and other detail pages with previous and next controls in the upper-right header.
## Working With Searchable Lists

On searchable lists:

- `/` focuses the search field
- `Escape` clears the search when the field has text
- Press `Escape` again to continue unwinding out of the current surface

## Command Palette Hierarchy

The command palette is organized from most contextual to most general:

- `Focused Item`
- `Current Surface`
- `Current Page`
- `Go To`
- `Utility`

This makes it easier to find “what can I do right here?” before broader app navigation.

## Practical Flow

Here is a common pattern for keyboard-only work:

1. Use `Cmd+K` to jump to a page
2. Use `j` / `k` to move through rows
3. Press `Enter` to open the selected item
4. Use `h` / `l` to cycle through sibling results or detail items
5. Use `b` or `Escape` to back out

---

## Human Review via API

Source: https://docs.coval.ai/guides/human-review-api

Step-by-step guide to managing human review projects and annotations programmatically

## Overview

The Coval Reviews API lets you programmatically create review projects, assign reviewers, and submit ground-truth annotations. This is useful for integrating human review into CI/CD pipelines, bulk-labeling workflows, or custom review dashboards.

> **Info:** All requests require an `X-API-Key` header. See the [API Keys guide](/guides/api-keys) for setup.

## Key Concepts

- **Review Projects** group simulations, metrics, and assignees together. Creating a project auto-generates annotations for every (simulation, metric, assignee) combination.
- **Review Annotations** are individual review tasks. Each annotation links a simulation output to a metric and an assignee. Providing a ground-truth value auto-completes the annotation.

> **Tip:** Using Claude Code? We have [skills to support human review](https://github.com/coval-ai/coval-external-skills/tree/main/skills/human-review) in your workflow.

## Step-by-Step: Create and Complete a Review Project

**Step: Create a review project**

Link your simulations, metrics, and assignees into a project.
This auto-generates one annotation per (simulation, metric, assignee) combination.

> **Info:** **Finding your IDs:** Retrieve metric IDs via [`GET /v1/metrics`](/api-reference/v1/metrics/metrics/list-metrics) and simulation IDs via [`GET /v1/simulations`](/api-reference/v1/simulations/simulations/list-simulations). Both endpoints return an `id` field for each resource.

```bash
curl -X POST https://api.coval.dev/v1/review-projects \
  -H "X-API-Key: " \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "Q1 Voice Agent Review",
    "description": "Review accuracy and latency for Q1 voice simulations",
    "assignees": ["alice@company.com", "bob@company.com"],
    "linked_simulation_ids": ["sim-output-001", "sim-output-002"],
    "linked_metric_ids": ["metric-accuracy", "metric-latency"],
    "project_type": "PROJECT_COLLABORATIVE",
    "notifications": true
  }'
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `display_name` | string | **Yes** | Human-readable project name |
| `assignees` | string[] | **Yes** | Reviewer email addresses (at least one) |
| `linked_simulation_ids` | string[] | **Yes** | Simulation output IDs to review |
| `linked_metric_ids` | string[] | **Yes** | Metric IDs to evaluate |
| `description` | string | No | Optional project description |
| `project_type` | string | No | `PROJECT_COLLABORATIVE` (shared queue) or `PROJECT_INDIVIDUAL` (per-reviewer queues). Defaults to `PROJECT_INDIVIDUAL` |
| `notifications` | boolean | No | Enable email notifications for assignees. Defaults to `true` |

> **Tip:** Use `PROJECT_COLLABORATIVE` when you want one ground-truth label per conversation. Use `PROJECT_INDIVIDUAL` to measure inter-annotator agreement.

**Step: List annotations for the project**

After creating a project, annotations are auto-generated. List them to see what needs to be reviewed.
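The list requests in this step pass the filter expression as a percent-encoded query parameter. As a minimal sketch of how such a URL could be assembled in a script, the Python below joins `field="value"` conditions with `AND` and encodes the result. The helper names and the project ID `proj-123` are illustrative assumptions, not part of the Coval API or any SDK; only the endpoint URL, the `filter` and `page_size` parameters, and the one-annotation-per-combination rule come from this guide.

```python
from urllib.parse import quote

# Endpoint taken from the curl examples in this guide.
ANNOTATIONS_URL = "https://api.coval.dev/v1/review-annotations"


def build_filter(**conditions: str) -> str:
    """Join field="value" pairs with AND (hypothetical helper).

    Values are inserted as-is, so this sketch assumes they contain no
    double quotes.
    """
    return " AND ".join(f'{field}="{value}"' for field, value in conditions.items())


def annotations_url(filter_expr: str, page_size: int = 50) -> str:
    """Return the list URL with the filter expression percent-encoded."""
    # safe="" also encodes '=' and '"', so the whole expression stays one query value.
    return f"{ANNOTATIONS_URL}?filter={quote(filter_expr, safe='')}&page_size={page_size}"


def expected_annotation_count(simulations: int, metrics: int, assignees: int) -> int:
    """One annotation per (simulation, metric, assignee) combination."""
    return simulations * metrics * assignees


if __name__ == "__main__":
    # "proj-123" is a made-up project ID used only for illustration.
    expr = build_filter(project_id="proj-123", completion_status="PENDING")
    print(annotations_url(expr))
    # The project created above links 2 simulations, 2 metrics, and 2 assignees.
    print(expected_annotation_count(2, 2, 2))
```

Building the URL in code avoids hand-encoding `=`, quotes, and spaces, which is the easiest place to make a mistake in these requests.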
```bash
curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22" \
  -H "X-API-Key: "
```

Filter annotations by status to find pending work:

```bash
curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22%20AND%20completion_status%3D%22PENDING%22" \
  -H "X-API-Key: "
```

| Parameter | Description |
|-----------|-------------|
| `filter` | AIP-160 filter expression. Supports `simulation_output_id`, `metric_id`, `assignee`, `status`, `completion_status`, `project_id` |
| `page_size` | Results per page (1–100, default 50) |
| `page_token` | Pagination token from previous response |
| `order_by` | Sort field with optional `-` prefix for descending. Valid: `create_time`, `update_time`, `assignee`, `priority` |

**Step: Submit ground-truth values**

Update each annotation with the reviewer's ground-truth assessment. Providing a ground-truth value automatically sets `completion_status` to `COMPLETED`.

**For numeric metrics** (e.g., latency, numerical scores):

```bash
curl -X PATCH https://api.coval.dev/v1/review-annotations/ \
  -H "X-API-Key: " \
  -H "Content-Type: application/json" \
  -d '{
    "ground_truth_float_value": 0.85,
    "reviewer_notes": "Agent responded accurately but with slight delay"
  }'
```

**For string/categorical metrics** (e.g., binary pass/fail, sentiment):

```bash
curl -X PATCH https://api.coval.dev/v1/review-annotations/ \
  -H "X-API-Key: " \
  -H "Content-Type: application/json" \
  -d '{
    "ground_truth_string_value": "PASS",
    "reviewer_notes": "Correct greeting and resolution"
  }'
```

| Field | Type | Description |
|-------|------|-------------|
| `ground_truth_float_value` | number | Ground-truth numeric value (auto-completes annotation) |
| `ground_truth_string_value` | string | Ground-truth string value (auto-completes annotation) |
| `ground_truth_subvalues_by_timestamp` | array | Ground-truth subvalues keyed by timestamp (for audio region or per-segment metrics) |
| `reviewer_notes` | string | Free-text reviewer notes |
| `assignee` | string | Reassign to a different reviewer |
| `priority` | string | `PRIORITY_PRIMARY` or `PRIORITY_STANDARD` |

**Step: Track project progress**

Re-fetch the project and its annotations to check completion status.

```bash
# Get project details
curl https://api.coval.dev/v1/review-projects/ \
  -H "X-API-Key: "

# Count completed annotations
curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22%20AND%20completion_status%3D%22COMPLETED%22&page_size=1" \
  -H "X-API-Key: "
```

**Step: Use results to improve metrics**

Once annotations are complete, navigate to your metric in the [Coval Dashboard](https://app.coval.dev) to see agreement scores between human labels and AI-generated values. Use the metrics studio to draft and test improved metric prompts.

## Managing Review Projects

### List Projects

```bash
curl "https://api.coval.dev/v1/review-projects?order_by=-create_time&page_size=10" \
  -H "X-API-Key: "
```

### Update a Project

Add new assignees or link additional simulations:

```bash
curl -X PATCH https://api.coval.dev/v1/review-projects/ \
  -H "X-API-Key: " \
  -H "Content-Type: application/json" \
  -d '{
    "assignees": ["alice@company.com", "bob@company.com", "charlie@company.com"],
    "linked_simulation_ids": ["sim-output-001", "sim-output-002", "sim-output-003"]
  }'
```

### Delete a Project

```bash
curl -X DELETE https://api.coval.dev/v1/review-projects/ \
  -H "X-API-Key: "
```

## Managing Review Annotations

### Create a Standalone Annotation

You can create annotations outside of a project for ad-hoc reviews:

```bash
curl -X POST https://api.coval.dev/v1/review-annotations \
  -H "X-API-Key: " \
  -H "Content-Type: application/json" \
  -d '{
    "simulation_output_id": "sim-output-abc123",
    "metric_id": "metric-accuracy-001",
    "assignee": "reviewer@company.com"
  }'
```

### Get a Single Annotation

```bash
curl https://api.coval.dev/v1/review-annotations/ \
  -H "X-API-Key: "
```

### Delete an Annotation

```bash
curl -X DELETE https://api.coval.dev/v1/review-annotations/ \
  -H "X-API-Key: "
```

Reviewers can also complete their assignments directly in the [Human Review platform](/guides/improving-metrics-with-human-review).

---

## Testing Across Audio Qualities

Source: https://docs.coval.ai/guides/testing-across-audio-qualities

Compare how a voice agent performs with different voices, speaking styles, volume, interruptions, and background noise.

Use audio-quality testing when you want to know whether a voice agent still succeeds under real-world conditions. Run the same agent, test set, and metrics across multiple audio-quality scenarios, then compare the results in a multi-run report.

This workflow is for voice simulations. Chat simulations do not exercise speech recognition, text-to-speech, audio timing, or background-noise handling.

The goal is not to produce a leaderboard. The goal is to find which real-world audio conditions change outcomes, then decide whether the next fix belongs in your agent prompt, tool handling, speech recognition setup, generated voice setup, tracing, or audio-scenario coverage.

## Use An AI Agent

If you use [Coval Agent Skills](/agents/skills), an AI agent can help with both the workflow and the follow-up analysis.

Use the [run-audio-quality-testing skill](https://github.com/coval-ai/coval-external-skills/tree/main/skills/runs/run-audio-quality-testing) to create the audio-scenario runs and multi-run report. After the report exists, use the [analyze-audio-quality-report skill](https://github.com/coval-ai/coval-external-skills/tree/main/skills/reports/analyze-audio-quality-report) to turn the report into recommended agent fixes.
To have an AI agent run this workflow for you, paste this prompt into your coding agent or local LLM:

```text
Use the Coval `run-audio-quality-testing` skill:
https://github.com/coval-ai/coval-external-skills/tree/main/skills/runs/run-audio-quality-testing

I want to test this voice agent across audio-quality scenarios:

Use this test set:

Run the same sampled cases, metrics, and agent configuration across the built-in audio-quality personas: Standard Customer, Impatient Customer, Confused Customer, Interruptive Speaker, Super Fast Speaker, High Background Noise Speaker, and Low Volume Speaker.

Create the runs, wait for them to finish, then create a multi-run report grouped by Persona so I can compare each audio-quality scenario against Standard Customer.

After the report exists, summarize the largest regressions and include the report URL plus representative simulation links.
```

## 1. Choose A Voice Agent

Pick one voice agent to test. For the cleanest comparison, keep the agent configuration fixed across all runs.

For agents that emit traces, include trace-based timing metrics such as time to first byte or provider latency. If your agent is not sending traces yet, set up [OpenTelemetry traces](/concepts/simulations/traces/opentelemetry) so Coval can measure agent-side timing and tool behavior alongside the recording. You can also have your coding agent help instrument traces using the [Coval tracing skills](/concepts/simulations/traces/tracing-skills).

## 2. Choose The Audio-Quality Personas

Select Standard Customer plus the built-in audio-quality personas:

| Coval Persona | What It Tests |
| --- | --- |
| Standard Customer | Baseline clean-call behavior |
| Impatient Customer | Short answers and lower patience |
| Confused Customer | Clarification handling |
| Interruptive Speaker | Overlap and interruption handling |
| Super Fast Speaker | Fast speech |
| High Background Noise Speaker | Background noise robustness |
| Low Volume Speaker | Quiet speaker audio |

Use the same test set for every audio-quality persona. If you subsample a test set, keep the same sampled cases across personas so differences come from the audio condition, not case selection.

## 3. Select Metrics

Use metrics that separate task success from audio-path behavior:

| Goal | Useful Metrics |
| --- | --- |
| Task outcome | Composite evaluation, task-completion LLM judges, or scenario-specific pass/fail metrics |
| Responsiveness | Latency, time to first audio, trace TTFB, or provider response-time metrics |
| Speech recognition | STT Word Error Rate for traced agents, or Transcription Error for WER from recorded conversation audio without requiring agent traces |
| Generated voice quality | Voice Quality and Speech Artifact Anomaly for broad generated-speech quality; Clipping Artifact Detection, Dropout Artifact Detection, Codec Artifact Detection, Loop Detection, Phoneme Stretch, Syllable Rate, Timbre Drift, or Pause Analysis for specific failure modes |
| Conversation flow | Turn count, audio duration, early termination, abnormally short or long calls, interruption rate, silence, and turn-level timing metrics |

Do not use Percent Audio Above 300Hz as a perceived audio-quality score. It measures pitch distribution, not listener-rated quality.

## 4. Launch The Runs

Launch voice simulations with:

- one agent
- one test set
- the built-in audio-quality personas listed above
- the same metrics for every audio-quality scenario

Coval creates separate runs for each selected persona. In this workflow, each persona represents one audio-quality scenario. This keeps each scenario comparable while still letting you analyze the set together.

## 5. Compare Audio Quality Scenarios

After the runs finish:

1. Open the runs list.
2. Select the completed runs from the audio-quality persona set.
3. Create a multi-run report.
4. Set **Compare by** to **Persona**.
5. Use the grouped view to compare aggregate scores and latency across audio-quality scenarios.

Look for regressions that appear only under specific audio conditions. For example, a high task-success baseline with worse results for the High Background Noise Speaker suggests an audio-path robustness issue rather than a general agent-quality issue.

Also scan for `UNKNOWN`, missing, or unscored metric results. Under heavy audio stress, a judge may be unable to evaluate the conversation because the call ended early, the transcript is too sparse, or the interaction became too anomalous. Treat that as a signal to inspect the recording, not just as missing data.

## 6. Spot-Check Simulations

> **Warning:** Do not stop at pass/fail metric columns. A scenario can pass binary task metrics while the recording shows a broken or materially different experience. Treat very short calls, very long calls, latency spikes, and `UNKNOWN` or unscored metrics as spot-check triggers.

Open representative completed simulations from each audio-quality scenario, especially the lowest-scoring and most surprising rows from the grouped report. Listen to the recording and read the transcript to confirm how your agent handled the audio condition.
| Audio Condition | What To Review |
| --- | --- |
| Background noise | Whether your agent confirms important fields, recovers from misheard details, and avoids repeatedly asking for information the speaker already provided. |
| Fast speech | Whether your agent keeps up with compressed turns, asks for clarification when needed, and still reaches the required outcome. |
| Low volume | Whether quiet speaker audio causes missed details, extra confirmation loops, or incorrect task outcomes. |
| Interruptions | Increased interruption rate over longer calls, and whether your agent recovers after overlap instead of losing state or talking past the caller. |

If the listening pass affects a release decision, send representative simulations to [Human Review](/guides/improving-metrics-with-human-review). Use a review project to collect ground-truth labels for questions such as whether your agent captured the required information, recovered after interruptions, handled transcript errors, and completed the task. Use **Collaborative** mode when you want one shared answer per simulation, or **Individual** mode when you want independent reviewer agreement.

## 7. Understand The Results

Set **Compare by** to **Persona** and use the grouped view so each row represents one audio-quality persona. Compare every persona against Standard Customer, then inspect the scenarios whose task success, latency, speech recognition, generated voice quality, or call shape changed the most.
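If you copy per-persona aggregates out of the report, the comparison against Standard Customer can be sketched in a few lines of Python. This is a minimal illustration, not Coval's report logic: the metric names, the scores, and both helper functions are hypothetical, and the numbers are made up for the example.

```python
BASELINE = "Standard Customer"


def deltas_vs_baseline(
    scores: dict[str, dict[str, float]], baseline: str = BASELINE
) -> dict[str, dict[str, float]]:
    """Return metric deltas (persona minus baseline) for every non-baseline persona."""
    base = scores[baseline]
    return {
        persona: {m: round(v - base[m], 4) for m, v in metrics.items() if m in base}
        for persona, metrics in scores.items()
        if persona != baseline
    }


def largest_regressions(
    deltas: dict[str, dict[str, float]],
    metric: str,
    higher_is_better: bool = True,
    top: int = 3,
) -> list[tuple[str, float]]:
    """Rank personas by how much `metric` worsened relative to the baseline."""
    # For higher-is-better metrics the worst delta is the most negative one;
    # for lower-is-better metrics (such as latency) it is the most positive one.
    sign = 1 if higher_is_better else -1
    ranked = sorted(
        ((p, d[metric]) for p, d in deltas.items() if metric in d),
        key=lambda item: sign * item[1],
    )
    return ranked[:top]


if __name__ == "__main__":
    # Made-up aggregate scores, keyed by persona name as shown in the grouped report.
    scores = {
        "Standard Customer": {"task_success": 0.92, "latency_s": 1.1},
        "High Background Noise Speaker": {"task_success": 0.71, "latency_s": 1.4},
        "Low Volume Speaker": {"task_success": 0.80, "latency_s": 1.2},
    }
    deltas = deltas_vs_baseline(scores)
    print(largest_regressions(deltas, "task_success"))
    print(largest_regressions(deltas, "latency_s", higher_is_better=False))
```

Ranking by delta rather than by raw score keeps the focus on what changed under each audio condition, which is the point of holding the agent, test set, and metrics fixed.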
In your analysis, lead with the conclusions that explain what changed:

- the largest audio-quality scenario regressions compared with Standard Customer
- the affected task-success, latency, speech-recognition, or audio metrics
- any `UNKNOWN`, missing, or unscored metric results that point to anomalous conversations
- representative simulation links for the most important regressions and one healthy baseline
- Human Review results or reviewer agreement, if you used manual labels
- the recommended next step from your report analysis, such as prompt changes, tool handling fixes, STT/TTS adjustments, trace setup, or expanded audio-scenario coverage

To have an AI agent produce this analysis from the report, use the [analyze-audio-quality-report skill](https://github.com/coval-ai/coval-external-skills/tree/main/skills/reports/analyze-audio-quality-report):

```text
Use the Coval `analyze-audio-quality-report` skill.

I completed the Testing Across Audio Qualities workflow and created this multi-run report:

Analyze the report by audio-quality scenario against Standard Customer. If the report is grouped by Persona in Coval, interpret those personas as the audio-quality scenarios being tested.

Use metric deltas, UNKNOWN or unscored metrics, call-shape changes such as turn count and audio duration, representative simulations, recordings, transcripts, traces if present, and Human Review labels if present.

Tell me:
- which audio-quality scenarios regressed most or became anomalous, and on which metrics
- what likely failed in my agent
- the recommended next fix, such as prompt changes, tool handling fixes, STT/TTS adjustments, trace setup, or expanded audio-scenario coverage
- which audio-quality scenarios, test cases, and metrics I should rerun after the fix to confirm improvement

Keep agent-side fixes separate from Coval metric or test setup changes. Do not guess from the aggregate table alone; inspect representative simulations when they are available.
```

---

## Conversations

Source: https://docs.coval.ai/guides/observability

Guide to uploading transcripts and audio to Coval for conversation evaluation.

## Overview

This guide provides comprehensive documentation for uploading transcripts to Coval for conversation evaluation. Coval supports multiple formats.

### Video Tutorial

[Video: Coval Observability Tutorial](https://www.youtube.com/embed/ocjxU6Eevyo)

### Required Fields

> **Info:** **Essential transcript fields:**
>
> - **`role`**: Must be one of `"user"`, `"assistant"`, `"system"`, or `"tool"`
> - **`content`**: The actual message content (string)
> - **`beginning`**: Index position in the conversation (number)
> - **`end`**: End position in the conversation (number)

### Optional Fields

- **`start_timestamp`**: Unix timestamp for when the message started (number)
- **`end_timestamp`**: Unix timestamp for when the message ended (number)
- **`error`**: Error message if transcription failed (string)
- **`transcriptionError`**: Boolean flag indicating transcription error
- **`name`**: Name identifier for the message (string)

## Supported Formats

**OpenAI Format (Recommended):**

The system primarily expects transcripts in OpenAI's chat completion format:

```json
[
  {
    "role": "user",
    "content": "Hello, I would like assistance.",
    "start_time": 0.0,
    "end_time": 3.2
  },
  {
    "role": "assistant",
    "content": "Of course! How can I help you today?",
    "start_time": 3.2,
    "end_time": 6.8
  },
  {
    "role": "user",
    "content": "I'm having an issue with my recent order.",
    "start_time": 6.8,
    "end_time": 10.5
  },
  {
    "role": "assistant",
    "content": "I'm sorry to hear that. Could you provide me with your order number?",
    "start_time": 10.5,
    "end_time": 14.2
  }
]
```

**Extended Studio Format:**

For detailed transcripts with timing information:

```json
[
  {
    "role": "user",
    "content": "Hello, I would like assistance.",
    "start_time": 0.0,
    "end_time": 3.2,
    "beginning": 0,
    "end": 1,
    "start_timestamp": 1640995200,
    "end_timestamp": 1640995210
  },
  {
    "role": "assistant",
    "content": "Of course! How can I help you today?",
    "start_time": 3.2,
    "end_time": 6.8,
    "beginning": 1,
    "end": 2,
    "start_timestamp": 1640995210,
    "end_timestamp": 1640995220
  }
]
```

**Raw Text Format:**

The system can also accept raw text, which will be automatically converted:

```
User: Hello, I would like assistance.
Assistant: Of course! How can I help you today?
User: I'm having an issue with my recent order.
Assistant: I'm sorry to hear that. Could you provide me with your order number?
```

## Tool Call Messages

For tool call messages, the `content` field should contain a JSON string that can be parsed to extract tool information.

### Tool Call Content Examples

```json Simple Tool Call
{
  "role": "tool",
  "content": "{\"tool\": \"waiting_on_customer\"}",
  "start_time": 12.0,
  "end_time": 12.5,
  "beginning": 3,
  "end": 4
}
```

```json Tool Call with Arguments
{
  "role": "tool",
  "content": "{\"query\": \"search term\", \"tool\": \"query_knowledge\"}",
  "start_time": 15.2,
  "end_time": 15.8,
  "beginning": 4,
  "end": 5
}
```

```json Standard Tool Call Format
{
  "role": "tool",
  "content": "{\"tool_call\": \"function_name\", \"arguments\": {\"param1\": \"value1\"}}",
  "start_time": 18.5,
  "end_time": 19.1,
  "beginning": 5,
  "end": 6,
  "name": "function_name"
}
```

```json System Role with Tool Call
{
  "role": "system",
  "content": "{\"tool_call\": \"database_query\", \"arguments\": {\"table\": \"users\"}}",
  "start_time": 22.0,
  "end_time": 22.3,
  "beginning": 6,
  "end": 7
}
```

### Alternative Tool Call Formats

The system supports these formats in the `content` field:

1. **Function format**: `{"function": "name", "arguments": {...}}`
2. **Tool format**: `{"tool": "name", ...}` (other fields become arguments)
3. **Custom backend format**: `{tool_call: name, arguments: {...}}`

## Validation Rules

### Content Limits

> **Warning:** **Important limits to keep in mind:**
>
> - **Individual message content**: Maximum 1,000 characters
> - **Total transcript size**: Maximum 40MB
> - **Number of messages**: Maximum 1,000 messages per transcript

### Role Validation

- Only `"user"`, `"assistant"`, `"system"`, and `"tool"` roles are accepted
- Each message must have `role`, `content`, `start_time`, and `end_time` fields
- `start_time` and `end_time` must be float values representing seconds

### Role Normalization

For monitoring and evaluation purposes, roles may be normalized:

- `"system"` messages with tool call content may be treated as `"tool"` for display purposes
- Tool calls in `"system"` role are automatically detected and parsed
- The UI will display tool calls with appropriate icons and formatting regardless of the original role

### Timing Validation

- `beginning` and `end` values should be sequential integers
- `start_timestamp` and `end_timestamp` should be valid Unix timestamps
- If timestamps are provided, `end_timestamp` should be greater than `start_timestamp`

### Audio Requirements

Both stereo and mono audio files are supported. Stereo is recommended when available because speaker roles are assigned deterministically from channel position; mono roles are inferred from transcript content.

**Stereo (recommended):** Upload audio with the agent and user on separate channels. Channel position determines role:

| Channel | Position | Role |
|---------|----------|------|
| Channel 0 | Left | Agent |
| Channel 1 | Right | User |

**Mono:** Upload a single-channel file.
Roles are assigned by classifying the transcript content with an LLM, which is typically accurate but less reliable than channel-based stereo mapping for short or ambiguous conversations.

---

## Scheduled Runs

Source: https://docs.coval.ai/guides/scheduled-runs

Automate recurring evaluations to catch regressions and monitor agent quality over time

## Overview

Scheduled Runs let you run evaluations on a recurring cadence: hourly, daily, weekly, or on a custom interval. They're built on top of [Templates](/concepts/templates/overview), which capture your full evaluation configuration (agent, test set, personas, metrics, and mutations). Each time the schedule fires, a new run is launched automatically with those exact parameters.

Common use cases:

- **Regression detection**: Catch when a new deployment breaks expected behaviors
- **Continuous quality monitoring**: Track metric trends across agent versions
- **Health checks**: Validate your agent is responsive and performing correctly at regular intervals

## Prerequisites

Before creating a scheduled run, you need:

1. **An agent configured** in Coval (see [Agents](/concepts/agents/overview))
2. **A test set** with the conversation scenarios to run (see [Test Sets](/concepts/test-sets/overview))
3. **At least one metric** selected for evaluation (see [Metrics](/concepts/metrics/overview))
4. **A saved template** that ties these together (see [Templates](/concepts/templates/overview))

## Full Setup Flow

**Step: Create a Template**

Navigate to **Templates** in the sidebar and click **New Template**. Configure your evaluation:

- **Agent**: The voice or chat agent to test
- **Test Set**: The conversation scenarios to run
- **Persona(s)**: How the simulated user should behave
- **Iterations**: How many times each test case runs
- **Concurrency**: How many simulations run in parallel
- **Metrics**: Which metrics to evaluate against

Click **Create Template** to save.
This template will be the source of truth for every scheduled run; all parameters are inherited automatically.

**Step: Schedule the Template**

From the **Templates** list, click **Schedule** on your template. This opens the Schedule Evaluation dialog.

Fill in the schedule configuration:

**Name**

Give the scheduled run a descriptive name (e.g., "Nightly Regression – Disputes Flow"). This appears in the Scheduled Runs list and in run history.

**Schedule Type**

Choose between two scheduling modes:

**Interval-based:** Runs fire at a fixed cadence from when the schedule is created. Select a quick preset:

| Preset | Interval |
|--------|----------|
| 15 min | Every 15 minutes |
| 30 min | Every 30 minutes |
| 1 hour | Every hour |
| 6 hours | Every 6 hours |
| 12 hours | Every 12 hours |
| Daily | Once per day |
| Weekly | Once per week |
| Monthly | Every 30 days |

Or set a **Custom Interval** by entering a number and selecting minutes, hours, or days. The minimum is 15 minutes and the maximum is 30 days.

**Time of Day:** Runs fire at a specific time, either daily or on selected days of the week.

- **Time**: Set the hour, minute, and AM/PM. Times are anchored to your local timezone.
- **Recurrence**: Choose **Daily** (every day at that time) or **Weekly** (select specific days).
- For weekly schedules, select which days of the week to run (e.g., Mon–Fri for weekdays only).

Click **Schedule** to create the scheduled run. It activates immediately.

**Step: Monitor Your Scheduled Runs**

Navigate to **Scheduled Runs** in the sidebar to see all your configured schedules.
The list shows:

- **Status**: Active (running on schedule) or Disabled (paused)
- **Name**: The label you gave the scheduled run
- **Agent**: Which agent is being evaluated
- **Schedule**: Human-readable frequency (e.g., "Daily at 9:00 AM", "Every 6 hours")
- **Template**: The template powering this schedule (click to view its configuration)
- **Created**: When the schedule was set up

Use the **search bar** to filter by name or ID, or filter the list by **Active** / **Disabled** status.

Click any row to open the run history for that schedule. You'll see every evaluation it has launched, with pass/fail results for each metric.

## Managing Scheduled Runs

### Enable and Disable

To pause a schedule without deleting it, open the actions menu (`⋮`) on any row and select **Disable**. Re-enable it the same way.

You can also bulk-enable or bulk-disable: check multiple rows, then use the **Enable Selected** or **Disable Selected** buttons in the toolbar that appears.

### Edit a Schedule

To change the name or timing of an existing schedule, open the actions menu and select **Edit Schedule**. You can update the display name, switch between interval and time-of-day modes, or adjust the frequency.

> **Note:** Editing a schedule does not affect the underlying template or evaluation parameters; only the timing changes. To update what gets evaluated (agent, metrics, test cases), edit the template directly.

### Delete a Schedule

Scheduled runs must be **disabled before they can be deleted**. Once disabled, open the actions menu and select **Delete**. This action is permanent.

To delete multiple schedules at once, disable them first, then select them and use **Delete Selected**.

## Viewing Run History

Click any scheduled run to open its detail page.
Here you can see:

- Every run triggered by this schedule
- The pass/fail result for each metric per run
- Trend data showing how metric scores change over time

This is useful for spotting regressions: if a metric score drops across consecutive runs, something likely changed in your agent or its environment.

## Tips

- **Start with daily schedules** during active development. Hourly is better suited for production monitoring where you need fast feedback.
- **Name schedules clearly**: include the agent name and what it tests (e.g., "Hourly – Billing Bot – Core Flows").
- **Use the Template link** in the Scheduled Runs list to quickly verify what configuration is being used before debugging a failing run.
- **Disable rather than delete** schedules you might want to resume. Deleted schedules and their history are gone permanently.

---

## GitHub Actions

Source: https://docs.coval.ai/getting-started/github-actions-tutorial

Launch Coval evaluation runs from GitHub Actions

## Prerequisites

**Step: Get Your API Key**

Navigate to your [Coval dashboard](https://app.coval.dev) and generate an API key from Settings.

**Step: Add GitHub Secret**

1. Go to your repository **Settings > Secrets and variables > Actions**
2. Click **New repository secret**
3. Name: `COVAL_API_KEY`
4. Value: Your Coval API key

**Step: Gather Required IDs**

You'll need the following identifiers:

- **Agent ID** (22 chars): Found in Agents page → Select agent → Copy ID
- **Persona ID** (22 chars): Found in Personas page → Select persona → Copy ID
- **Test Set ID** (8 chars): Found in Test Sets page → Select test set → Copy ID
- **Metric IDs** (22 chars each, optional): Found in Metrics page → Click metric → Copy ID

## Quick Start

### Automatic PR Checks

Create `.github/workflows/coval-eval.yml`:

```yaml
name: Coval Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate-agent:
    runs-on: ubuntu-latest
    steps:
      - name: Run Coval Evaluation
        uses: coval-ai/coval-github-action@v1
        env:
          COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }}
        with:
          agent_id: "gk3jK9mPq2xRt5vW8yZaBc"
          persona_id: "hL4kL0nQr3ySt6vX9zAcDd"
          test_set_id: "aB1cD2eF"
```

### Manual Workflow Dispatch

Create `.github/workflows/manual-eval.yml`:

```yaml
name: Manual Evaluation

on:
  workflow_dispatch:
    inputs:
      agent_id:
        description: "Agent ID (22 characters)"
        required: true
        type: string
      persona_id:
        description: "Persona ID (22 characters)"
        required: true
        type: string
      test_set_id:
        description: "Test Set ID (8 characters)"
        required: true
        type: string

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluation
        uses: coval-ai/coval-github-action@v1
        env:
          COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }}
        with:
          agent_id: ${{ inputs.agent_id }}
          persona_id: ${{ inputs.persona_id }}
          test_set_id: ${{ inputs.test_set_id }}
```

To trigger:

1. Navigate to **Actions** tab
2. Select **Manual Evaluation**
3. Click **Run workflow**
4. Enter your IDs and click **Run workflow**

## Advanced Configuration

### Custom Metrics and Options

```yaml
- name: Advanced Evaluation
  uses: coval-ai/coval-github-action@v1
  env:
    COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }}
  with:
    agent_id: "gk3jK9mPq2xRt5vW8yZaBc"
    persona_id: "hL4kL0nQr3ySt6vX9zAcDd"
    test_set_id: "aB1cD2eF"
    # Specific metrics to evaluate
    metric_ids: '["iM5lM1oRs4zTu7wY0aBdEe", "jN6mN2pSt5aUv8xZ1bCeFf"]'
    # Run each test case 3 times
    iteration_count: 3
    # Run 2 simulations concurrently
    concurrency: 2
    # Custom metadata for tracking
    metadata: '{"campaign": "q4_2025", "env": "staging"}'
```

### Using Outputs

```yaml
- name: Run Evaluation
  id: coval
  uses: coval-ai/coval-github-action@v1
  env:
    COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }}
  with:
    agent_id: "gk3jK9mPq2xRt5vW8yZaBc"
    persona_id: "hL4kL0nQr3ySt6vX9zAcDd"
    test_set_id: "aB1cD2eF"

- name: Post Results
  run: |
    echo "Run ID: ${{ steps.coval.outputs.run_id }}"
    echo "Status: ${{ steps.coval.outputs.status }}"
    echo "View: ${{ steps.coval.outputs.run_url }}"

- name: Comment on PR
  if: github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: '✓ Evaluation complete: ${{ steps.coval.outputs.run_url }}'
      })
```

## Configuration Reference

### Inputs

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `agent_id` | string | Yes | - | Agent to test (22 chars) |
| `persona_id` | string | Yes | - | Simulated persona (22 chars) |
| `test_set_id` | string | Yes | - | Test set with test cases (8 chars) |
| `metric_ids` | JSON array | No | Agent defaults | Metric IDs to evaluate (22 chars each) |
| `iteration_count` | integer | No | `1` | Runs per test case (1-10) |
| `concurrency` | integer | No | `1` | Concurrent simulations (1-5) |
| `metadata` | JSON object | No | `{}` | Custom metadata for tracking |
| `max_wait_time` | integer | No | `600` | Max wait time in seconds |
| `check_interval` | integer | No | `30` | Status check interval in seconds |

### Outputs

| Output | Type | Description |
|--------|------|-------------|
| `run_id` | string | Unique run identifier |
| `status` | string | Final status (COMPLETED, FAILED, etc.) |
| `run_url` | string | Dashboard URL to view results |

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `COVAL_API_KEY` | Yes | Your Coval API key |

## API Details

The action uses the Coval v1 Runs API:

### Launch Run

**Endpoint:** `POST https://api.coval.dev/v1/runs`

**Request:**

```json
{
  "agent_id": "gk3jK9mPq2xRt5vW8yZaBc",
  "persona_id": "hL4kL0nQr3ySt6vX9zAcDd",
  "test_set_id": "aB1cD2eF",
  "metric_ids": ["iM5lM1oRs4zTu7wY0aBdEe"],
  "options": {
    "iteration_count": 3,
    "concurrency": 2
  },
  "metadata": {
    "campaign": "q4_2025"
  }
}
```

**Response:**

```json
{
  "run": {
    "run_id": "8EktrIgaVxn9LfxkIynagX",
    "status": "PENDING",
    "create_time": "2025-10-14T12:00:00Z"
  }
}
```

### Monitor Run

**Endpoint:** `GET https://api.coval.dev/v1/runs/{run_id}`

**Response:**

```json
{
  "run": {
    "run_id": "8EktrIgaVxn9LfxkIynagX",
    "status": "IN PROGRESS",
    "progress": {
      "total_test_cases": 10,
      "completed_test_cases": 5,
      "failed_test_cases": 0,
      "in_progress_test_cases": 1
    }
  }
}
```

### Run Statuses

| Status | Description |
|--------|-------------|
| `PENDING` | Waiting to start |
| `IN QUEUE` | Queued for execution |
| `IN PROGRESS` | Running test cases |
| `COMPLETED` | Successfully completed |
| `FAILED` | Run failed |

## Examples

### Environment-Based Testing ```yaml name: Multi-Environment Testing on: push: branches: [main, staging, dev] jobs: evaluate: runs-on: ubuntu-latest steps: - name: Set Environment id: env run: | if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then echo "agent=prodAgentId12345678" >> $GITHUB_OUTPUT echo "env=production" >> $GITHUB_OUTPUT elif [[ "${{ github.ref }}" == 
"refs/heads/staging" ]]; then echo "agent=stgAgentId123456789" >> $GITHUB_OUTPUT echo "env=staging" >> $GITHUB_OUTPUT else echo "agent=devAgentId123456789" >> $GITHUB_OUTPUT echo "env=development" >> $GITHUB_OUTPUT fi - name: Evaluate uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: ${{ steps.env.outputs.agent }} persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "aB1cD2eF" metadata: '{"env": "${{ steps.env.outputs.env }}", "commit": "${{ github.sha }}"}' ``` ### Parallel Persona Testing ```yaml name: Multi-Persona Testing on: workflow_dispatch: jobs: test: runs-on: ubuntu-latest strategy: matrix: persona: - { id: "persona1234567890abcd", name: "Friendly" } - { id: "persona1234567890efgh", name: "Frustrated" } - { id: "persona1234567890ijkl", name: "Technical" } steps: - name: Test ${{ matrix.persona.name }} uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: ${{ matrix.persona.id }} test_set_id: "aB1cD2eF" metadata: '{"persona": "${{ matrix.persona.name }}"}' ``` ### Scheduled Regression Testing ```yaml name: Nightly Regression on: schedule: - cron: '0 2 * * *' # 2 AM daily jobs: regression: runs-on: ubuntu-latest steps: - name: Run Tests uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "regrTest" iteration_count: 5 concurrency: 3 max_wait_time: 1800 ``` ## Troubleshooting ### Invalid API Key ``` Status Code: 401 Error Code: UNAUTHENTICATED Message: Invalid or missing API key ``` **Solution:** Verify `COVAL_API_KEY` is set correctly in GitHub Secrets. ### Invalid Agent ID ``` Status Code: 400 Error Code: INVALID_ARGUMENT Message: Invalid agent_id: Agent not found ``` **Solution:** Confirm the agent ID is 22 characters and exists in your organization. 
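The length and range constraints from the Configuration Reference can also be checked locally before a workflow is committed. The following is a minimal sketch and not part of the Coval action: the function name `find_input_problems` and the dictionary layout are assumptions made for illustration, while the limits themselves are copied from the Inputs table.

```python
# Hypothetical pre-flight check; the limits mirror the Configuration Reference.
ID_LENGTHS = {"agent_id": 22, "persona_id": 22, "test_set_id": 8}
INT_RANGES = {"iteration_count": (1, 10), "concurrency": (1, 5)}


def find_input_problems(inputs: dict) -> list:
    """Return one readable message per input that breaks a documented limit."""
    problems = []
    # Required IDs: a missing ID is reported as having length 0.
    for name, expected in ID_LENGTHS.items():
        value = inputs.get(name, "")
        if len(value) != expected:
            problems.append(f"{name}: expected {expected} characters, got {len(value)}")
    # Optional integers: only checked when present.
    for name, (low, high) in INT_RANGES.items():
        if name in inputs and not low <= inputs[name] <= high:
            problems.append(f"{name}: value must be between {low} and {high}")
    return problems


candidate = {
    "agent_id": "gk3jK9mPq2xRt5vW8yZaBc",
    "persona_id": "hL4kL0nQr3ySt6vX9zAcDd",
    "test_set_id": "aB1cD2eF",
    "iteration_count": 12,  # outside the documented 1-10 range
}
for problem in find_input_problems(candidate):
    print(problem)
```

A check like this only catches malformed values. An ID with the right length can still be rejected by the API if it does not exist in your organization.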
### Validation Errors

```
Status Code: 400
Details:
  - iteration_count: Value must be between 1 and 10
```

**Solution:** Ensure all parameters meet the constraints listed in the Configuration Reference.

### Timeout

**Solution:** Increase `max_wait_time` for larger test sets or check the Coval dashboard for run status.

### Invalid JSON

```yaml
# Wrong - will fail
metric_ids: ["id1", "id2"]

# Correct - use single quotes around JSON
metric_ids: '["id1", "id2"]'
```

## Resources

- [API Reference](/api-reference/v1/runs/runs/launch-run)
- [Coval Dashboard](https://app.coval.dev)
- [GitHub Action Repository](https://github.com/coval-ai/coval-github-action)
- [Support](mailto:support@coval.dev)

---

## Overview

Source: https://docs.coval.ai/use-cases/overview

Explore real-world examples of how teams use Coval to evaluate and improve their AI agents.

---

## Airline Help Desk

Source: https://docs.coval.ai/use-cases/leveraging-test-users

This example demonstrates how to leverage test users to evaluate an airline help desk voice agent. We'll assume the voice agent has access to an internal system that maintains customer accounts.

## Goal

Ensure that the airline voice agent books users on the correct flights.

## Step One: Configure Your Agent Attributes

The first step in testing the agent is to configure a list of test users that exist in the agent's internal system. These users will be used throughout your test sets.

Navigate to [the Agent Details page](https://app.coval.dev/coval/agents) and add the following attributes:

```json
{
  "qa_accounts": {
    "user1": {
      "tier": "platinum",
      "miles": 100000,
      "credit_card": "379923037966854",
      "user_token": "duhfsaihd1234567654323456789"
    },
    "user2": {
      "tier": "standard",
      "miles": 8,
      "credit_card": "4134823389064963",
      "user_token": "8976890dfaoisfuapsd80873248179"
    }
  }
}
```

Notice that `user1` is a platinum member while `user2` is a standard member. This allows us to compare the agent's behavior between different user tiers.
## Step Two: Create Booking Test Cases

![Example Booking Test Cases](/images/use-cases/leveraging-users/test-case.png)

Now we can use these users in test cases. Let's examine the first test case configuration.

### Setting Up Test Case Metadata

The goal of the metadata is to store values we need for deterministic validation. When we create metrics later, we'll need to know the exact flight path (in airport codes) to perform simple comparisons on the ticket.

For this test case, we configure the following metadata:

- **source**: `SFO`
- **destination**: `LAX`
- **user**: `user1`

The `user` field identifies which user account the flight will be booked on, allowing us to verify the booking was made for the correct account.

### Test Case Prompt

```markdown
You are calling an airline help desk. Book a flight from {{test_case.source}} to {{test_case.destination}}. Use your credit card: {{agent.qa_accounts.user1.credit_card}}
```

Referencing the test case metadata and agent attributes in the prompt keeps everything in sync: each value is defined in one place, so you only have to change it there. When the simulation runs, the prompt is rendered as:

```markdown
You are calling an airline help desk. Book a flight from SFO to LAX. Use your credit card: 379923037966854
```

You can create many permutations of this test case with different sources, destinations, and users.

## Step Three: Create an API State Match Metric

In Step Two, you created a test set with many user and ticket combinations. After each simulation, we want to check whether the airline's internal database has a ticket for our user. To do this, navigate to the [metric creation page](https://app.coval.dev/appointmentdemo/metrics/create) and create an API State Match metric.

![Example Metric](/images/use-cases/leveraging-users/metric.png)

In our example, the airline has an API that allows us to see all tickets for a specific user. It takes in a `userId` and a `user_token`, and outputs a list of tickets.
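The placeholder substitution used in the Step Two prompt can be illustrated with a small sketch. This is not Coval's renderer: `render_template` is a hypothetical helper that resolves dotted `{{...}}` paths against nested dictionaries, assuming `test_case` holds the test case metadata and `agent` holds the agent attributes.

```python
import re

# Matches {{ dotted.path }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_template(template: str, context: dict) -> str:
    """Replace each {{a.b.c}} placeholder with the value found at that dotted path."""

    def resolve(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the path does not exist
        return str(value)

    return PLACEHOLDER.sub(resolve, template)


# Illustrative context built from the values in Step One and Step Two.
context = {
    "test_case": {"source": "SFO", "destination": "LAX", "user": "user1"},
    "agent": {"qa_accounts": {"user1": {"credit_card": "379923037966854"}}},
}
prompt = (
    "Book a flight from {{test_case.source}} to {{test_case.destination}}. "
    "Use your credit card: {{agent.qa_accounts.user1.credit_card}}"
)
print(render_template(prompt, context))
```

Because a missing path raises `KeyError` in this sketch, a typo in a placeholder fails loudly instead of producing a prompt with a blank value.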
After the simulation, we will call the API with `{{test_case.user}}`, which will be transformed to `user1` for our first test case.

```json
{ "user": "user1" }
```

We will receive the response:

```json
{
  "tickets": [
    {
      "source": "SFO",
      "destination": "LAX",
      "date_booked": "12/12/25",
      "confirmed": true
    }
  ]
}
```

Use the match path `tickets[source={{test_case.source}},destination={{test_case.destination}}].confirmed`. This will be rendered as, for example, `tickets[source=SFO,destination=LAX].confirmed` for a given test case. It will select the first ticket that matches both source and destination, and verify the `confirmed` field.

If the ticket exists and is confirmed, the metric will return `MATCH`. If the ticket exists but is not confirmed, it will return `DIFF`. If the ticket doesn't exist, it will return `NOT_FOUND`.

## Step Four: Run Your Simulations!

Now, we have all the building blocks to run our simulations.

---

## Evaluating Tool Calls

Source: https://docs.coval.ai/use-cases/evaluating-tool-calls

Use OpenTelemetry traces to validate what your agent's tools actually did, beyond what the transcript shows. We'll trace one insurance call all the way through.

The transcript tells you what the agent *said*. It doesn't tell you whether the tool call actually fired, what arguments it received, how long it took, or what it returned. To validate tool calls, instrument your agent with tracing so you can see everything that happened underneath the conversation.

This cookbook follows one running example, an **auto-insurance voice agent** named `claims-bot`, from instrumentation all the way to a custom LLM judge that catches it making up data.

## The Example

A caller dials `claims-bot` after locking their keys in the car. The agent needs to:

1. Verify the caller's identity (DOB + last 4 of policy number)
2. Look up their roadside-assistance coverage
3. Dispatch a locksmith and quote an ETA

A clean run uses three tools: `verify_caller`, `lookup_policy`, `dispatch_roadside`.
We want every factual claim the agent makes on the call (the policy tier, the dispatch ETA, the claim ID) to be backed by a real tool result, not invented because a tool errored.

## Step One: Instrument `claims-bot` with OpenTelemetry

Send traces from your agent to Coval using the OpenTelemetry SDK. This captures detailed span data (tool calls, LLM invocations, and other operations) and exports it directly to Coval alongside your simulation results.

Follow the setup guide: [**OpenTelemetry Traces**](/concepts/simulations/traces/opentelemetry).

When you instrument the LLM, emit a `tool_call` span (or `llm_tool_call`, depending on your convention) **per tool invocation** with the tool's arguments, result, and any error. See [**Instrumenting LLM Spans**](/concepts/simulations/traces/opentelemetry#instrumenting-llm-spans) for the exact shape Coval expects.

For our `claims-bot` example, a single turn that calls `lookup_policy` looks like this:

```python
import json

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("claims-bot")

# `lookup_policy` and `ToolError` come from your own agent code.
with tracer.start_as_current_span("tool_call") as span:
    span.set_attribute("tool.name", "lookup_policy")
    span.set_attribute("tool.arguments", json.dumps({"policy_id": "P-48213"}))
    try:
        result = lookup_policy(policy_id="P-48213")
        span.set_attribute("tool.result", json.dumps(result))
    except ToolError as e:
        # Record the failure on the span, then let the caller handle it.
        span.set_attribute("tool.error", str(e))
        span.set_status(Status(StatusCode.ERROR))
        raise
```

After a simulation runs, the trace for our locksmith call should contain three sibling `tool_call` spans under the LLM turn that produced them: `verify_caller`, `lookup_policy`, `dispatch_roadside`.

> **Info:** Without `tool_call` spans, Coval can only show you the surrounding conversation. The span is what unlocks every step below.

## Step Two: Inspect the Trace for the Locksmith Call

Run a simulation of the locksmith scenario, then open its detail page and click the **Traces** card.
You'll see each tool call in order with everything you need to debug it:

| What you see in the trace | Example value from the locksmith call |
|---|---|
| Tool name | `dispatch_roadside` |
| Arguments | `{"policy_id": "P-48213", "service": "locksmith", "location": "37.7749,-122.4194"}` |
| Result | `{"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."}` |
| Latency | `1.8s` |
| Error (if any) | `null` |
| Span metadata | `model="gpt-4o", turn=4, parent_span=llm_turn` |

In the transcript, the agent says *"I've dispatched a locksmith: claim RA-90412, ETA about 35 minutes."* The trace lets you confirm those numbers came from the tool and weren't hallucinated.

This is the fastest way to debug a single failing conversation: you can see exactly which tool returned bad data, timed out, or never fired at all, instead of guessing from the transcript.

## Step Three: Search Tool Calls Across All `claims-bot` Simulations

To look across many simulations at once, use **Trace Search**. Try a query like `tool calls in last week` to pull every tool call span from the past 7 days:

[**Open Trace Search →**](https://app.coval.dev/coval-traces/traces?span=tool_call&q=tool+calls+in+last+week)

For `claims-bot` last week, this surfaces a pattern: `dispatch_roadside` returned `SERVICE_UNAVAILABLE` on **12 of 340 calls** (≈3.5%).

From the search results you can:

- **Drill into specific simulations**: open any of those 12 calls and see how the agent responded after the error. Did it tell the caller dispatch was unavailable, or did it confidently quote a fake claim ID?
- **View the failure matrix**: see whether errors cluster on a specific region, policy tier, or time of day.
- **Refine the query**: narrow to `dispatch_roadside SERVICE_UNAVAILABLE` to look only at the failure cohort, or to a single agent if you run multiple.
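The failure rate quoted above is a simple ratio, and the same arithmetic extends to a per-tool breakdown. The sketch below assumes tool call spans have been exported as dictionaries with hypothetical `tool` and `error` keys; this shape is illustrative only and is not Coval's export format.

```python
from collections import Counter


def error_rates(spans: list) -> dict:
    """Return the fraction of calls that ended in an error, keyed by tool name."""
    calls = Counter(span["tool"] for span in spans)
    errors = Counter(span["tool"] for span in spans if span.get("error"))
    return {tool: errors[tool] / total for tool, total in calls.items()}


# Synthetic data shaped like the example: 12 failures out of 340 dispatch calls.
spans = (
    [{"tool": "dispatch_roadside", "error": "SERVICE_UNAVAILABLE"}] * 12
    + [{"tool": "dispatch_roadside", "error": None}] * 328
    + [{"tool": "lookup_policy", "error": None}] * 340
)
for tool, rate in error_rates(spans).items():
    print(f"{tool}: {rate:.1%}")
```

Grouping by an extra key such as region or policy tier follows the same pattern: count calls and errors per group, then divide.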
This is where systemic issues surface, such as "this tool fails on West Coast traffic" or "the agent fabricates ETAs whenever dispatch errors", that you'd never catch one transcript at a time.

## Step Four: Catch Fabrication With a Custom Trace Metric

Drilling into one of the 12 failure cases above, we want to score *automatically*, across every future simulation, whether the agent fabricated data after a tool error. We do this with a custom LLM judge metric that runs over the trace, not just the transcript.

Create the metric in the dashboard:

- **Name:** `Tool-Grounded Claim Integrity`
- **Metric Type:** Text LLM Judge
- **Output Type:** Binary (YES / NO)
- **`includeTraces=True`** ← this is the critical setting

> **Warning:** You **must** toggle **`includeTraces=True`** when creating the metric. Without it the judge only sees the transcript: it has no visibility into your tool spans, so it can't tell a real claim ID from a fabricated one. With it on, the full span tree (tool names, arguments, results, errors) is passed in as TRACE CONTEXT alongside the transcript.

For the metric prompt, paste:

```text
You are evaluating a voice assistant conversation. You have access to the
TRANSCRIPT of the call and the TRACE CONTEXT, which includes `llm_tool_call`
spans showing every tool the assistant invoked along with arguments and results.

Return YES if ALL of the following are true:
- Every specific factual claim the assistant made (appointment times,
  prescription status, account balances, claim IDs, policy details, roadside
  dispatch ETAs, etc.) was preceded by a corresponding tool_call span that
  retrieved that data
- When a tool returned an error (for example SERVICE_UNAVAILABLE), the assistant
  correctly told the caller the information was unavailable and gave alternate
  guidance; it did NOT fabricate values
- Tool call ordering was logical (e.g. looking up a policy before filing a claim
  under it, verifying identity before disclosing account details)

Return NO if ANY of the following are true:
- The assistant stated specific data without a preceding tool_call span to
  retrieve it
- The assistant invented plausible-looking values after a tool returned an error
- The assistant skipped a required verification or lookup step

Be strict. General conversational statements (greetings, empathy, closing) do
not require tool calls; only specific factual claims do.
```

Run this against the past week of `claims-bot` simulations. The metric should:

- Return **YES** on the clean locksmith run from Step Two (every claim is grounded in a real span).
- Return **NO** on calls where `dispatch_roadside` errored but the agent still quoted an ETA, exactly the failure mode Trace Search surfaced in Step Three.

You can write this kind of judge as specific or as general as you like. The model sees the full span tree, so anything visible in the trace is fair game: argument shapes, result fields, error codes, span ordering, latency, and so on.

## Coming Soon: Latency Metrics on Tool Calls

We're rolling out custom trace metrics that evaluate tool call **timing** directly:

- *"How long did tool calls take to return (avg / p50 / p95)?"*
- *"How long did the `dispatch_roadside` tool specifically take (avg / p95)?"*

Useful for catching regressions in tool performance over time, or for flagging individual simulations where a tool exceeded an SLO. Stay tuned; coming shortly.

---

## Hackathons

Source: https://docs.coval.ai/collaborate/hackathons/overview

Take part in the hackathons we support and collaborate with us in advancing responsible AI development by testing your agents!

## Upcoming Hackathons

*No upcoming hackathons at this time. Check back soon!*

## Past Hackathons

### Gemini x Pipecat Hackathon

**Saturday, October 11, 2024 at 9am**
**YC SF Office**

We were excited to co-host a special voice and realtime AI hackathon with Daily (W16) and Google DeepMind at the YC office. The event brought together builders to create multimodal AI applications, working alongside engineers from Google, Daily, Boundary (W23), Coval (S24), Langfuse (W23), Tavus (S21), and the AI Tinkerers community.

The hackathon featured prizes including a guaranteed YC interview, lunch with Google engineers, and special swag for winners.

[Image: Hackathon participants collaborating]
[Image: Hackathon team presentations]
[Image: Hackathon workspace]
[Image: Hackathon networking]

---

## API Reference

Source: https://docs.coval.ai/api-reference/v1/introduction

The Coval REST API enables you to programmatically launch voice and chat evaluations, manage test data, and analyze AI agent performance.

## Most Used

- [Runs](/api-reference/v1/runs/runs/launch-run): Launch evaluation runs and track results across your agents
- [Agents](/api-reference/v1/agents/agents/list-agents): Connect and configure your AI agents for testing
- [Simulations](/api-reference/v1/simulations/simulations/list-simulations): View simulation results with transcripts and metric scores
- [Test Sets](/api-reference/v1/test-sets/test-sets/list-test-sets): Create and manage test cases for your evaluations

## Getting Started

**Step: Get your API key**

Obtain your API key from the [Coval Dashboard](https://app.coval.dev/settings). See the [API Keys guide](/guides/api-keys) for detailed setup instructions.
**Step: Authenticate your requests**

Include your API key in the `X-API-Key` header:

```bash
curl https://api.coval.dev/v1/agents \
  -H "X-API-Key: your_api_key"
```

**Step: Create your first agent**

Set up an agent configuration for testing:

```bash
curl -X POST https://api.coval.dev/v1/agents \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "My Test Agent",
    "model_type": "VOICE",
    "phone_number": "+15551234567"
  }'
```

**Step: Launch a simulation**

Run evaluations against your agent by launching a run with the [Runs API](/api-reference/v1/runs/runs/launch-run).

## Base URL

```
https://api.coval.dev/v1
```

## OpenAPI Specification

We publish our OpenAPI specifications at public endpoints (no authentication required).

### List available specs

```bash
GET https://api.coval.dev/v1/openapi
```

Returns the available `spec_name` values and URLs.

### Fetch a specific spec

```bash
GET https://api.coval.dev/v1/openapi/{spec_name}
```

- **Default response**: YAML (`application/yaml`)
- **JSON response**: set `Accept: application/json`

### Examples

```bash
# List all available specs
curl -s https://api.coval.dev/v1/openapi

# Fetch a spec as YAML (default)
curl -s https://api.coval.dev/v1/openapi/agents

# Fetch a spec as JSON
curl -s -H "Accept: application/json" https://api.coval.dev/v1/openapi/agents
```

## Authentication

Every API request must include an `X-API-Key` header. The only exception is the public OpenAPI specification endpoints described above.
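The `X-API-Key` header can be attached from any HTTP client, not only `curl`. As a minimal sketch using only the Python standard library, the hypothetical helper `build_request` below prepares an authenticated request. The environment variable name `COVAL_API_KEY` is borrowed from the GitHub Actions setup and is an assumption here; nothing is sent over the network until `urlopen` is called.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.coval.dev/v1"


def build_request(path: str, api_key: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; supplying a JSON body makes it a POST."""
    headers = {"X-API-Key": api_key}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(f"{BASE_URL}{path}", data=data, headers=headers)


# Falls back to a placeholder so the sketch runs without a real key.
request = build_request("/agents", os.environ.get("COVAL_API_KEY", "your_api_key"))
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Reading the key from the environment keeps it out of source control, which matches how the GitHub Actions workflow passes it in as a secret.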
--- ## list-runs Source: https://docs.coval.ai/api-reference/v1/runs/runs/list-runs --- openapi: get /runs --- --- ## launch-run Source: https://docs.coval.ai/api-reference/v1/runs/runs/launch-run --- openapi: post /runs --- --- ## get-run Source: https://docs.coval.ai/api-reference/v1/runs/runs/get-run --- openapi: get /runs/{run_id} --- --- ## update-run Source: https://docs.coval.ai/api-reference/v1/runs/runs/update-run --- openapi: patch /runs/{run_id} --- --- ## delete-run Source: https://docs.coval.ai/api-reference/v1/runs/runs/delete-run --- openapi: delete /runs/{run_id} --- --- ## list-simulations Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/list-simulations --- openapi: get /v1/simulations --- --- ## get-simulation Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/get-simulation --- openapi: get /v1/simulations/{simulation_id} --- --- ## delete-or-cancel-simulation Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/delete-or-cancel-simulation --- openapi: delete /v1/simulations/{simulation_id} --- --- ## get-audio-file-url Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/get-audio-file-url --- openapi: get /v1/simulations/{simulation_id}/audio --- --- ## rerun-a-simulation Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/rerun-a-simulation --- openapi: post /v1/simulations/{simulation_id}/resimulate --- --- ## update-simulation Source: https://docs.coval.ai/api-reference/v1/simulations/simulations/update-simulation --- openapi: patch /v1/simulations/{simulation_id} --- --- ## list-metrics Source: https://docs.coval.ai/api-reference/v1/simulations/metric-outputs/list-metrics --- openapi: get /v1/simulations/{simulation_id}/metrics --- --- ## list-conversations Source: https://docs.coval.ai/api-reference/v1/conversations/conversations/list-conversations --- openapi: get /v1/conversations --- --- ## submit-conversation-for-evaluation Source: 
https://docs.coval.ai/api-reference/v1/conversations/conversations/submit-conversation-for-evaluation --- openapi: post /v1/conversations:submit --- --- ## attach-audio-to-a-conversation Source: https://docs.coval.ai/api-reference/v1/conversations/conversations/attach-audio-to-a-conversation --- openapi: patch /v1/conversations/{conversation_id} --- --- ## get-conversation-details Source: https://docs.coval.ai/api-reference/v1/conversations/conversations/get-conversation-details --- openapi: get /v1/conversations/{conversation_id} --- --- ## delete-or-cancel-conversation Source: https://docs.coval.ai/api-reference/v1/conversations/conversations/delete-or-cancel-conversation --- openapi: delete /v1/conversations/{conversation_id} --- --- ## list-conversation-metrics Source: https://docs.coval.ai/api-reference/v1/conversations/conversations/list-conversation-metrics --- openapi: get /v1/conversations/{conversation_id}/metrics --- --- ## upload-audio Source: https://docs.coval.ai/api-reference/v1/conversations/audio/upload-audio --- openapi: post /v1/audio:upload --- --- ## get-conversation-audio Source: https://docs.coval.ai/api-reference/v1/conversations/audio/get-conversation-audio --- openapi: get /v1/conversations/{conversation_id}/audio --- --- ## list-agents Source: https://docs.coval.ai/api-reference/v1/agents/agents/list-agents --- openapi: get /v1/agents --- --- ## connect-an-agent Source: https://docs.coval.ai/api-reference/v1/agents/agents/connect-an-agent --- openapi: post /v1/agents --- --- ## get-agent Source: https://docs.coval.ai/api-reference/v1/agents/agents/get-agent --- openapi: get /v1/agents/{agent_id} --- --- ## update-agent Source: https://docs.coval.ai/api-reference/v1/agents/agents/update-agent --- openapi: patch /v1/agents/{agent_id} --- --- ## delete-agent Source: https://docs.coval.ai/api-reference/v1/agents/agents/delete-agent --- openapi: delete /v1/agents/{agent_id} --- --- ## list-mutations Source: 
https://docs.coval.ai/api-reference/v1/mutations/mutations/list-mutations --- openapi: get /v1/agents/{agent_id}/mutations --- --- ## create-mutation Source: https://docs.coval.ai/api-reference/v1/mutations/mutations/create-mutation --- openapi: post /v1/agents/{agent_id}/mutations --- --- ## get-mutation Source: https://docs.coval.ai/api-reference/v1/mutations/mutations/get-mutation --- openapi: get /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## update-mutation Source: https://docs.coval.ai/api-reference/v1/mutations/mutations/update-mutation --- openapi: patch /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## delete-mutation Source: https://docs.coval.ai/api-reference/v1/mutations/mutations/delete-mutation --- openapi: delete /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## list-test-sets Source: https://docs.coval.ai/api-reference/v1/test-sets/test-sets/list-test-sets --- openapi: get /test-sets --- --- ## create-test-set Source: https://docs.coval.ai/api-reference/v1/test-sets/test-sets/create-test-set --- openapi: post /test-sets --- --- ## get-test-set Source: https://docs.coval.ai/api-reference/v1/test-sets/test-sets/get-test-set --- openapi: get /test-sets/{test_set_id} --- --- ## update-test-set Source: https://docs.coval.ai/api-reference/v1/test-sets/test-sets/update-test-set --- openapi: patch /test-sets/{test_set_id} --- --- ## delete-test-set Source: https://docs.coval.ai/api-reference/v1/test-sets/test-sets/delete-test-set --- openapi: delete /test-sets/{test_set_id} --- --- ## list-test-cases Source: https://docs.coval.ai/api-reference/v1/test-cases/test-cases/list-test-cases --- openapi: get /test-cases --- --- ## create-test-case Source: https://docs.coval.ai/api-reference/v1/test-cases/test-cases/create-test-case --- openapi: post /test-cases --- --- ## get-test-case Source: https://docs.coval.ai/api-reference/v1/test-cases/test-cases/get-test-case --- openapi: get /test-cases/{test_case_id} --- --- ## delete-test-case 
Source: https://docs.coval.ai/api-reference/v1/test-cases/test-cases/delete-test-case --- openapi: delete /test-cases/{test_case_id} --- --- ## update-test-case Source: https://docs.coval.ai/api-reference/v1/test-cases/test-cases/update-test-case --- openapi: patch /test-cases/{test_case_id} --- --- ## list-personas Source: https://docs.coval.ai/api-reference/v1/personas/personas/list-personas --- openapi: get /personas --- --- ## create-persona Source: https://docs.coval.ai/api-reference/v1/personas/personas/create-persona --- openapi: post /personas --- --- ## get-persona Source: https://docs.coval.ai/api-reference/v1/personas/personas/get-persona --- openapi: get /personas/{persona_id} --- --- ## update-persona Source: https://docs.coval.ai/api-reference/v1/personas/personas/update-persona --- openapi: patch /personas/{persona_id} --- --- ## delete-persona Source: https://docs.coval.ai/api-reference/v1/personas/personas/delete-persona --- openapi: delete /personas/{persona_id} --- --- ## list-available-voices Source: https://docs.coval.ai/api-reference/v1/personas/personas/list-available-voices --- openapi: get /personas/voices --- --- ## list-phone-number-mappings Source: https://docs.coval.ai/api-reference/v1/personas/personas/list-phone-number-mappings --- openapi: get /personas/phone-numbers --- --- ## list-metrics Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/list-metrics --- openapi: get /v1/metrics --- --- ## create-metric Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/create-metric --- openapi: post /v1/metrics --- --- ## get-metric Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/get-metric --- openapi: get /v1/metrics/{metric_id} --- --- ## update-metric Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/update-metric --- openapi: patch /v1/metrics/{metric_id} --- --- ## delete-metric Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/delete-metric --- openapi: delete 
/v1/metrics/{metric_id} --- --- ## list-metric-thresholds Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/list-metric-thresholds --- openapi: get /v1/metrics/{metric_id}/thresholds --- --- ## create-metric-threshold Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/create-metric-threshold --- openapi: post /v1/metrics/{metric_id}/thresholds --- --- ## get-metric-threshold Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/get-metric-threshold --- openapi: get /v1/metrics/{metric_id}/threshold --- --- ## update-metric-threshold Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/update-metric-threshold --- openapi: patch /v1/metrics/{metric_id}/threshold --- --- ## delete-metric-threshold Source: https://docs.coval.ai/api-reference/v1/metrics/metrics/delete-metric-threshold --- openapi: delete /v1/metrics/{metric_id}/thresholds/{threshold_id} --- --- ## list-run-templates Source: https://docs.coval.ai/api-reference/v1/run-templates/run-templates/list-run-templates --- openapi: get /v1/run-templates --- --- ## create-run-template Source: https://docs.coval.ai/api-reference/v1/run-templates/run-templates/create-run-template --- openapi: post /v1/run-templates --- --- ## get-run-template Source: https://docs.coval.ai/api-reference/v1/run-templates/run-templates/get-run-template --- openapi: get /v1/run-templates/{run_template_id} --- --- ## update-run-template Source: https://docs.coval.ai/api-reference/v1/run-templates/run-templates/update-run-template --- openapi: patch /v1/run-templates/{run_template_id} --- --- ## delete-run-template Source: https://docs.coval.ai/api-reference/v1/run-templates/run-templates/delete-run-template --- openapi: delete /v1/run-templates/{run_template_id} --- --- ## list-scheduled-runs Source: https://docs.coval.ai/api-reference/v1/scheduled-runs/scheduled-runs/list-scheduled-runs --- openapi: get /v1/scheduled-runs --- --- ## create-scheduled-run Source: 
https://docs.coval.ai/api-reference/v1/scheduled-runs/scheduled-runs/create-scheduled-run --- openapi: post /v1/scheduled-runs --- --- ## get-scheduled-run Source: https://docs.coval.ai/api-reference/v1/scheduled-runs/scheduled-runs/get-scheduled-run --- openapi: get /v1/scheduled-runs/{scheduled_run_id} --- --- ## update-scheduled-run Source: https://docs.coval.ai/api-reference/v1/scheduled-runs/scheduled-runs/update-scheduled-run --- openapi: patch /v1/scheduled-runs/{scheduled_run_id} --- --- ## delete-scheduled-run Source: https://docs.coval.ai/api-reference/v1/scheduled-runs/scheduled-runs/delete-scheduled-run --- openapi: delete /v1/scheduled-runs/{scheduled_run_id} --- --- ## list-review-projects Source: https://docs.coval.ai/api-reference/v1/reviews/review-projects/list-review-projects --- openapi: get /v1/review-projects --- --- ## create-review-project Source: https://docs.coval.ai/api-reference/v1/reviews/review-projects/create-review-project --- openapi: post /v1/review-projects --- --- ## get-review-project Source: https://docs.coval.ai/api-reference/v1/reviews/review-projects/get-review-project --- openapi: get /v1/review-projects/{project_id} --- --- ## update-review-project Source: https://docs.coval.ai/api-reference/v1/reviews/review-projects/update-review-project --- openapi: patch /v1/review-projects/{project_id} --- --- ## delete-review-project Source: https://docs.coval.ai/api-reference/v1/reviews/review-projects/delete-review-project --- openapi: delete /v1/review-projects/{project_id} --- --- ## list-review-annotations Source: https://docs.coval.ai/api-reference/v1/reviews/review-annotations/list-review-annotations --- openapi: get /v1/review-annotations --- --- ## create-review-annotation Source: https://docs.coval.ai/api-reference/v1/reviews/review-annotations/create-review-annotation --- openapi: post /v1/review-annotations --- --- ## get-review-annotation Source: 
https://docs.coval.ai/api-reference/v1/reviews/review-annotations/get-review-annotation --- openapi: get /v1/review-annotations/{annotation_id} --- --- ## update-review-annotation Source: https://docs.coval.ai/api-reference/v1/reviews/review-annotations/update-review-annotation --- openapi: patch /v1/review-annotations/{annotation_id} --- --- ## delete-review-annotation Source: https://docs.coval.ai/api-reference/v1/reviews/review-annotations/delete-review-annotation --- openapi: delete /v1/review-annotations/{annotation_id} --- --- ## list-dashboards Source: https://docs.coval.ai/api-reference/v1/dashboards/dashboards/list-dashboards --- openapi: get /v1/dashboards --- --- ## create-dashboard Source: https://docs.coval.ai/api-reference/v1/dashboards/dashboards/create-dashboard --- openapi: post /v1/dashboards --- --- ## get-dashboard Source: https://docs.coval.ai/api-reference/v1/dashboards/dashboards/get-dashboard --- openapi: get /v1/dashboards/{dashboard_id} --- --- ## update-dashboard Source: https://docs.coval.ai/api-reference/v1/dashboards/dashboards/update-dashboard --- openapi: patch /v1/dashboards/{dashboard_id} --- --- ## delete-dashboard Source: https://docs.coval.ai/api-reference/v1/dashboards/dashboards/delete-dashboard --- openapi: delete /v1/dashboards/{dashboard_id} --- --- ## list-widgets Source: https://docs.coval.ai/api-reference/v1/dashboards/widgets/list-widgets --- openapi: get /v1/dashboards/{dashboard_id}/widgets --- --- ## create-widget Source: https://docs.coval.ai/api-reference/v1/dashboards/widgets/create-widget --- openapi: post /v1/dashboards/{dashboard_id}/widgets --- --- ## get-widget Source: https://docs.coval.ai/api-reference/v1/dashboards/widgets/get-widget --- openapi: get /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## update-widget Source: https://docs.coval.ai/api-reference/v1/dashboards/widgets/update-widget --- openapi: patch /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## delete-widget Source: 
https://docs.coval.ai/api-reference/v1/dashboards/widgets/delete-widget --- openapi: delete /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## list-monitors Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/list-monitors --- openapi: get /monitors --- --- ## create-a-monitor Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/create-a-monitor --- openapi: post /monitors --- --- ## get-a-monitor Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/get-a-monitor --- openapi: get /monitors/{monitor_id} --- --- ## update-a-monitor Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/update-a-monitor --- openapi: patch /monitors/{monitor_id} --- --- ## delete-a-monitor Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/delete-a-monitor --- openapi: delete /monitors/{monitor_id} --- --- ## test-evaluate-a-monitor Source: https://docs.coval.ai/api-reference/v1/monitors/monitors/test-evaluate-a-monitor --- openapi: post /monitors/{monitor_id}/test-evaluate --- --- ## list-monitor-events Source: https://docs.coval.ai/api-reference/v1/monitors/monitor-events/list-monitor-events --- openapi: get /monitors/{monitor_id}/events --- --- ## list-api-keys Source: https://docs.coval.ai/api-reference/v1/api-keys/api-keys/list-api-keys --- openapi: get /v1/api-keys --- --- ## create-api-key Source: https://docs.coval.ai/api-reference/v1/api-keys/api-keys/create-api-key --- openapi: post /v1/api-keys --- --- ## update-api-key-status Source: https://docs.coval.ai/api-reference/v1/api-keys/api-keys/update-api-key-status --- openapi: patch /v1/api-keys/{api_key_id} --- --- ## delete-api-key Source: https://docs.coval.ai/api-reference/v1/api-keys/api-keys/delete-api-key --- openapi: delete /v1/api-keys/{api_key_id} --- --- ## CLI Source: https://docs.coval.ai/cli/overview Command-line interface for the Coval AI evaluation platform The **Coval CLI** provides terminal access to Coval's evaluation APIs for scripting, 
automation, and CI/CD integration. - [GitHub Repository](https://github.com/coval-ai/cli): View source, releases, and contribute ## Quick Start **Step: Install the CLI** ```bash brew install coval-ai/tap/coval ``` See [Installation](/cli/installation) for all methods. **Step: Authenticate** ```bash coval login ``` **Step: Launch an evaluation** ```bash coval runs launch \ --agent-id \ --persona-id \ --test-set-id ``` **Step: Watch progress** ```bash coval runs watch ``` ## Command Reference - [Agents](/cli/agents): Create, list, update, and delete agent configurations - [Runs](/cli/runs): Launch evaluations and monitor progress in real time - [Simulations](/cli/simulations): Inspect individual simulation results and download audio - [Conversations](/cli/conversations): Inspect monitored conversations, audio, and metric results - [Test Sets](/cli/test-sets): Organize test cases into collections - [Test Cases](/cli/test-cases): Define inputs and expected outputs for evaluations - [Personas](/cli/personas): Configure simulated callers with voice and language settings - [Metrics](/cli/metrics): Define how simulations are scored and evaluated - [Mutations](/cli/mutations): Test agent variations with config overrides - [API Keys](/cli/api-keys): Manage API keys for programmatic access - [Run Templates](/cli/run-templates): Save reusable evaluation configurations - [Scheduled Runs](/cli/scheduled-runs): Schedule recurring evaluation runs - [Dashboards](/cli/dashboards): Create dashboards and widgets for monitoring ## Global Flags All commands support these flags: | Flag | Description | Default | |------|-------------|---------| | `--format ` | Output format: `table` or `json` | `table` | | `--api-key ` | Override API key for this command | — | | `--api-url ` | Override API base URL | — | | `--help` | Show help for any command | — | ## JSON Output for Scripting Use `--format json` to get machine-readable output: ```bash # Get run status coval runs get abc123 --format json 
| jq '.status' # List agent IDs coval agents list --format json | jq '.[].id' # Extract simulation transcript coval simulations get sim123 --format json | jq '.transcript' ``` ## Requirements - macOS, Linux, or Windows - Coval API key from [Dashboard Settings](https://app.coval.dev/settings) --- ## Installation & Configuration Source: https://docs.coval.ai/cli/installation Install the Coval CLI and configure authentication ## Installation **Homebrew:** ```bash brew install coval-ai/tap/coval ``` **Cargo:** ```bash cargo install coval ``` Requires [Rust](https://rustup.rs/) to be installed. **Binary:** Download pre-built binaries from [GitHub Releases](https://github.com/coval-ai/cli/releases). Verify your installation: ```bash coval --help ``` ## Authentication ### Interactive Login ```bash coval login ``` You will be prompted to enter your API key. Get one from [Dashboard Settings](https://app.coval.dev/settings). ### API Key Flag Pass your API key directly: ```bash coval login --api-key sk_your_api_key ``` > **Warning:** Passing API keys as command arguments can expose them in shell history and process lists. For CI/CD pipelines, prefer using the `COVAL_API_KEY` environment variable or your CI provider's secret management instead. ### Verify Authentication ```bash coval whoami ``` Displays your masked API key (e.g., `sk_...****`) to confirm you are authenticated. ## Configuration The CLI stores configuration in a platform-specific config directory. Run `coval config path` to see the exact location on your system. ### View Config Path ```bash coval config path ``` ### Get a Config Value ```bash coval config get api_key coval config get api_url ``` ### Set a Config Value ```bash coval config set api_key sk_your_api_key coval config set api_url https://api.coval.dev ``` ### Config File Format ```toml api_key = "sk_..." 
api_url = "https://api.coval.dev" ``` ## Environment Variables Environment variables override config file values: | Variable | Description | |----------|-------------| | `COVAL_API_KEY` | API key (overrides config file) | | `COVAL_API_URL` | API base URL (overrides config file) | ```bash # Use in CI/CD pipelines export COVAL_API_KEY=sk_your_api_key coval runs launch --agent-id abc123 --persona-id xyz789 --test-set-id ts123 ``` ## Supported Platforms - macOS (Intel and Apple Silicon) - Linux (x86_64) - Windows --- ## Agent Mode Source: https://docs.coval.ai/cli/agent-mode Run the Coval CLI from autonomous agents with structured I/O, discovery, and next actions > **Note:** Coval's agent mode implements [Open ACI](https://github.com/lorenjphillips/open-aci), an open draft specification (v0.1) for agent-native command-line interfaces. The Coval CLI is its reference implementation. Agent mode is a global behavior toggled with `--agent`. It turns every command into a non-interactive call that returns a stable JSON envelope, suppresses prompts and progress UI, and emits suggested next actions for an orchestrating agent to chain on. ## When to use it | Use ordinary mode when | Use agent mode when | |---|---| | A human is exploring at a terminal | A coding agent, script, or wrapper invokes the CLI | | You want spinners, prompts, and color | You need a stable contract for parsing | | You want a pager or table view | You want structured errors and next-action hints | ## Quick start ```bash # Discover what the CLI can do coval --agent agent doctor coval --agent agent manifest # Inspect a resource surface coval --agent runs context # Launch a run with structured input coval --agent runs launch --input-json @run.json ``` ## The envelope Every agent-mode response — success or failure — returns the same top-level envelope. 
### Success

```json
{
  "aci": "0.1",
  "ok": true,
  "resource": "runs",
  "operation": "launch",
  "summary": "Run run_123 was queued.",
  "data": {
    "id": "run_123",
    "status": "queued"
  },
  "warnings": [],
  "next_actions": [
    {
      "id": "watch",
      "label": "Watch run",
      "argv": ["coval", "--agent", "runs", "watch", "run_123"],
      "safe": true,
      "primary": true,
      "requires_confirmation": false
    }
  ]
}
```

### Error

```json
{
  "aci": "0.1",
  "ok": false,
  "resource": "runs",
  "operation": "launch",
  "error": {
    "code": "invalid_input",
    "message": "test_set_id is required",
    "field": "test_set_id"
  },
  "warnings": [],
  "next_actions": []
}
```

### Field reference

| Field | Description |
|---|---|
| `aci` | Open ACI version this envelope conforms to |
| `ok` | `true` on success, `false` on error |
| `resource` / `operation` | The command namespace and verb |
| `summary` | One-line human-facing summary; safe to surface in chat UIs |
| `data` | Operation result. List operations return arrays; single-resource operations return objects |
| `warnings` | Non-fatal advisories the agent should consider before chaining |
| `next_actions` | Suggested follow-up commands with full `argv` ready to execute |
| `error` | Present only when `ok` is `false` |

Agent mode takes precedence over `--format json` and disables interactive prompts, progress bars, audio downloads, and not-found hints that would otherwise write to stderr.

## Discovery

Two commands let an agent introspect the CLI without prior knowledge of its surface.

### `agent doctor`

Reports whether the CLI is installed, authenticated, and able to reach the API. Run this first when bootstrapping.

```bash
coval --agent agent doctor
```

### `agent manifest`

Returns the full set of resources, operations, supported profiles (structured input, next actions, skills), and `help_argv` pointers. Agents should call this once per session and rely on `--help` for exhaustive flag detail rather than caching command shapes.
```bash coval --agent agent manifest ``` ## Resource context Every resource exposes a `context` subcommand that describes its operations, required fields, and primary next actions — without making a network call or requiring auth. ```bash coval --agent runs context coval --agent agents context coval --agent test-sets context ``` Use `context` to plan a chain of calls before executing any of them. ## Structured input with `--input-json` Body-bearing commands (`create`, `update`, `launch`, `submit`) accept structured input via `--input-json`. Explicit CLI flags overlay the JSON, so agents can pass a base payload and override individual fields. ```bash # Inline JSON coval --agent runs launch --input-json '{"agent_id":"a_1","test_set_id":"ts_1"}' # From a file coval --agent runs launch --input-json @run.json # From stdin cat run.json | coval --agent runs launch --input-json - # JSON + flag override (flag wins) coval --agent runs launch \ --input-json @run.json \ --persona-id persona_override ``` Invalid JSON in agent mode returns a structured `invalid_input` error rather than a free-text parse message. ## Agent skills `agent skills` lets an agent enumerate and install local skill bundles. By design there is no embedded catalog and no hardcoded remote source — agents must clone or pin a skills repo locally and point at it explicitly. ```bash # List skills available in a local source directory coval --agent agent skills list --source ./path/to/skills # Install a skill into a local destination coval --agent agent skills install \ --source ./path/to/skills \ --dest ./.agent-skills ``` Equivalent environment variables: `COVAL_SKILLS_SOURCE`, `COVAL_SKILLS_DEST`. Remote sources (URLs) are rejected with a structured error directing the agent to clone and pin the skill repository locally. Skill IDs containing path traversal are rejected before any filesystem access. ## Safety model - `next_actions` carry a `safe` boolean and a `requires_confirmation` flag. 
Mutating actions (skill installs, runs launch, etc.) are marked accordingly so an orchestrator can gate them. - Agent mode never prompts. A command that would prompt in human mode returns an error envelope in agent mode. - Stdout is the envelope only. Diagnostic chatter that humans see (progress, hints, audio paths) is suppressed. --- ## Agents Source: https://docs.coval.ai/cli/agents Manage AI agent configurations with the Coval CLI ## List Agents ```bash coval agents list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (e.g., `model_type="voice"`) | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Output columns:** ID, NAME, TYPE, CREATED ```bash # List all agents coval agents list # Filter by type coval agents list --filter 'model_type="voice"' # JSON output coval agents list --format json ``` ## Get Agent ```bash coval agents get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID | Returns full agent details as JSON including configuration, metadata, and associated resources. 
```bash coval agents get ag_abc123 ``` ## Create Agent ```bash coval agents create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name for the agent | | `--type` | string | **Yes** | Agent type (see below) | | `--phone-number` | string | Conditional | Phone number in E.164 format (required for `voice`, `sms`) | | `--endpoint` | string | Conditional | Webhook URL (required for `outbound-voice`) | | `--prompt` | string | No | System prompt / instructions | | `--metadata` | string | No | JSON string for agent metadata (e.g., `chat_endpoint`, `input_template`) | | `--metric-ids` | string | No | Comma-separated metric IDs to associate | | `--test-set-ids` | string | No | Comma-separated test set IDs to associate | ### Agent Types | Type | Description | Required Fields | |------|-------------|-----------------| | `voice` | Inbound voice agent | `--phone-number` | | `outbound-voice` | Outbound voice agent | `--endpoint` | | `chat` | Chat/text-based agent | `metadata.chat_endpoint` | | `sms` | SMS messaging agent | `--phone-number` | | `websocket` | WebSocket-based agent | `metadata.endpoint`, `metadata.initialization_json` | > **Info:** For `chat` and `websocket` agents, required fields like `chat_endpoint` and `initialization_json` are passed via the `--metadata` flag as a JSON string. 
```bash # Create a voice agent coval agents create \ --name "Support Agent" \ --type voice \ --phone-number "+15551234567" # Create a chat agent with metadata coval agents create \ --name "Chat Bot" \ --type chat \ --metadata '{"chat_endpoint":"https://api.example.com/chat"}' # Create with associated metrics and test sets coval agents create \ --name "Support Agent" \ --type voice \ --phone-number "+15551234567" \ --metric-ids "met_abc,met_def" \ --test-set-ids "ts_123" ``` ## Update Agent ```bash coval agents update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--type` | string | New agent type | | `--phone-number` | string | New phone number | | `--endpoint` | string | New endpoint URL | | `--prompt` | string | New system prompt | | `--metadata` | string | JSON string for agent metadata | | `--metric-ids` | string | Comma-separated metric IDs | | `--test-set-ids` | string | Comma-separated test set IDs | ```bash # Update agent name coval agents update ag_abc123 --name "Updated Agent Name" # Update agent metadata (e.g., chat endpoint and input template) coval agents update ag_abc123 \ --metadata '{"chat_endpoint":"https://proxy.example.com/chat","input_template":"{\"user_id\":\"{{user_id}}\"}"}' ``` ## Delete Agent ```bash coval agents delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID to delete | ```bash coval agents delete ag_abc123 ``` --- ## Runs Source: https://docs.coval.ai/cli/runs Launch and manage evaluation runs with the Coval CLI ## List Runs ```bash coval runs list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (e.g., `status="COMPLETED"`) | 
| `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Output columns:** ID, STATUS, PROGRESS, CREATED ```bash # List all runs coval runs list # Filter completed runs coval runs list --filter 'status="COMPLETED"' ``` ## Get Run ```bash coval runs get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_id` | string | **Yes** | The run ID | Returns full run details as JSON including status, progress, results, and metrics. ```bash coval runs get run_abc123 ``` ## Launch Run ```bash coval runs launch [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | Agent ID to evaluate | | `--persona-id` | string | **Yes** | Persona ID for simulated caller | | `--test-set-id` | string | **Yes** | Test set ID containing test cases | | `--iterations` | number | No | Iterations per test case (default: 1) | | `--concurrency` | number | No | Parallel simulations | | `--name` | string | No | Display name for the run | | `--mutation-id` | string | No | Single mutation ID to test | | `--mutation-ids` | string | No | Comma-separated mutation IDs | ```bash # Basic run coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 # Run with options coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --iterations 3 \ --concurrency 5 \ --name "Regression Test" # Run with mutations coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --mutation-ids "mut_001,mut_002,mut_003" ``` ## Update Run Modify tags on an existing run. Tags are fully replaced — pass the complete list you want on the run. 
```bash
coval runs update <run_id> --tags <tags>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_id` | string | **Yes** | The run ID to update |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--tags` | string | **Yes** | Comma-separated tags; replaces existing tags. Pass `""` to clear all tags. |

```bash
# Mark a run as the current baseline
coval runs update run_abc123 --tags baseline,prod

# Remove all tags from a run
coval runs update run_abc123 --tags ""
```

> **Info:** Max 20 tags per run, each up to 200 characters. Whitespace is trimmed and duplicates are dropped.

## Watch Run

Monitor a run's progress in real time with a live progress bar.

```bash
coval runs watch <run_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_id` | string | **Yes** | The run ID to watch |

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--interval` | number | 2 | Poll interval in seconds |

```bash
# Watch with default interval
coval runs watch run_abc123

# Watch with faster polling
coval runs watch run_abc123 --interval 1
```

The watch command displays a progress bar and exits when the run reaches a terminal status.

## Delete Run

```bash
coval runs delete <run_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_id` | string | **Yes** | The run ID to delete |

## Run Statuses

| Status | Description |
|--------|-------------|
| `PENDING` | Run is created but not yet started |
| `IN_QUEUE` | Run is queued for execution |
| `IN_PROGRESS` | Simulations are actively running |
| `COMPLETED` | All simulations finished successfully |
| `FAILED` | Run encountered an error |
| `CANCELLED` | Run was cancelled |
| `DELETED` | Run was deleted |

> **Info:** When using `--filter`, use the underscore-separated enum values (e.g., `status="IN_PROGRESS"`).
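
In a CI pipeline, the run statuses listed above can be turned into a pass/fail gate. The following is a minimal sketch, not part of the Coval CLI. The helper names `normalize_status`, `is_terminal_status`, `status_exit_code`, and `gate_on_run` are hypothetical. Treating `COMPLETED`, `FAILED`, `CANCELLED`, and `DELETED` as the terminal statuses is an assumption, since this page does not enumerate them. The sketch also assumes `jq` is installed and that the JSON from `coval runs get` exposes a top-level `status` field; because the casing of that field is not confirmed here, the value is upper-cased before comparison.

```bash
#!/usr/bin/env bash

# Upper-case a status so "completed" and "COMPLETED" compare equal.
normalize_status() {
  printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

# Assumption: these four statuses are terminal; the docs do not list them.
is_terminal_status() {
  case "$(normalize_status "$1")" in
    COMPLETED|FAILED|CANCELLED|DELETED) return 0 ;;
    *) return 1 ;;
  esac
}

# Map a status to an exit code: 0 = passed, 1 = did not pass, 2 = still running.
status_exit_code() {
  local status
  status="$(normalize_status "$1")"
  if ! is_terminal_status "$status"; then
    echo 2
  elif [ "$status" = "COMPLETED" ]; then
    echo 0
  else
    echo 1
  fi
}

# Hypothetical wrapper: wait for the run, then return a CI-friendly exit code.
gate_on_run() {
  local run_id="$1" status
  coval runs watch "$run_id"
  status="$(coval runs get "$run_id" --format json | jq -r '.status')"
  return "$(status_exit_code "$status")"
}
```

A pipeline step could then call `gate_on_run run_abc123` and rely on the step's exit code. Keeping the status mapping in a separate pure function makes it easy to adjust if the real set of terminal statuses differs.
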

---

## Simulations

Source: https://docs.coval.ai/cli/simulations

View simulation results and download audio with the Coval CLI

## List Simulations

```bash
coval simulations list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--run-id` | string | — | Filter by run ID |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |

**Output columns:** ID, STATUS, RUN, TEST CASE, AUDIO

```bash
# List all simulations
coval simulations list

# Filter by run
coval simulations list --run-id run_abc123

# Combine filters
coval simulations list --filter 'status="COMPLETED"' --run-id run_abc123
```

## Get Simulation

```bash
coval simulations get <simulation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `simulation_id` | string | **Yes** | The simulation ID |

Returns full simulation details as JSON including transcript, status, and metadata.

```bash
coval simulations get sim_abc123
```

## Download Audio

Download or get the audio URL for a simulation recording.

```bash
coval simulations audio <simulation_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `simulation_id` | string | **Yes** | The simulation ID |

| Option | Type | Description |
|--------|------|-------------|
| `-o, --output` | string | File path to save audio |

```bash
# Print audio URL
coval simulations audio sim_abc123

# Download audio file
coval simulations audio sim_abc123 -o recording.wav
```

When using `-o`, a progress bar shows the download status.

## List Metrics

List all metric results for a simulation.
```bash
coval simulations metrics <simulation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `simulation_id` | string | **Yes** | The simulation ID |

**Output columns:** OUTPUT ID, METRIC ID, STATUS, VALUE, SUBVALUES

```bash
coval simulations metrics sim_abc123
```

## Get Metric Detail

Retrieve metric results for a simulation. The second argument accepts two ID types and the output adapts accordingly:

- **26-char MetricOutput ULID**: prints one row (the specific output).
- **22-char Metric definition ID**: prints every output recorded for that metric on the simulation.

```bash
coval simulations metric-detail <simulation_id> <id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `simulation_id` | string | **Yes** | The simulation ID |
| `id` | string | **Yes** | Either a 26-char MetricOutput ULID or a 22-char Metric definition ID |

By MetricOutput ULID (one row):

```bash
coval simulations metric-detail sim_abc123 01ARZ3NDEKTSV4RRFFQ69G5FAV
```

By Metric definition ID (one or more rows):

```bash
coval simulations metric-detail sim_abc123 29BlkepvvX19ebbLDB0y6Q
```

## Delete Simulation

```bash
coval simulations delete <simulation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `simulation_id` | string | **Yes** | The simulation ID to delete |

## Simulation Statuses

| Status | Description |
|--------|-------------|
| `PENDING` | Simulation is created but not yet started |
| `IN_QUEUE` | Simulation is queued for execution |
| `IN_PROGRESS` | Simulation is actively running |
| `COMPLETED` | Simulation finished successfully |
| `FAILED` | Simulation encountered an error |
| `CANCELLED` | Simulation was cancelled |
| `DELETED` | Simulation was deleted |

> **Info:** When using `--filter`, use the underscore-separated enum values (e.g., `status="IN_PROGRESS"`).
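
Because `metric-detail` accepts two ID shapes, a wrapper script may want to know which one it holds before deciding whether to expect one row or several. The following is a minimal sketch that assumes the two kinds can be told apart purely by the lengths stated in the Get Metric Detail section (26 and 22 characters). `classify_metric_id` is a hypothetical helper, not a CLI command, and the labels it prints are made up for illustration.

```bash
#!/usr/bin/env bash

# Classify an ID by length only (an assumption based on the documented lengths).
classify_metric_id() {
  local id="$1"
  case "${#id}" in
    26) echo "metric_output" ;;      # MetricOutput ULID: expect one row
    22) echo "metric_definition" ;;  # Metric definition ID: expect one or more rows
    *)  echo "unknown"; return 1 ;;  # neither documented length
  esac
}
```

With the sample IDs from this page, `classify_metric_id 01ARZ3NDEKTSV4RRFFQ69G5FAV` prints `metric_output` and `classify_metric_id 29BlkepvvX19ebbLDB0y6Q` prints `metric_definition`. A length check does not validate the characters themselves, so the CLI remains the final authority on whether an ID is accepted.
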

---

## Conversations

Source: https://docs.coval.ai/cli/conversations

Inspect monitored conversations, audio, and metric results with the Coval CLI

A **conversation** is a recorded interaction (voice call or chat) submitted to Coval for monitoring evaluation. Conversations are distinct from [simulations](/cli/simulations), which are synthetic test runs generated by Coval.

## Submit Conversation

Submit a recorded conversation for monitoring evaluation. At least one input source is required (`--transcript-file`, `--audio-file`, `--audio-url`, or `--upload-id`); audio sources are mutually exclusive.

```bash
coval conversations submit [OPTIONS]
```

| Option | Type | Description |
|--------|------|-------------|
| `--transcript-file` | path | JSON file containing the transcript (an array of message objects). |
| `--audio-file` | path | Local audio file. The CLI base64-encodes the bytes into the request body. |
| `--audio-url` | string | Presigned URL to audio (S3, GCS, Azure Blob, or any HTTPS URL). |
| `--upload-id` | string | Reference to a prior `POST /v1/audio:upload` (`upl_<26-char ULID>`). |
| `--metric` | string | Metric ID to evaluate. Repeat for multiple. Defaults to your org's monitoring metrics. |
| `--metadata` | `key=value` | Custom metadata for filtering and conditional metrics. Repeat for multiple. |
| `--external-id` | string | External conversation ID from your system. |
| `--agent-id` | string | Agent to associate with this conversation (22-char ID). |
| `--occurred-at` | ISO 8601 | When the conversation actually occurred. |

Returns the created conversation in `PENDING` status. Poll `coval conversations get <conversation_id>` to watch evaluation progress.
```bash
# Submit a transcript with custom metrics and metadata
coval conversations submit \
  --transcript-file ./call.json \
  --metric 29BlkepvvX19ebbLDB0y6Q \
  --metric mymKvEg6ZA65srXbTX5wSM \
  --metadata campaign=summer-2026 \
  --metadata customer_id=cust-abc-123 \
  --external-id call-abc-123 \
  --agent-id gk3jK9mPq2xRt5vW8yZaBc \
  --occurred-at 2026-05-05T12:34:56Z

# Submit audio by presigned URL
coval conversations submit \
  --audio-url 'https://bucket.s3.amazonaws.com/audio.wav?X-Amz-Algorithm=...' \
  --external-id call-abc-123

# Submit a local audio file (base64-encoded into the request body)
coval conversations submit --audio-file ./recording.wav --external-id call-abc-123
```

> **Info:** For audio files larger than a few megabytes, prefer `--audio-url` or `--upload-id` over `--audio-file`, because the API Gateway request body is capped at 10 MB.

## List Conversations

```bash
coval conversations list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |

**Output columns:** ID, STATUS, EXTERNAL ID, AUDIO, OCCURRED AT

```bash
# List recent conversations
coval conversations list

# Filter by status
coval conversations list --filter 'status="COMPLETED"'

# Filter by external ID
coval conversations list --filter 'external_conversation_id="call-abc-123"'
```

## Get Conversation

```bash
coval conversations get <conversation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `conversation_id` | string | **Yes** | The conversation ID |

Returns full conversation details as JSON including transcript, status, agent and persona references, progress, and metadata.

```bash
coval conversations get conv_abc123
```

## Download Audio

Download or get the audio URL for a conversation recording.
```bash
coval conversations audio <conversation_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `conversation_id` | string | **Yes** | The conversation ID |

| Option | Type | Description |
|--------|------|-------------|
| `-o, --output` | string | File path to save audio |

```bash
# Print audio URL
coval conversations audio conv_abc123

# Download audio file
coval conversations audio conv_abc123 -o recording.wav
```

When using `-o`, a progress bar shows the download status.

## List Metrics

List all metric results for a conversation.

```bash
coval conversations metrics <conversation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `conversation_id` | string | **Yes** | The conversation ID |

**Output columns:** OUTPUT ID, METRIC ID, STATUS, VALUE, SUBVALUES

```bash
coval conversations metrics conv_abc123
```

## Get Metric Detail

Retrieve metric results for a conversation. The second argument accepts two ID types and the output adapts accordingly:

- **26-char MetricOutput ULID**: prints one row (the specific output).
- **22-char Metric definition ID**: prints every output recorded for that metric on the conversation.
```bash
coval conversations metric-detail <conversation_id> <id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `conversation_id` | string | **Yes** | The conversation ID |
| `id` | string | **Yes** | Either a 26-char MetricOutput ULID or a 22-char Metric definition ID |

By MetricOutput ULID (one row):

```bash
coval conversations metric-detail conv_abc123 01JCQR8Z9PQSTNVWXY12345678
```

By Metric definition ID (one or more rows):

```bash
coval conversations metric-detail conv_abc123 4HTX6gnqXtpexWSLNaKdC4
```

## Delete Conversation

```bash
coval conversations delete <conversation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `conversation_id` | string | **Yes** | The conversation ID to delete |

## Conversation Statuses

| Status | Description |
|--------|-------------|
| `PENDING` | Conversation is created but not yet started |
| `IN_QUEUE` | Conversation is queued for evaluation |
| `IN_PROGRESS` | Metrics are actively running against the conversation |
| `COMPLETED` | Evaluation finished successfully |
| `FAILED` | Evaluation encountered an error |
| `CANCELLED` | Evaluation was cancelled |
| `DELETED` | Conversation was deleted |

> **Info:** When using `--filter`, use the underscore-separated enum values (e.g., `status="IN_PROGRESS"`).
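As a minimal sketch of the enum rule above, the hypothetical helper `status_filter` normalizes a loosely typed status name and prints a filter expression. It accepts only the statuses listed in the table; the function name and the normalization rules are assumptions for illustration, not CLI behavior.

```bash
# Hypothetical helper: turn a status name into a --filter expression.
status_filter() {
  # Normalize "in progress" or "in-progress" to the IN_PROGRESS enum form.
  status=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr ' -' '__')
  case $status in
    PENDING|IN_QUEUE|IN_PROGRESS|COMPLETED|FAILED|CANCELLED|DELETED)
      printf 'status="%s"\n' "$status" ;;
    *)
      echo "unknown status: $1" >&2
      return 1 ;;
  esac
}

status_filter "in progress"
```

The printed expression can then be passed along, for example `coval conversations list --filter "$(status_filter 'in progress')"`.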
---

## Test Sets

Source: https://docs.coval.ai/cli/test-sets

Manage test set collections with the Coval CLI

## List Test Sets

```bash
coval test-sets list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |

**Output columns:** ID, NAME, TYPE, CASES, CREATED

```bash
coval test-sets list
```

## Get Test Set

```bash
coval test-sets get <test_set_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_set_id` | string | **Yes** | The test set ID |

```bash
coval test-sets get ts_abc123
```

## Create Test Set

```bash
coval test-sets create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Test set name |
| `--slug` | string | No | URL-friendly identifier (auto-generated if omitted) |
| `--description` | string | No | Description of the test set |
| `--type` | string | No | Test set type: `DEFAULT`, `SCENARIO`, `TRANSCRIPT`, or `WORKFLOW` |

```bash
# Create a basic test set
coval test-sets create --name "Customer Support Scenarios"

# Create with all options
coval test-sets create \
  --name "Billing Scenarios" \
  --slug "billing-scenarios" \
  --description "Test cases for billing-related inquiries" \
  --type SCENARIO
```

## Update Test Set

```bash
coval test-sets update <test_set_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_set_id` | string | **Yes** | The test set ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New name |
| `--slug` | string | New slug |
| `--description` | string | New description |

```bash
coval test-sets update ts_abc123 --name "Updated Name"
```

## Delete Test Set

```bash
coval test-sets delete <test_set_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_set_id` | string | **Yes** | The test set ID to delete |

---

## Test Cases

Source: https://docs.coval.ai/cli/test-cases

Manage individual test cases with the Coval CLI

## List Test Cases

```bash
coval test-cases list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--test-set-id` | string | — | Filter by test set ID |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |

**Output columns:** ID, INPUT, TYPE, TEST SET, CREATED

```bash
# List all test cases
coval test-cases list

# Filter by test set
coval test-cases list --test-set-id ts_abc123
```

## Get Test Case

```bash
coval test-cases get <test_case_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_case_id` | string | **Yes** | The test case ID |

```bash
coval test-cases get tc_abc123
```

## Create Test Case

Create a single test case or bulk import from stdin.

```bash
coval test-cases create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--test-set-id` | string | **Yes** | Test set to add the case to |
| `--input` | string | No | Test case input text |
| `--expected` | string | No | Expected output |
| `--description` | string | No | Test case description |
| `--stdin` | flag | No | Read test cases from stdin (JSON) |

> **Info:** You must provide exactly one of `--input` or `--stdin`. They are mutually exclusive: supplying both or neither results in an error.
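When feeding `--stdin`, each record has to be valid JSON on a single line, so values containing quotes need escaping. The sketch below is one way to build such a line in plain shell. `json_escape` and `test_case_line` are hypothetical helpers, the field names `input_str` and `expected_output_str` are the ones listed in the bulk import section of this page, and the escaping assumes values contain no newlines or control characters.

```bash
# Escape backslashes and double quotes for use inside a JSON string.
# Assumption: values contain no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print one JSON record (one line) for bulk import.
test_case_line() {
  printf '{"input_str": "%s", "expected_output_str": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

test_case_line 'Say "hello" to the agent' 'Greeting returned'
```

The output of one or more `test_case_line` calls could be piped into `coval test-cases create --test-set-id ts_abc123 --stdin`. For anything beyond simple strings, a dedicated JSON tool is a safer choice than hand-rolled escaping.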
### Single Test Case

```bash
coval test-cases create \
  --test-set-id ts_abc123 \
  --input "I need help with my order" \
  --expected "Order assistance provided" \
  --description "Basic order help request"
```

### Bulk Import from Stdin

Pass `--stdin` to read one JSON object per line:

```bash
echo '{"input_str": "I need a refund", "expected_output_str": "Refund processed", "description": "Refund request"}
{"input_str": "Where is my order?", "expected_output_str": "Order status provided", "description": "Order tracking"}' \
  | coval test-cases create --test-set-id ts_abc123 --stdin
```

Or import from a file:

```bash
cat test_cases.jsonl | coval test-cases create --test-set-id ts_abc123 --stdin
```

Each line must be valid JSON with the following fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `input_str` | string | **Yes** | Input text |
| `expected_output_str` | string | No | Expected output |
| `description` | string | No | Description |

## Update Test Case

```bash
coval test-cases update <test_case_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_case_id` | string | **Yes** | The test case ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--input` | string | New input text |
| `--expected` | string | New expected output |
| `--description` | string | New description |

```bash
coval test-cases update tc_abc123 --input "Updated input text"
```

## Delete Test Case

```bash
coval test-cases delete <test_case_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `test_case_id` | string | **Yes** | The test case ID to delete |

---

## Personas

Source: https://docs.coval.ai/cli/personas

Manage simulated personas with the Coval CLI

## List Personas

```bash
coval personas list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |

**Output columns:** ID, NAME, VOICE, LANGUAGE, CREATED

```bash
coval personas list
```

## Get Persona

```bash
coval personas get <persona_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `persona_id` | string | **Yes** | The persona ID |

```bash
coval personas get per_abc123
```

## Create Persona

```bash
coval personas create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Persona display name |
| `--voice` | string | **Yes** | Voice name (see available voices below) |
| `--language` | string | **Yes** | Language code (e.g., `en-US`) |
| `--prompt` | string | No | Persona system prompt / behavior instructions |
| `--background` | string | No | Background sound during simulation |
| `--wait-seconds` | number | No | Wait time between responses |

```bash
# Create a basic persona
coval personas create \
  --name "Frustrated Customer" \
  --voice "Aria" \
  --language "en-US"

# Create with full configuration
coval personas create \
  --name "Impatient Caller" \
  --voice "Callum" \
  --language "en-US" \
  --prompt "You are an impatient customer who wants quick answers" \
  --background "office" \
  --wait-seconds 1.5
```

### Available Voices

Alejandro, Amir, Angela, Aria, Ashwin, Autumn, Brynn, Callum, Caspian, Corwin, Darrow, Delphine, Dorian, Elara, Erika, Harry, Kieran, Layla, Lysander, Marina, Mark, Monika, Naveen, Noa, Orion, Raju, Rowan, Skye, Soren, Vera, Yossi

## Update Persona

```bash
coval personas update <persona_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `persona_id` | string | **Yes** | The persona ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |
| `--voice` | string | New voice name |
| `--language` | string | New language code |
| `--prompt` | string | New system prompt |
| `--background` | string | New background sound |
| `--wait-seconds` | number | New wait time |

```bash
coval personas update per_abc123 --voice "Brynn" --wait-seconds 2.0
```

## Delete Persona

```bash
coval personas delete <persona_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `persona_id` | string | **Yes** | The persona ID to delete |

---

## Metrics

Source: https://docs.coval.ai/cli/metrics

Manage evaluation metrics with the Coval CLI

## List Metrics

```bash
coval metrics list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression (supports metric_type, metric_name, create_time) |
| `--page-size` | number | 50 | Results per page (1-100) |
| `--order-by` | string | — | Sort field, prefix with `-` for descending |
| `--include-builtin` | flag | — | Include built-in metrics (e.g. Turn Count, Audio Duration) |

**Output columns:** ID, NAME, TYPE, CREATED

```bash
coval metrics list
```

## Get Metric

```bash
coval metrics get <metric_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `metric_id` | string | **Yes** | The metric ID |

```bash
coval metrics get met_abc123
```

## Create Metric

```bash
coval metrics create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Metric display name |
| `--description` | string | **Yes** | What this metric evaluates |
| `--type` | string | **Yes** | Metric type (see below) |
| `--prompt` | string | No | LLM evaluation prompt (required for `llm-binary`, `categorical`, `numerical` and their audio variants) |
| `--categories` | string | No | Comma-separated categories (required for `categorical`, `audio-categorical`) |
| `--min-value` | number | No | Minimum value (required for `numerical`, `audio-numerical`) |
| `--max-value` | number | No | Maximum value (required for `numerical`, `audio-numerical`) |
| `--regex-pattern` | string | No | Regex pattern to match (required for `regex`) |
| `--role` | string | No | Transcript role to match against (optional for `regex`) |
| `--match-mode` | string | No | `presence` (default) or `absence`; absence returns 1 if pattern NOT found |
| `--position` | string | No | `any` (default), `first`, or `last` message of the role |
| `--case-insensitive` | boolean | No | Enable case-insensitive matching |
| `--metadata-field-type` | string | No | Metadata field type (required for `metadata`) |
| `--metadata-field-key` | string | No | Metadata field key to extract (required for `metadata`) |
| `--min-pause-duration-seconds` | number | No | Minimum pause duration threshold (required for `pause`) |

### Metric Types

| Type | Description | Type-Specific Options |
|------|-------------|----------------------|
| `llm-binary` | Binary (yes/no) LLM judgment | `--prompt` |
| `categorical` | Categorical LLM judgment with defined options | `--prompt`, `--categories` |
| `numerical` | Numerical score from LLM judgment | `--prompt`, `--min-value`, `--max-value` |
| `audio-binary` | Binary audio analysis | `--prompt` |
| `audio-categorical` | Categorical audio analysis | `--prompt`, `--categories` |
| `audio-numerical` | Numerical audio analysis | `--prompt`, `--min-value`, `--max-value` |
| `toolcall` | Tool call success verification | — |
| `metadata` | Extract metadata field value | `--metadata-field-type`, `--metadata-field-key` |
| `regex` | Match transcript against a regex pattern | `--regex-pattern`, `--role`, `--match-mode`, `--position`, `--case-insensitive` |
| `pause` | Analyze pause durations in audio | `--min-pause-duration-seconds` |

### Examples

```bash
# LLM Binary
coval metrics create \
  --name "Issue Resolved" \
  --description "Did the agent resolve the customer issue?" \
  --type llm-binary \
  --prompt "Was the customer's issue fully resolved?"

# Categorical
coval metrics create \
  --name "Sentiment" \
  --description "Customer sentiment during the call" \
  --type categorical \
  --categories "positive,neutral,negative" \
  --prompt "What was the customer's overall sentiment?"

# Numerical
coval metrics create \
  --name "Professionalism Score" \
  --description "Rate the agent's professionalism" \
  --type numerical \
  --min-value 1 \
  --max-value 10 \
  --prompt "Rate the agent's professionalism on a scale of 1-10"

# Audio Binary
coval metrics create \
  --name "Background Noise" \
  --description "Is there excessive background noise?" \
  --type audio-binary \
  --prompt "Is there excessive background noise in the audio?"

# Audio Numerical
coval metrics create \
  --name "Audio Clarity" \
  --description "Rate the audio clarity" \
  --type audio-numerical \
  --min-value 1 \
  --max-value 5 \
  --prompt "Rate the audio clarity on a scale of 1-5"

# Tool Call
coval metrics create \
  --name "Tool Usage" \
  --description "Did the agent use the correct tool?" \
  --type toolcall

# Metadata
coval metrics create \
  --name "Response Time" \
  --description "Extract the response time from metadata" \
  --type metadata \
  --metadata-field-type "number" \
  --metadata-field-key "response_time_ms"

# Regex: basic pattern match
coval metrics create \
  --name "Greeting Check" \
  --description "Did the agent greet the customer?" \
  --type regex \
  --regex-pattern "(hello|hi|welcome|good morning)" \
  --role "agent" \
  --case-insensitive

# Regex: compliance (absence mode)
coval metrics create \
  --name "No Unauthorized Promises" \
  --description "Agent must not make unauthorized promises" \
  --type regex \
  --regex-pattern "(guarantee|promise|definitely)" \
  --role "agent" \
  --match-mode "absence" \
  --case-insensitive

# Regex: first message disclosure
coval metrics create \
  --name "Recording Disclosure" \
  --description "Agent must state recording disclosure in first message" \
  --type regex \
  --regex-pattern "this call may be recorded" \
  --role "agent" \
  --position "first" \
  --case-insensitive

# Pause
coval metrics create \
  --name "Long Pauses" \
  --description "Detect pauses longer than 3 seconds" \
  --type pause \
  --min-pause-duration-seconds 3.0
```

## Update Metric

```bash
coval metrics update <metric_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `metric_id` | string | **Yes** | The metric ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |
| `--description` | string | New description |
| `--prompt` | string | New evaluation prompt |

```bash
coval metrics update met_abc123 --prompt "Updated evaluation prompt"
```

## Delete Metric

```bash
coval metrics delete <metric_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `metric_id` | string | **Yes** | The metric ID to delete |

---

## Mutations

Source: https://docs.coval.ai/cli/mutations

Test agent variations with config overrides using the Coval CLI

Mutations let you test variations of an agent by overriding configuration values without modifying the original agent. This is useful for A/B testing prompts, parameters, or model settings.
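One way to prepare such a comparison is to generate one mutation per candidate value. The sketch below only prints the `coval mutations create` commands so they can be reviewed before anything runs. `print_temperature_mutations` is a hypothetical helper, and the `temperature` config key is an assumption carried over from the examples on this page; whether a given agent honors it depends on the agent's configuration.

```bash
# Hypothetical helper: print (do not run) one create command per temperature.
# Assumption: the agent config accepts a "temperature" override.
print_temperature_mutations() {
  agent_id=$1
  shift
  for t in "$@"; do
    printf "coval mutations create --agent-id %s --name 'Temperature %s' --config '{\"temperature\": %s}'\n" \
      "$agent_id" "$t" "$t"
  done
}

print_temperature_mutations ag_abc123 0.3 0.9
```

After checking the printed commands, they could be executed by piping the output to `sh`.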
## List Mutations

```bash
coval mutations list --agent-id <agent_id> [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--agent-id` | string | **Yes** | The parent agent ID |
| `--page-size` | number | No | Results per page (default: 50) |

**Output columns:** ID, NAME, PARAMETERS, CREATED

```bash
coval mutations list --agent-id ag_abc123
```

## Get Mutation

```bash
coval mutations get --agent-id <agent_id> <mutation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `mutation_id` | string | **Yes** | The mutation ID |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--agent-id` | string | **Yes** | The parent agent ID |

```bash
coval mutations get --agent-id ag_abc123 mut_xyz789
```

## Create Mutation

```bash
coval mutations create --agent-id <agent_id> [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--agent-id` | string | **Yes** | The parent agent ID |
| `--name` | string | **Yes** | Mutation display name |
| `--description` | string | No | Description of what this mutation changes |
| `--config` | string | No | JSON config overrides |

```bash
# Create a mutation with config overrides
coval mutations create \
  --agent-id ag_abc123 \
  --name "Higher Temperature" \
  --description "Test with increased temperature" \
  --config '{"temperature": 0.9}'

# Create a prompt variation
coval mutations create \
  --agent-id ag_abc123 \
  --name "Formal Tone" \
  --description "Agent uses formal language" \
  --config '{"prompt": "You are a formal customer service agent. Always use professional language."}'
```

## Update Mutation

```bash
coval mutations update --agent-id <agent_id> <mutation_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `mutation_id` | string | **Yes** | The mutation ID to update |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--agent-id` | string | **Yes** | The parent agent ID |
| `--name` | string | No | New display name |
| `--description` | string | No | New description |
| `--config` | string | No | New JSON config overrides |

```bash
coval mutations update --agent-id ag_abc123 mut_xyz789 \
  --config '{"temperature": 0.7}'
```

## Delete Mutation

```bash
coval mutations delete --agent-id <agent_id> <mutation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `mutation_id` | string | **Yes** | The mutation ID to delete |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--agent-id` | string | **Yes** | The parent agent ID |

```bash
coval mutations delete --agent-id ag_abc123 mut_xyz789
```

## Using Mutations in Runs

Pass mutation IDs when launching a run to test agent variations:

```bash
# Test a single mutation
coval runs launch \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456 \
  --mutation-id mut_001

# Test multiple mutations
coval runs launch \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456 \
  --mutation-ids "mut_001,mut_002,mut_003"
```

---

## API Keys

Source: https://docs.coval.ai/cli/api-keys

Manage API keys for programmatic access with the Coval CLI

API keys provide programmatic access to the Coval API. You can create keys scoped to specific environments and permissions.

> **Tip:** You can also create and manage API keys from the dashboard. See the [API Keys guide](/guides/api-keys) for instructions.
## List API Keys

```bash
coval api-keys list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order |
| `--status` | string | — | Filter by status (`active`, `revoked`, `suspended`, `expired`) |
| `--environment` | string | — | Filter by environment (`production`, `staging`, `development`) |

**Output columns:** ID, NAME, TYPE, ENV, STATUS, PERMISSIONS, LAST USED

```bash
# List all API keys
coval api-keys list

# Filter by environment
coval api-keys list --environment production

# List only active keys
coval api-keys list --status active
```

## Create API Key

```bash
coval api-keys create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Display name for the key |
| `--description` | string | No | Optional description |
| `--type` | string | **Yes** | Key type (`service` or `user`) |
| `--environment` | string | **Yes** | Target environment (`production`, `staging`, `development`) |
| `--permissions` | string | No | Comma-separated permission scopes |

> **Warning:** The full API key is only shown once at creation time. Store it securely; it cannot be retrieved later.
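Since the key cannot be retrieved later, it helps to decide where it will live before creating it. The following is a minimal sketch of writing a secret to a file that only the current user can read. `store_api_key` is a hypothetical helper, the key value is a placeholder, and how the key is extracted from the `create` output is left open because the output format is not described here. A secrets manager is usually a better home for production keys.

```bash
# Hypothetical helper: read a secret from stdin and write it to a file
# created with owner-only permissions.
store_api_key() (
  umask 077        # files created in this subshell get mode 600
  cat > "$1"
)

key_file="${TMPDIR:-/tmp}/coval_api_key.demo"
rm -f "$key_file"  # an existing file would keep its old permissions
printf '%s\n' 'example-key-not-real' | store_api_key "$key_file"
ls -l "$key_file" | cut -c1-10
```

The function body runs in a subshell, so the stricter `umask` does not leak into the calling shell.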
### Key Types

| Type | Description |
|------|-------------|
| `service` | For server-to-server integrations and CI/CD pipelines |
| `user` | For individual user access |

```bash
# Create a production service key
coval api-keys create \
  --name "CI Pipeline" \
  --type service \
  --environment production

# Create a development key with description
coval api-keys create \
  --name "Dev Testing" \
  --type user \
  --environment development \
  --description "Key for local development"
```

## Update API Key

```bash
coval api-keys update <api_key_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `api_key_id` | string | **Yes** | The API key ID to update |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--status` | string | **Yes** | New status (`active`, `revoked`, `suspended`, `expired`) |
| `--reason` | string | No | Reason for the status change |

```bash
# Revoke a key
coval api-keys update ak_abc123 --status revoked

# Revoke with a reason
coval api-keys update ak_abc123 --status revoked --reason "Key compromised"
```

## Delete API Key

```bash
coval api-keys delete <api_key_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `api_key_id` | string | **Yes** | The API key ID to delete |

```bash
coval api-keys delete ak_abc123
```

---

## Human Review

Source: https://docs.coval.ai/cli/human-review

Manage human review projects and annotations with the Coval CLI

> **Tip:** Using Claude Code? We have [skills to support human review](https://github.com/coval-ai/coval-external-skills/tree/main/skills/human-review) in your workflow.
## Review Projects

### List Review Projects

```bash
coval review-projects list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order (e.g., `-create_time`) |

**Output columns:** ID, NAME, TYPE, ASSIGNEES, SIMULATIONS, METRICS, CREATED

```bash
# List all review projects
coval review-projects list

# Sort by most recent
coval review-projects list --order-by "-create_time"

# JSON output
coval review-projects list --format json
```

### Get Review Project

```bash
coval review-projects get <project_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `project_id` | string | **Yes** | The review project ID |

Returns full project details as JSON including assignees, linked simulations, and linked metrics.

```bash
coval review-projects get 01HXYZ1234567890ABCDEF
```

### Create Review Project

```bash
coval review-projects create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Display name for the project |
| `--assignees` | string | **Yes** | Comma-separated reviewer email addresses |
| `--simulation-ids` | string | **Yes** | Comma-separated simulation output IDs |
| `--metric-ids` | string | **Yes** | Comma-separated metric IDs |
| `--description` | string | No | Project description |
| `--type` | string | No | `collaborative` or `individual` (default: `individual`) |
| `--notifications` | boolean | No | Enable email notifications (default: `true`) |

> **Info:** Creating a project auto-generates review annotations for every (simulation, metric, assignee) combination.

> **Info:** **Finding your IDs:** Run `coval metrics list` to get metric IDs and `coval simulations list` to get simulation IDs.
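Taking the note above at face value, the number of auto-generated annotations is the product of the three list sizes, which is worth estimating before creating a large project. The sketch below computes that product from the same comma-separated strings the flags accept. `count_csv` and `expected_annotations` are hypothetical helpers, and the result may not hold for `collaborative` projects, which the Project Types table describes as having one annotation per simulation-metric pair.

```bash
# Count the non-empty items in a comma-separated list.
count_csv() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Hypothetical helper: assignees x simulations x metrics.
expected_annotations() {
  echo $(( $(count_csv "$1") * $(count_csv "$2") * $(count_csv "$3") ))
}

# Two assignees, two simulations, two metrics: prints 8
expected_annotations \
  "alice@company.com,bob@company.com" \
  "sim-output-001,sim-output-002" \
  "metric-accuracy,metric-latency"
```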
```bash
# Create a collaborative review project
coval review-projects create \
  --name "Q1 Voice Agent Review" \
  --assignees "alice@company.com,bob@company.com" \
  --simulation-ids "sim-output-001,sim-output-002" \
  --metric-ids "metric-accuracy,metric-latency" \
  --type collaborative

# Create with description and notifications disabled
coval review-projects create \
  --name "Internal Audit" \
  --assignees "reviewer@company.com" \
  --simulation-ids "sim-output-003" \
  --metric-ids "metric-accuracy" \
  --description "Spot-check accuracy labels" \
  --notifications false
```

### Update Review Project

```bash
coval review-projects update <project_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `project_id` | string | **Yes** | The project ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | Updated display name |
| `--assignees` | string | Updated comma-separated reviewer emails |
| `--simulation-ids` | string | Updated comma-separated simulation IDs |
| `--metric-ids` | string | Updated comma-separated metric IDs |
| `--description` | string | Updated description |
| `--notifications` | boolean | Updated notification setting |

```bash
# Add a new assignee
coval review-projects update 01HXYZ1234567890ABCDEF \
  --assignees "alice@company.com,bob@company.com,charlie@company.com"

# Update project name
coval review-projects update 01HXYZ1234567890ABCDEF \
  --name "Q1 Voice Agent Review - Updated"
```

### Delete Review Project

```bash
coval review-projects delete <project_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `project_id` | string | **Yes** | The project ID to delete |

```bash
coval review-projects delete 01HXYZ1234567890ABCDEF
```

---

## Review Annotations

### List Review Annotations

```bash
coval review-annotations list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | — | Filter expression (e.g., `project_id="abc"`) |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | — | Sort order (e.g., `-create_time`) |

**Supported filter fields:** `simulation_output_id`, `metric_id`, `assignee`, `status` (`ACTIVE`/`ARCHIVED`), `completion_status` (`PENDING`/`COMPLETED`), `project_id`

**Output columns:** ID, SIMULATION, METRIC, ASSIGNEE, STATUS, PRIORITY

```bash
# List all annotations
coval review-annotations list

# Filter by project
coval review-annotations list --filter 'project_id="01HXYZ1234567890ABCDEF"'

# Filter pending annotations for a specific assignee
coval review-annotations list \
  --filter 'completion_status="PENDING" AND assignee="alice@company.com"'

# JSON output
coval review-annotations list --format json
```

### Get Review Annotation

```bash
coval review-annotations get <annotation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `annotation_id` | string | **Yes** | The annotation ID |

Returns full annotation details as JSON including ground-truth values, reviewer notes, and completion status.
```bash
coval review-annotations get abc123def456ghi789jklm
```

### Create Review Annotation

```bash
coval review-annotations create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--simulation-id` | string | **Yes** | Simulation output ID to link |
| `--metric-id` | string | **Yes** | Metric ID to link |
| `--assignee` | string | **Yes** | Reviewer email address |
| `--ground-truth-float` | number | No | Ground-truth numeric value (auto-completes) |
| `--ground-truth-string` | string | No | Ground-truth string value (auto-completes) |
| `--notes` | string | No | Reviewer notes |
| `--priority` | string | No | `primary` or `standard` (default: `standard`) |

```bash
# Create a basic annotation
coval review-annotations create \
  --simulation-id sim-output-abc123 \
  --metric-id metric-accuracy-001 \
  --assignee reviewer@company.com

# Create with ground truth (auto-completes)
coval review-annotations create \
  --simulation-id sim-output-abc123 \
  --metric-id metric-accuracy-001 \
  --assignee reviewer@company.com \
  --ground-truth-float 0.95 \
  --notes "Verified correct response"
```

### Update Review Annotation

```bash
coval review-annotations update <annotation_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `annotation_id` | string | **Yes** | The annotation ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--ground-truth-float` | number | Ground-truth numeric value (auto-completes) |
| `--ground-truth-string` | string | Ground-truth string value (auto-completes) |
| `--notes` | string | Reviewer notes |
| `--assignee` | string | Reassign to a different reviewer |
| `--priority` | string | `primary` or `standard` |

```bash
# Submit a ground-truth value
coval review-annotations update abc123def456ghi789jklm \
  --ground-truth-float 0.85 \
  --notes "Agent responded accurately but with slight delay"

# Reassign an annotation
coval review-annotations update abc123def456ghi789jklm \
  --assignee new-reviewer@company.com
```

### Delete Review Annotation

```bash
coval review-annotations delete <annotation_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `annotation_id` | string | **Yes** | The annotation ID to delete |

```bash
coval review-annotations delete abc123def456ghi789jklm
```

---

## Completion Statuses

| Status | Description |
|--------|-------------|
| `PENDING` | Annotation has not been reviewed yet |
| `COMPLETED` | Ground-truth value has been submitted |

## Annotation Priorities

| Priority | Description |
|----------|-------------|
| `PRIORITY_PRIMARY` | High-priority annotation; surfaces first in reviewer queues |
| `PRIORITY_STANDARD` | Default priority |

## Project Types

| Type | Description |
|------|-------------|
| `collaborative` | All reviewers share a single queue with one annotation per simulation-metric pair |
| `individual` | Each reviewer gets their own private queue and annotations |

> **Tip:** Use `collaborative` projects when building ground-truth datasets. Use `individual` projects when measuring inter-annotator agreement.

---

## Run Templates

Source: https://docs.coval.ai/cli/run-templates

Create reusable evaluation configurations with the Coval CLI

Run templates save a full evaluation configuration (agent, persona, test set, metrics, and parameters) so you can re-launch identical runs without specifying every option each time.
## List Run Templates

```bash
coval run-templates list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | none | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | none | Sort order |

**Output columns:** ID, NAME, AGENT, PERSONA, TEST SET, ITERATIONS, CONCURRENCY

```bash
# List all templates
coval run-templates list

# JSON output
coval run-templates list --format json
```

## Get Run Template

```bash
coval run-templates get <run_template_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_template_id` | string | **Yes** | The run template ID |

```bash
coval run-templates get rt_abc123
```

## Create Run Template

```bash
coval run-templates create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Display name for the template |
| `--agent-id` | string | No | Agent to evaluate |
| `--persona-id` | string | No | Persona for simulations |
| `--test-set-id` | string | No | Test set to use |
| `--metric-ids` | string | No | Comma-separated metric IDs |
| `--mutation-ids` | string | No | Comma-separated mutation IDs |
| `--iteration-count` | number | No | Number of iterations per test case |
| `--concurrency` | number | No | Max concurrent simulations |
| `--sub-sample-size` | number | No | Number of test cases to sample |
| `--sub-sample-seed` | number | No | Random seed for sampling |

```bash
# Create a basic template
coval run-templates create \
  --name "Nightly Regression" \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456

# Create a template with metrics and concurrency
coval run-templates create \
  --name "Full Evaluation" \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456 \
  --metric-ids "met_001,met_002,met_003" \
  --iteration-count 3 \
  --concurrency 5
```

## Update Run Template

```bash
coval run-templates update <run_template_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_template_id` | string | **Yes** | The run template ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |
| `--agent-id` | string | New agent ID |
| `--persona-id` | string | New persona ID |
| `--test-set-id` | string | New test set ID |
| `--metric-ids` | string | New comma-separated metric IDs |
| `--mutation-ids` | string | New comma-separated mutation IDs |
| `--iteration-count` | number | New iteration count |
| `--concurrency` | number | New concurrency limit |
| `--sub-sample-size` | number | New sample size |
| `--sub-sample-seed` | number | New sample seed |

```bash
coval run-templates update rt_abc123 \
  --concurrency 10 \
  --iteration-count 5
```

## Delete Run Template

```bash
coval run-templates delete <run_template_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `run_template_id` | string | **Yes** | The run template ID to delete |

> **Info:** Deleting a run template fails with a 409 error if it has active scheduled runs. Remove or disable the associated scheduled runs first.

```bash
coval run-templates delete rt_abc123
```

---

## Scheduled Runs

Source: https://docs.coval.ai/cli/scheduled-runs

Schedule recurring evaluation runs with the Coval CLI

Scheduled runs automatically launch evaluations on a recurring basis using a run template and a cron-style schedule expression.
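To make the cron-style expressions used on this page easier to reason about, here is a minimal sketch that checks whether a five-field expression fires at a given local time. It assumes conventional cron field order (minute, hour, day of month, month, day of week, with Sunday as 0); this page does not spell out which cron dialect Coval accepts, so treat that as an assumption. The sketch also simplifies standard cron by requiring every field to match, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Set

# Allowed (min, max) per field, in conventional five-field cron order.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def _expand(field: str, lo: int, hi: int) -> Set[int]:
    """Expand one field (`*`, `N`, `A-B`, `*/S`, `A-B/S`, comma lists) into integers."""
    values: Set[int] = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if not (lo <= start <= end <= hi):
            raise ValueError(f"field {field!r} is outside {lo}-{hi}")
        values.update(range(start, end + 1, int(step) if step else 1))
    return values


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `expr` fires at `when` (already in the schedule's timezone)."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    allowed = [_expand(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)]
    # cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    actual = [when.minute, when.hour, when.day, when.month, (when.weekday() + 1) % 7]
    return all(value in options for value, options in zip(actual, allowed))


if __name__ == "__main__":
    monday_6am = datetime(2024, 1, 1, 6, 0)  # 1 January 2024 was a Monday
    print(cron_matches("0 6 * * 1-5", monday_6am))
```

A check like this can catch a malformed expression before it is passed to `--schedule`; it is not a substitute for whatever validation the CLI itself performs.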
## List Scheduled Runs

```bash
coval scheduled-runs list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | none | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | none | Sort order |
| `--enabled` | boolean | none | Filter by enabled status (`true` or `false`) |
| `--template-id` | string | none | Filter by run template ID |

**Output columns:** ID, NAME, TEMPLATE, SCHEDULE, TIMEZONE, ENABLED, LAST RUN

```bash
# List all scheduled runs
coval scheduled-runs list

# List only enabled schedules
coval scheduled-runs list --enabled true

# Filter by template
coval scheduled-runs list --template-id rt_abc123
```

## Get Scheduled Run

```bash
coval scheduled-runs get <scheduled_run_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `scheduled_run_id` | string | **Yes** | The scheduled run ID |

```bash
coval scheduled-runs get sr_abc123
```

## Create Scheduled Run

```bash
coval scheduled-runs create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Display name |
| `--template-id` | string | **Yes** | Run template to execute |
| `--schedule` | string | **Yes** | Cron expression (e.g., `0 9 * * *`) |
| `--timezone` | string | No | IANA timezone (default: UTC) |
| `--enabled` | boolean | No | Whether the schedule is active |

```bash
# Run every day at 9am UTC
coval scheduled-runs create \
  --name "Daily Regression" \
  --template-id rt_abc123 \
  --schedule "0 9 * * *"

# Run weekdays at 6am Pacific, starting disabled
coval scheduled-runs create \
  --name "Weekday Check" \
  --template-id rt_abc123 \
  --schedule "0 6 * * 1-5" \
  --timezone "America/Los_Angeles" \
  --enabled false
```

## Update Scheduled Run

```bash
coval scheduled-runs update <scheduled_run_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `scheduled_run_id` | string | **Yes** | The scheduled run ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |
| `--schedule` | string | New cron expression |
| `--timezone` | string | New IANA timezone |
| `--enabled` | boolean | Enable or disable the schedule |

```bash
# Disable a schedule
coval scheduled-runs update sr_abc123 --enabled false

# Change schedule to hourly
coval scheduled-runs update sr_abc123 --schedule "0 * * * *"
```

## Delete Scheduled Run

```bash
coval scheduled-runs delete <scheduled_run_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `scheduled_run_id` | string | **Yes** | The scheduled run ID to delete |

```bash
coval scheduled-runs delete sr_abc123
```

---

## Dashboards

Source: https://docs.coval.ai/cli/dashboards

Create and manage dashboards and widgets with the Coval CLI

Dashboards provide customizable views for monitoring evaluation results. Each dashboard contains widgets that display charts, tables, or text.
## List Dashboards

```bash
coval dashboards list [OPTIONS]
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--filter` | string | none | Filter expression |
| `--page-size` | number | 50 | Results per page |
| `--order-by` | string | none | Sort order |

**Output columns:** ID, NAME, CREATED, UPDATED

```bash
coval dashboards list
```

## Get Dashboard

```bash
coval dashboards get <dashboard_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The dashboard ID |

```bash
coval dashboards get db_abc123
```

## Create Dashboard

```bash
coval dashboards create [OPTIONS]
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Display name |

```bash
coval dashboards create --name "Production Metrics"
```

## Update Dashboard

```bash
coval dashboards update <dashboard_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The dashboard ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |

```bash
coval dashboards update db_abc123 --name "Staging Metrics"
```

## Delete Dashboard

```bash
coval dashboards delete <dashboard_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The dashboard ID to delete |

```bash
coval dashboards delete db_abc123
```

---

## Widgets

Widgets are visual components that live inside a dashboard. All widget commands are nested under `coval dashboards widgets`.
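Widget placement is controlled by the `--grid-w`, `--grid-h`, `--grid-x`, and `--grid-y` options documented in the Create Widget section. As a hedged sketch of how those values might be computed for several widgets at once, the helper below fills rows left to right and wraps when a row is full. The 12-column grid width is an assumption (this page does not state the grid size), and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def layout_widgets(
    sizes: List[Tuple[str, int, int]], columns: int = 12
) -> Dict[str, Dict[str, int]]:
    """Place (name, width, height) widgets left to right, wrapping when a row is full."""
    placed: Dict[str, Dict[str, int]] = {}
    x = y = row_height = 0
    for name, w, h in sizes:
        if w > columns:
            raise ValueError(f"{name!r} is wider than the {columns}-column grid")
        if x + w > columns:
            # No room left on this row: start a new one below the tallest widget.
            x, y, row_height = 0, y + row_height, 0
        placed[name] = {"grid-x": x, "grid-y": y, "grid-w": w, "grid-h": h}
        x += w
        row_height = max(row_height, h)
    return placed


def widget_flags(position: Dict[str, int]) -> List[str]:
    """Turn one placement into the documented --grid-* CLI flags."""
    return [tok for key, value in position.items() for tok in (f"--{key}", str(value))]


if __name__ == "__main__":
    layout = layout_widgets(
        [("Score Trends", 6, 4), ("Latency", 6, 4), ("Detailed Report", 12, 6)]
    )
    for name, position in layout.items():
        print(name, " ".join(widget_flags(position)))
```

The flags produced for each widget can be appended to a `coval dashboards widgets create` or `update` command.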
### List Widgets

```bash
coval dashboards widgets list <dashboard_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The parent dashboard ID |

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--page-size` | number | 50 | Results per page |

**Output columns:** ID, NAME, TYPE, GRID, CREATED

```bash
coval dashboards widgets list db_abc123
```

### Get Widget

```bash
coval dashboards widgets get <dashboard_id> <widget_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The parent dashboard ID |
| `widget_id` | string | **Yes** | The widget ID |

```bash
coval dashboards widgets get db_abc123 wgt_xyz789
```

### Create Widget

```bash
coval dashboards widgets create <dashboard_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The parent dashboard ID |

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `--name` | string | **Yes** | Widget display name |
| `--type` | string | **Yes** | Widget type (see below) |
| `--config` | string | No | JSON config string or `@filepath` to read from file |
| `--grid-w` | number | No | Grid width |
| `--grid-h` | number | No | Grid height |
| `--grid-x` | number | No | Grid X position |
| `--grid-y` | number | No | Grid Y position |

### Widget Types

| Type | Description |
|------|-------------|
| `chart` | Line, bar, or area chart visualization |
| `table` | Tabular data display |
| `text` | Static text or markdown content |

```bash
# Create a chart widget
coval dashboards widgets create db_abc123 \
  --name "Score Trends" \
  --type chart \
  --config '{"metric_id": "met_001"}' \
  --grid-w 6 \
  --grid-h 4

# Create a widget with config from a file
coval dashboards widgets create db_abc123 \
  --name "Detailed Report" \
  --type table \
  --config @widget-config.json
```

### Update Widget

```bash
coval dashboards widgets update <dashboard_id> <widget_id> [OPTIONS]
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The parent dashboard ID |
| `widget_id` | string | **Yes** | The widget ID to update |

| Option | Type | Description |
|--------|------|-------------|
| `--name` | string | New display name |
| `--type` | string | New widget type |
| `--config` | string | New JSON config or `@filepath` |
| `--grid-w` | number | New grid width |
| `--grid-h` | number | New grid height |
| `--grid-x` | number | New grid X position |
| `--grid-y` | number | New grid Y position |

```bash
coval dashboards widgets update db_abc123 wgt_xyz789 \
  --grid-w 12 --grid-h 6
```

### Delete Widget

```bash
coval dashboards widgets delete <dashboard_id> <widget_id>
```

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `dashboard_id` | string | **Yes** | The parent dashboard ID |
| `widget_id` | string | **Yes** | The widget ID to delete |

```bash
coval dashboards widgets delete db_abc123 wgt_xyz789
```

---

## SDKs

Source: https://docs.coval.ai/sdks/overview

Typed clients for the Coval API in TypeScript and Python, generated from the public OpenAPI specs.

## Overview

Coval ships typed SDKs for TypeScript and Python. Both are generated from the public OpenAPI specs that power this [API reference](/api-reference/v1/introduction), so the surface area stays in sync with what the API accepts.

- [TypeScript](https://www.npmjs.com/package/@coval/sdk): `@coval/sdk` on npm.
- [Python](https://pypi.org/project/coval-sdk/): `coval-sdk` on PyPI.

> **Info:** Both SDKs are MIT-licensed and live in [coval-ai/coval-examples](https://github.com/coval-ai/coval-examples).
## Install

```bash TypeScript
npm install @coval/sdk
```

```bash Python
pip install coval-sdk
```

**Install from source**

To install directly from the source repo:

```bash TypeScript
# npm has no native "install from a git subdirectory" syntax, so clone and install:
git clone https://github.com/coval-ai/coval-examples.git
cd coval-examples/typescript-sdk
npm install
npm run build
npm link # or: npm pack && npm install ../coval-examples/typescript-sdk/coval-sdk-*.tgz
```

```bash Python
pip install "git+https://github.com/coval-ai/coval-examples.git#subdirectory=python-sdk"
```

## Quick start

```ts TypeScript
import { CovalClient } from '@coval/sdk';

const coval = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
});

const page = await coval.agents.listAgents({ pageSize: 50 });
for (const agent of page.agents ?? []) {
  console.log(agent.id, agent.display_name);
}
```

```python Python
import os

from coval_sdk import AgentsApi, ApiClient, Configuration

config = Configuration(host="https://api.coval.dev")
with ApiClient(config) as client:
    client.set_default_header("x-api-key", os.environ["COVAL_API_KEY"])
    agents = AgentsApi(client).list_agents(page_size=50)
    for a in agents.agents:
        print(a.id, a.display_name)
```

The TypeScript client exposes every v1 API resource as a property on `CovalClient`: `coval.agents`, `coval.conversations`, `coval.simulations`, `coval.traces`, `coval.metrics`, and so on (22 resources total). The Python client is split into one `*Api` class per resource (`AgentsApi`, `ConversationsApi`, `SimulationsApi`, etc.) constructed against a shared `ApiClient`.

## Auth

The Coval API gateway requires the header `x-api-key` in **lowercase**. Uppercase `X-API-Key` is rejected with `Missing API Key`.

```ts TypeScript
// Handled automatically. The CovalClient middleware sets the header for you.
const coval = new CovalClient({ apiKey: process.env.COVAL_API_KEY! });
```

```python Python
# Set the header directly on the ApiClient. Do not rely on per-scheme
# config.api_key["ApiKeyAuth"]; the bundled spec splits the security scheme
# per tag (Coval_Agents_API_ApiKeyAuth, Coval_Conversations_API_ApiKeyAuth,
# etc.), so setting a default header is the pattern that works
# across all 22 resources.
client.set_default_header("x-api-key", os.environ["COVAL_API_KEY"])
```

> **Warning:** If you see `{"message": "Missing API Key"}` from the gateway, check that the header name is lowercase. The TypeScript SDK takes care of this for you; the Python SDK leaves it to your `set_default_header` call.

## Errors

```ts TypeScript
import { CovalClient, CovalApiError, CovalNetworkError } from '@coval/sdk';

const coval = new CovalClient({ apiKey: process.env.COVAL_API_KEY! });

try {
  await coval.agents.getAgent({ agentId: 'ag_does_not_exist' });
} catch (err) {
  if (err instanceof CovalApiError) {
    console.error(`HTTP ${err.status} ${err.code ?? ''}: ${err.message}`);
    console.error(`  request_id: ${err.requestId ?? '-'}`);
    console.error(`  body: ${JSON.stringify(err.body)}`);
  } else if (err instanceof CovalNetworkError) {
    console.error(`Network failure after ${err.attempts} attempt(s): ${err.message}`);
  } else {
    throw err;
  }
}
```

```python Python
from coval_sdk.exceptions import ApiException

try:
    AgentsApi(client).get_agent(agent_id="ag_does_not_exist")
except ApiException as e:
    print(f"HTTP {e.status}: {e.reason}")
    print(f"  body: {e.body}")
```

> **Note:** **`CovalApiError` shape**: `status`, `code`, `message`, `requestId`, `body`, `url`, `method`. The `requestId` is sourced from `x-request-id`, `x-amzn-requestid`, or the response body in that order. Include it when you file a support ticket.

### Python: `--raw` fallback for legacy data

The Python client uses strict pydantic v2 models for every response, including regex patterns on IDs (e.g., `agent_id` must match `^[A-Za-z0-9]{22}$`).
If your org has historical data that predates the current spec, for example an agent created when IDs had a different shape, pydantic will raise a `ValidationError` mid-stream and you will lose the rest of the page. The fix is to call the `*_without_preload_content` variant of any method and parse the response yourself:

```python
import json

resp = AgentsApi(client).list_agents_without_preload_content(page_size=50)
body = json.loads(resp.data.decode("utf-8"))
for a in body["agents"]:
    print(a.get("id"), a.get("display_name"))
```

Every generated method has a `_without_preload_content` sibling. The TypeScript SDK does not have this issue: its types are emitted as `interface` declarations with no runtime validation, so unknown fields pass through.

## Retries

The TypeScript SDK retries with exponential backoff and jitter on transient failures. Defaults:

| Setting | Default | Notes |
|---|---|---|
| Max attempts | `3` | Includes the first attempt. |
| Base delay | `200ms` | Effective delay = `base * 2^(attempt-1)` plus jitter. |
| Max delay | `5000ms` | Cap on any single backoff. |
| Retryable statuses | `408, 429, 500, 502, 503, 504` | Plus network and transport errors. |
| `Retry-After` | Honored | Numeric seconds or HTTP-date both accepted. |

```ts
// Override
const coval = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
  retry: { maxAttempts: 5, baseDelayMs: 100, maxDelayMs: 10_000 },
});

// Disable entirely
const noRetry = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
  retry: false,
});
```

When all attempts are exhausted on a transport error, the client throws `CovalNetworkError`. When a non-retryable status (e.g., `400`, `404`) comes back, it falls through to the normal error path and raises `CovalApiError` immediately.

For Python, wrap your own calls with `tenacity` or a similar library.

## Pagination

The Coval v1 API uses `next_page_token`-style pagination.
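Because the Python client leaves both the retry policy and the pagination loop to the caller, the following standard-library sketch shows one way to combine them. It mirrors the TypeScript defaults listed in the Retries table (3 attempts, 200 ms base, 5000 ms cap, the same retryable statuses) and walks `next_page_token` until it is empty. `TransientError`, `call_with_retries`, and `iter_items` are hypothetical names, pages are treated as plain dicts (as in the raw fallback above), and the jitter amount is an assumption. With the real SDK you would catch `coval_sdk.exceptions.ApiException` and read its `status` instead.

```python
import random
import time
from typing import Any, Callable, Dict, Iterator, Optional

# Statuses the Retries table lists as retryable.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


class TransientError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0,
                  jitter: float = 0.0) -> float:
    """Seconds to wait after a failed 1-based `attempt`: base * 2^(attempt-1) + jitter, capped."""
    return min(cap, base * 2 ** (attempt - 1) + jitter)


def call_with_retries(fn: Callable[[], Any], max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `fn`, retrying retryable statuses until `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as err:
            if err.status not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise
            # Up to 100 ms of random jitter; the exact amount is an assumption.
            sleep(backoff_delay(attempt, jitter=random.uniform(0, 0.1)))


def iter_items(fetch_page: Callable[[Optional[str]], Dict[str, Any]],
               items_key: str = "agents", max_attempts: int = 3,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[Any]:
    """Yield items from every page, following `next_page_token` until it is empty."""
    token: Optional[str] = None
    while True:
        page = call_with_retries(lambda: fetch_page(token), max_attempts, sleep)
        yield from page.get(items_key) or []
        token = page.get("next_page_token")
        if not token:
            return
```

Here `fetch_page` would wrap a call such as `list_agents_without_preload_content` and decode its JSON body. Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.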
The TypeScript SDK ships an async iterator that handles the loop for you:

```ts
import { CovalClient, paginate, collectAll } from '@coval/sdk';

const coval = new CovalClient({ apiKey: process.env.COVAL_API_KEY! });

// Iterate lazily across pages
for await (const agent of paginate({
  fetchPage: (pageToken) => coval.agents.listAgents({ pageToken, pageSize: 50 }),
  items: (page) => page.agents,
  nextToken: (page) => page.next_page_token,
})) {
  console.log(agent.id, agent.display_name);
}

// Or buffer everything into an array
const allAgents = await collectAll({
  fetchPage: (pageToken) => coval.agents.listAgents({ pageToken, pageSize: 50 }),
  items: (page) => page.agents,
  nextToken: (page) => page.next_page_token,
});
```

`paginate` and `collectAll` work with any list endpoint. Provide `fetchPage`, `items`, and `nextToken` callbacks for the method you are calling.

> **Tip:** Pass `maxPages` to cap iteration. Useful when you want a quick preview of a large list without hitting the full backlog.

For Python, use a `while` loop on `next_page_token`. See the [`list_agents.py` example](https://github.com/coval-ai/coval-examples/blob/main/python-sdk/examples/list_agents.py) for the shape.

## Customization

### Base URL

Point the client at staging, a self-hosted gateway, or a regional endpoint:

```ts TypeScript
const coval = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
  baseUrl: 'https://staging.api.coval.dev',
});
```

```python Python
config = Configuration(host="https://staging.api.coval.dev")
```

### Middleware (TypeScript)

Inject request and response middleware after the auth and error layers. Useful for logging, OpenTelemetry spans, or attaching request IDs you control:

```ts
const coval = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
  middleware: [
    {
      async pre(ctx) {
        console.log(`-> ${ctx.init.method} ${ctx.url}`);
      },
      async post(ctx) {
        console.log(`<- ${ctx.response.status} ${ctx.url}`);
      },
    },
  ],
});
```

### Custom fetch

Swap the underlying `fetch` implementation.
For example, to use `undici` with keepalive in a long-running Node process, or to stub network calls in tests:

```ts
import { Agent, fetch as undiciFetch } from 'undici';
import { CovalClient } from '@coval/sdk';

const dispatcher = new Agent({ keepAliveTimeout: 60_000 });

const coval = new CovalClient({
  apiKey: process.env.COVAL_API_KEY!,
  fetch: ((url, init) => undiciFetch(url, { ...init, dispatcher })) as typeof fetch,
});
```

## Versioning

Both SDKs are generated from the same OpenAPI specs that power this API reference.

- **Minor version bumps** (`0.2.x` → `0.3.0`) reflect spec evolution: new endpoints, new optional fields, additional response shapes. Generally safe to upgrade.
- **Major version bumps** (`0.x` → `1.0`) signal breaking changes in the client surface (`CovalClient` options, error class shapes, pagination helper signatures).

To regenerate locally:

```bash
git clone https://github.com/coval-ai/coval-examples.git
cd coval-examples
node scripts/bundle-spec.mjs   # pull and bundle the latest specs
bash scripts/generate-sdks.sh  # regenerate both clients
```

## Going deeper

- [Source repo](https://github.com/coval-ai/coval-examples): Full SDK source, examples folder, and the generation scripts. Open issues here.
- [TypeScript examples](https://github.com/coval-ai/coval-examples/tree/main/typescript-sdk/examples): `list-agents.ts`, `submit-conversation.ts`. Runnable end-to-end samples.
- [Python examples](https://github.com/coval-ai/coval-examples/tree/main/python-sdk/examples): `list_agents.py` with `--raw` fallback pattern for legacy data.
- [OpenAPI specs](/api-reference/v1/introduction): The specs that power both SDKs are published at `https://api.coval.dev/v1/openapi`.
- [API keys](/guides/api-keys): How to mint and rotate API keys from the dashboard.
- [CLI](/cli/agents): Prefer the shell? `coval-cli` covers the same surface area.

---

## Evaluations for Agents

Source: https://docs.coval.ai/agents/overview

Give your AI coding agents the tools and knowledge to evaluate AI quality through Skills, MCP, CLI, or API.
Coval works with any AI coding agent. Whether you use Claude Code, Cursor, Windsurf, Codex, or another tool, your agent can run evaluations, manage test sets, and score AI outputs through the interface that fits your workflow.

## Get Started

- [Agent Skills](/agents/skills): Install evaluation expertise with one command. Your agent learns how to build test sets, select metrics, and run evals.
- [Guided Onboarding](/agents/onboarding): Run `/onboard` and your agent walks you through setting up a complete evaluation from scratch.
- [MCP Server](/mcp/overview): Connect the Coval MCP server for native tool access in Claude Desktop, Cursor, and other MCP clients.
- [CLI](/cli/overview): The Coval CLI gives agents structured JSON output for scripting evaluations in any terminal.

## Three Ways Agents Use Coval

| Layer | What It Does | Install |
|-------|-------------|---------|
| **Agent Skills** | Teaches agents *how* to evaluate well (knowledge) | `npx skills add coval-ai/coval-external-skills` |
| **MCP Server** | Gives agents *tools* to execute evaluations | `npx coval-mcp` |
| **CLI** | Runs evaluations from *any terminal* with JSON output | `brew install coval-ai/tap/coval` |

Skills and MCP are complementary: Skills give your agent the expertise to design good evaluations, while MCP and CLI let it execute them. Use whichever combination fits your workflow.

## Supported Agents

Supported agents integrate through one of these combinations:

- Skills + MCP + CLI
- Skills + MCP
- Skills + CLI
- CLI + API

## AI-Readable Documentation

Coval publishes machine-readable documentation following the [llms.txt standard](https://llmstxt.org):

- **[llms.txt](https://docs.coval.ai/llms.txt)**: Curated index of all documentation pages (~7KB)
- **[llms-full.txt](https://docs.coval.ai/llms-full.txt)**: Complete documentation in a single file (~386KB)

Point your agent at these files when it needs context about Coval's platform, API, or concepts.
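For scripts that need the page index rather than the raw text, a minimal sketch of parsing an llms.txt-style file follows. It assumes entries look like `- [Title](url): description` under `##` section headings, which is the llms.txt convention; the exact layout of the live file is not guaranteed here, and `parse_llms_txt` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Matches index entries of the form "- [Title](https://...): description".
ENTRY = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<description>.*))?$"
)
# Matches level-2 section headings such as "## Getting Started".
HEADING = re.compile(r"^##\s+(?P<section>.+)$")


def parse_llms_txt(text: str) -> List[Dict[str, str]]:
    """Return one dict per linked page, tagged with the section it appears under."""
    entries: List[Dict[str, str]] = []
    section = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        heading = HEADING.match(line)
        if heading:
            section = heading.group("section")
            continue
        entry = ENTRY.match(line)
        if entry:
            entries.append({
                "section": section,
                "title": entry.group("title"),
                "url": entry.group("url"),
                "description": entry.group("description") or "",
            })
    return entries


if __name__ == "__main__":
    sample = (
        "## Getting Started\n"
        "- [Quick Start](https://docs.coval.ai/getting-started/quick-start): "
        "Set up your first evaluation in 5 minutes\n"
    )
    for page in parse_llms_txt(sample):
        print(page["section"], "|", page["title"], "|", page["url"])
```

To run it against the published index, download `https://docs.coval.ai/llms.txt` (for example with `urllib.request.urlopen`) and pass the decoded text to `parse_llms_txt`.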
---

## Guided Onboarding

Source: https://docs.coval.ai/agents/onboarding

Run /onboard to set up a complete AI evaluation interactively, from connecting your agent to viewing results.

The `/onboard` skill guides you through setting up your first Coval evaluation step by step. Your AI coding agent asks questions about your use case, then creates all the resources and launches the evaluation using the Coval CLI.

## Quick Start

```bash
# 1. Install Coval skills
npx skills add coval-ai/coval-external-skills

# 2. Open your AI coding agent (Claude Code, Cursor, etc.)

# 3. Run the onboarding skill
/onboard
```

The skill handles everything from there, including installing the CLI and authenticating if you haven't already.

## What Gets Created

The onboarding flow creates a complete evaluation setup:

| Resource | What It Is |
|----------|-----------|
| **Agent** | Your AI agent connected to Coval (voice, chat, SMS, or WebSocket) |
| **Persona** | A simulated caller with voice, language, and behavior settings |
| **Test Set** | 3 test cases: happy path, edge case, and compliance scenario |
| **Metrics** | Use-case-specific metrics plus built-in audio and conversation metrics |
| **Run Template** | Reusable configuration bundling everything above |
| **Evaluation Run** | Your first evaluation, launched and monitored |

## The Flow

The skill walks through 6 phases:

1. **Setup**: Checks if the Coval CLI is installed and you're authenticated. Guides installation if needed. Detects any existing resources so you don't duplicate work.
2. **Connect Agent**: Asks your agent type (voice, chat, SMS, WebSocket) and connection details (phone number or endpoint URL).
3. **Discover Use Case**: Asks what your agent does (customer support, insurance, healthcare, sales, etc.) and what language it speaks. Creates a persona tailored to your vertical.
4. **Build Test Cases**: Generates 3 test cases based on your use case: a happy path, an edge case, and a compliance scenario. Each includes expected behaviors your agent should follow.
5. **Select Metrics**: Recommends metrics based on your use case and agent type. Includes custom LLM judge metrics, audio quality metrics (for voice), and built-in metrics like latency and sentiment.
6. **Launch and Review**: Bundles everything into a reusable template, launches the evaluation, watches progress, and presents results with scores per test case.

## Supported Verticals

The skill includes templates for these use cases, with pre-built personas, test cases, and metrics for each:

| Vertical | Persona | Custom Metric |
|----------|---------|---------------|
| Customer Support | Jordan | Issue Resolution |
| Scheduling & Booking | Taylor | Booking Accuracy |
| Sales | Morgan | Sales Accuracy |
| Insurance Claims | Sarah | Identity Verification |
| Healthcare Intake | Michael | HIPAA Compliance |
| Restaurant Orders | Alex | Order Accuracy |
| Debt Collection | Chris | Regulatory Compliance |
| IT Helpdesk | Pat | Ticket Resolution |

If your use case doesn't match a vertical, the skill uses a general-purpose template and adapts based on your description.

## After Onboarding

Once your first evaluation completes, you can:

- **Add more test cases**: `coval test-cases create --test-set-id {id} --input "..."`
- **Schedule recurring runs**: `coval scheduled-runs create --name "Weekly Check" --template-id {id} --schedule "0 9 * * 1"`
- **Listen to recordings**: `coval simulations audio {sim_id} -o recording.wav`
- **Iterate on metrics**: Adjust prompts based on what you learned from results
- **View in dashboard**: Visit `app.coval.dev` to see full results with transcripts

## Requirements

- An AI coding agent that supports skills (Claude Code, Cursor, Windsurf, Codex, etc.)
- An AI agent to evaluate (voice or chat, accessible via phone number or endpoint)
- A Coval account ([sign up at coval.dev](https://coval.dev))

---

## Agent Skills

Source: https://docs.coval.ai/agents/skills

Install evaluation expertise into your AI coding agent with one command.

Agent Skills are modular knowledge packages that teach your AI coding agent how to evaluate effectively. They follow the open [Agent Skills standard](https://agentskills.io) and work with Claude Code, Cursor, Windsurf, Codex, and 40+ other agents.

## Install

```bash
npx skills add coval-ai/coval-external-skills
```

This installs all Coval skills into your agent's skills directory. Skills are loaded on demand; only the name and description are in memory until activated.

## Skills vs MCP vs CLI

| | Skills | MCP Server | CLI |
|---|--------|-----------|-----|
| **What it provides** | Knowledge (how to evaluate well) | Tools (execute operations) | Operations (run from terminal) |
| **Install** | `npx skills add coval-ai/coval-external-skills` | `npx coval-mcp` | `brew install coval-ai/tap/coval` |
| **Use when** | Agent needs to *design* evaluations | Agent needs to *run* evaluations natively | Scripting, CI/CD, any terminal |
| **Works with** | Any agent supporting skills | MCP-compatible clients | Any shell environment |

We recommend **Skills + CLI** for the most complete experience. Skills teach your agent what to create, and the CLI executes it with structured JSON output.

## Available Skills

### Onboarding

- [onboard](/agents/onboarding): Interactive guided setup for your first evaluation. Walks through connecting an agent, creating personas, building test cases, selecting metrics, and launching a run.
### Runs

| Skill | Description |
|-------|-------------|
| **launch-run** | Launch an evaluation run against an AI agent |
| **watch-run** | Monitor a run's progress with live status updates |
| **quick-eval** | Full workflow: launch, watch, and summarize results in one go |

### Simulations

| Skill | Description |
|-------|-------------|
| **get-results** | Retrieve and analyze simulation results from a run |
| **download-audio** | Download audio recordings from voice simulations |

### Resources

| Skill | Description |
|-------|-------------|
| **coval-resources** | Complete reference for all Coval resources, their hierarchy, relationships, API endpoints, and ID formats |

### Dashboards

| Skill | Description |
|-------|-------------|
| **create-dashboard** | Create a new dashboard and populate it with metric widgets |
| **add-widget** | Add a chart, table, or text widget to a dashboard |
| **manage-dashboard** | Get, update, or delete a dashboard |
| **manage-widgets** | List, update, resize, or delete widgets |
| **list-dashboards** | List all dashboards with filtering |

### Test Cases

| Skill | Description |
|-------|-------------|
| **huggingface-import** | Import datasets from HuggingFace and convert them to Coval test sets |

### Migrations

| Skill | Description |
|-------|-------------|
| **migrate-bluejay** | Migrate configuration from Bluejay voice AI testing platform to Coval |

### Traces

| Skill | Description |
|-------|-------------|
| **setup-tracing** | Add Coval OpenTelemetry tracing to an agent and validate one real trace |
| **optimize-trace-observability** | Improve span coverage and attributes after basic trace ingestion works |
| **configure-trace-metrics** | Recommend and create custom trace metrics from real span data |
| **debug-traces** | Troubleshoot missing, sparse, duplicated, or incorrectly correlated traces |

See [Tracing Skills](/concepts/simulations/traces/tracing-skills) for copy-paste prompts and validation guidance.
### Human Review

| Skill | Description |
|-------|-------------|
| **review-llm-annotations-and-improve-prompt** | Calculate agreement between human and machine labels, then propose improved metric prompts |

## How Skills Work

Skills use **progressive disclosure** to stay lightweight:

1. **At startup** (~100 tokens per skill): Only the `name` and `description` are loaded
2. **When activated** (under 5000 tokens): The full skill instructions load when your agent detects a relevant task
3. **On demand**: Reference files (templates, examples) load only when needed

This means having all Coval skills installed adds minimal overhead to your agent's context.

## Skill Structure

Each skill follows the [Agent Skills spec](https://agentskills.io/specification):

```
skill-name/
├── SKILL.md       # Instructions (required)
├── references/    # Templates, detailed docs (optional)
├── scripts/       # Executable code (optional)
└── assets/        # Static resources (optional)
```

## Source Code

All skills are open source: [github.com/coval-ai/coval-external-skills](https://github.com/coval-ai/coval-external-skills)

---

## Context7

Source: https://docs.coval.ai/agents/context7

Access up-to-date Coval documentation directly inside AI coding agents via Context7.

[Context7](https://context7.com) indexes library documentation and serves it to AI coding agents through an MCP server. Instead of relying on training data that may be outdated, your agent pulls live Coval docs on demand.

## Why Use Context7

AI coding agents often hallucinate API details or reference outdated patterns.
Context7 solves this by fetching current documentation at query time: - **Always current** — pulls from the latest published docs, not stale training data - **Code-first** — returns relevant code snippets and examples, not walls of text - **Zero config** — works out of the box with any MCP-compatible agent ## Coval on Context7 Coval's full documentation is indexed and available: - [Coval on Context7](https://context7.com/llmstxt/coval_dev_llms_txt): Browse the Coval library on Context7 — includes CLI commands, API examples, metric configuration, and more. ## Install the Context7 MCP Server Add Context7 to your agent's MCP configuration: **Claude Code:** ```bash claude mcp add context7 -- npx -y @upstash/context7-mcp@latest ``` **Cursor:** Add to `.cursor/mcp.json`: ```json { "mcpServers": { "context7": { "command": "npx", "args": ["-y", "@upstash/context7-mcp@latest"] } } } ``` **Windsurf:** Add to your MCP config file: ```json { "mcpServers": { "context7": { "command": "npx", "args": ["-y", "@upstash/context7-mcp@latest"] } } } ``` ## Usage Once installed, your agent has two tools available: ### 1. Resolve Library ID Find Coval's library ID by searching for it: ``` resolve-library-id("Coval") → /llmstxt/coval_dev_llms_txt ``` ### 2. Query Documentation Ask questions and get back relevant code snippets and docs: ``` query-docs("/llmstxt/coval_dev_llms_txt", "how to create a metric with the CLI") ``` Your agent calls these tools automatically when it needs Coval context. You can also prompt it explicitly: > "Use Context7 to look up how Coval metrics work" ## Example Workflow Here's what happens when your agent uses Context7 with Coval: **Step: You ask a question** "How do I launch an evaluation run with the Coval CLI?" **Step: Agent resolves the library** The agent calls `resolve-library-id("Coval")` and gets `/llmstxt/coval_dev_llms_txt`. **Step: Agent queries the docs** It calls `query-docs` with your question and gets back current CLI examples and flags. 
**Step: Agent responds with accurate info**

You get a response grounded in the latest Coval documentation, not training data.

## Context7 vs Other Approaches

| Approach | Freshness | Setup | Best For |
|----------|-----------|-------|----------|
| **Context7 MCP** | Live (latest docs) | One command | Quick lookups, any MCP agent |
| **Agent Skills** | Updated on install | `npx skills add` | Deep evaluation workflows |
| **Coval MCP Server** | Live (API calls) | `npx @covalai/mcp-server` | Executing operations directly |
| **llms.txt** | Live (latest docs) | Point agent at URL | Manual context loading |

Context7 complements Skills and the Coval MCP server. Use Context7 when your agent needs to **look something up**. Use Skills when it needs to **know how to evaluate well**. Use the Coval MCP server when it needs to **execute operations**.

---

## MCP Server

Source: https://docs.coval.ai/mcp/overview

Use Coval directly from Claude Desktop, Cursor, and other MCP-compatible clients

The **Coval MCP Server** enables AI assistants to interact with Coval's evaluation APIs through the [Model Context Protocol](https://modelcontextprotocol.io).

![Coval MCP in Claude Desktop](/images/mcp/claude-desktop.png)

## What You Can Do

With the MCP server, you can ask Claude or Cursor to:

- **Launch evaluations** - "Run the billing test set against my support agent"
- **Monitor runs** - "What's the status of my latest evaluation?"
- **Manage agents** - "Create a new voice agent for customer service"
- **View metrics** - "Show me the metrics for run abc123"
- **Organize tests** - "List my test sets and their configurations"

## Quick Start

**Step: Get your API key**

Go to [Coval Dashboard](https://app.coval.dev/settings) and copy your API key.

**Step: Configure your MCP client**

Add the Coval MCP server to your client's config file:

**Hosted (Recommended):** Connects to Coval's managed MCP endpoint. No local server to maintain.
```json { "mcpServers": { "coval": { "command": "npx", "args": [ "-y", "mcp-remote", "https://mcp.coval.dev/mcp", "--header", "X-API-Key: ${COVAL_API_KEY}" ], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` **Local (NPX):** Runs the MCP server locally on your machine. Useful for custom API endpoints or development. ```json { "mcpServers": { "coval": { "command": "npx", "args": ["-y", "@covalai/mcp-server"], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` See [Installation](/mcp/installation) for config file locations per client and platform. **Step: Restart your client** Fully quit and reopen your MCP client to load the server. **Step: Start using Coval** Ask Claude: "List my Coval agents" or "Show my recent evaluation runs" ## Available Tools The MCP server exposes 18 tools across 6 categories: | Category | Tools | Description | |----------|-------|-------------| | **Runs** | `list_runs`, `get_run`, `create_run`, `delete_run` | Launch and monitor evaluations | | **Agents** | `list_agents`, `get_agent`, `create_agent`, `update_agent` | Manage agent configurations | | **Test Sets** | `list_test_sets`, `get_test_set`, `create_test_set` | Organize test cases | | **Test Cases** | `list_test_cases`, `get_test_case`, `create_test_case`, `update_test_case` | Manage individual test cases | | **Metrics** | `list_metrics`, `get_metric` | View evaluation metrics | | **Personas** | `list_personas`, `get_persona` | Configure simulated users | - [Tools Reference](/mcp/tools): See complete parameter documentation for all tools ## Example Usage Once connected, you can ask Claude or Cursor things like: - "Show me my recent evaluation runs" - "List all my agents" - "Run an evaluation of my customer-support-agent against the billing-inquiries test set" - "What are the metrics for run abc123?" - "Create a new test set for voice agent scenarios" ## Requirements - Node.js 20+ - Coval API key - MCP-compatible client (Claude Desktop, Cursor, etc.) 
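The hosted configuration from the Quick Start can also be merged into an existing client config with a short script, which avoids hand-editing JSON. This is a minimal sketch, not part of the Coval tooling: `add_coval_server` is a hypothetical helper, and only the entry shape (command, args, env) is taken from the hosted example above.

```python
import json
from copy import deepcopy
from typing import Any, Dict

# Endpoint and header format copied from the hosted configuration above.
HOSTED_URL = "https://mcp.coval.dev/mcp"
HEADER = "X-API-Key: ${COVAL_API_KEY}"


def add_coval_server(config: Dict[str, Any], api_key: str) -> Dict[str, Any]:
    """Return a copy of an MCP client config with the hosted Coval entry added.

    Hypothetical helper: existing servers under "mcpServers" are preserved.
    """
    updated = deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    servers["coval"] = {
        "command": "npx",
        "args": ["-y", "mcp-remote", HOSTED_URL, "--header", HEADER],
        "env": {"COVAL_API_KEY": api_key},
    }
    return updated


# Example: start from an empty config and print the JSON to paste or write out.
print(json.dumps(add_coval_server({}, "your_api_key_here"), indent=2))
```

Because the helper works on a copy, the original config object is left untouched until you decide to write the result back to the client's config file.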
## Support - [GitHub Issues](https://github.com/coval-ai/mcp-server/issues) - [Coval Support](mailto:support@coval.dev) --- ## Installation Source: https://docs.coval.ai/mcp/installation Configure the Coval MCP server for your AI assistant ## Config File Locations **macOS:** **Claude Desktop:** `~/Library/Application Support/Claude/claude_desktop_config.json` **Cursor:** `.cursor/mcp.json` in your project root **Claude Code:** `~/.claude/settings.json` or `.mcp.json` in your project root **Windows:** **Claude Desktop:** `%APPDATA%\Claude\claude_desktop_config.json` **Cursor:** `.cursor\mcp.json` in your project root **Claude Code:** `%USERPROFILE%\.claude\settings.json` or `.mcp.json` in your project root **Linux:** **Claude Desktop:** `~/.config/Claude/claude_desktop_config.json` **Cursor:** `.cursor/mcp.json` in your project root **Claude Code:** `~/.claude/settings.json` or `.mcp.json` in your project root ## Hosted Server (Recommended) Connects to Coval's managed MCP endpoint via `mcp-remote`. No local server to maintain — always up to date. 
**Claude Desktop:** Add to your Claude Desktop config file: ```json { "mcpServers": { "coval": { "command": "npx", "args": [ "-y", "mcp-remote", "https://mcp.coval.dev/mcp", "--header", "X-API-Key: ${COVAL_API_KEY}" ], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` **Cursor:** Add to `.cursor/mcp.json` in your project: ```json { "mcpServers": { "coval": { "command": "npx", "args": [ "-y", "mcp-remote", "https://mcp.coval.dev/mcp", "--header", "X-API-Key: ${COVAL_API_KEY}" ], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` **Claude Code:** Run in your terminal: ```bash claude mcp add coval -- npx -y mcp-remote https://mcp.coval.dev/mcp --header "X-API-Key: YOUR_API_KEY_HERE" ``` Or add to your `.mcp.json`: ```json { "mcpServers": { "coval": { "command": "npx", "args": [ "-y", "mcp-remote", "https://mcp.coval.dev/mcp", "--header", "X-API-Key: ${COVAL_API_KEY}" ], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` > **Warning:** Restart your MCP client after modifying the config file. For Claude Desktop, fully quit from the menu bar — don't just close the window. ## Local Server (NPX) Runs the Coval MCP server locally on your machine via npm. Useful if you need a custom API endpoint or want to develop against the server. 
```bash
npx @covalai/mcp-server
```

**Claude Desktop:**

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

**Cursor:** Add to `.cursor/mcp.json` in your project:

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `COVAL_API_KEY` | **Yes** | - | Your API key from [dashboard](https://app.coval.dev/settings) |
| `COVAL_API_BASE_URL` | No | `https://api.coval.dev/v1` | Custom API endpoint (local server only) |
| `LOG_LEVEL` | No | `info` | Logging level: `debug`, `info`, `warn`, `error` (local server only) |

## Local Development

Clone and run from source:

```bash
# Clone the repository; git creates a directory named after the repo
git clone https://github.com/coval-ai/mcp-server
cd mcp-server
npm install
npm run build

# Set your API key
export COVAL_API_KEY=your_api_key_here

# Run the server
npm start
```

### Testing with MCP Inspector

```bash
npm run inspector
```

This launches a web UI for testing tool calls interactively.

---

## Tools Reference

Source: https://docs.coval.ai/mcp/tools

Complete reference for all MCP server tools

## Run Management

### list_runs

List evaluation runs with filtering and pagination.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token from previous response |
| `order_by` | string | No | Sort order (e.g., `-create_time` for newest first) |
| `filter` | string | No | Filter expression (e.g., `status="COMPLETED"`) |

### get_run

Get detailed information about a specific run.
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `run_id` | string | **Yes** | The unique run ID | Returns run details including status, progress, and metrics (if completed). ### create_run Launch a new evaluation run. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `agent_id` | string | **Yes** | Agent ID from `list_agents` | | `persona_id` | string | **Yes** | Persona ID from `list_personas` | | `test_set_id` | string | **Yes** | Test set ID from `list_test_sets` | | `metric_ids` | string[] | No | Specific metrics to evaluate | | `options.iteration_count` | number | No | Iterations per test case (1-10, default: 1) | | `options.concurrency` | number | No | Parallel simulations (1-5, default: 1) | | `metadata` | object | No | Custom metadata for tracking | --- ## Agent Management ### list_agents List all configured agents. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `page_size` | number | No | Results per page (1-100, default: 50) | | `page_token` | string | No | Pagination token | | `order_by` | string | No | Sort order | | `filter` | string | No | Filter by `model_type`, `display_name`, etc. | ### get_agent Get detailed configuration for a specific agent. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `agent_id` | string | **Yes** | Agent ID from `list_agents` | ### create_agent Create a new agent configuration. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `display_name` | string | **Yes** | Human-readable name (1-200 chars) | | `model_type` | string | **Yes** | Agent type (see below) | | `phone_number` | string | No | E.164 format for voice agents | | `endpoint` | string | No | Webhook or WebSocket URL | | `prompt` | string | No | System prompt/instructions | | `metadata` | object | No | Custom metadata | **Model Types:** - `MODEL_TYPE_VOICE` - Inbound voice - `MODEL_TYPE_OUTBOUND_VOICE` - Outbound voice - `MODEL_TYPE_CHAT` - Chat/text - `MODEL_TYPE_SMS` - SMS messaging - `MODEL_TYPE_WEBSOCKET` - WebSocket ### update_agent Update an existing agent configuration. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `agent_id` | string | **Yes** | Agent to update | | `display_name` | string | No | New name | | `phone_number` | string | No | New phone number | | `endpoint` | string | No | New endpoint URL | | `prompt` | string | No | New system prompt | | `metadata` | object | No | New metadata | --- ## Test Set Management ### list_test_sets List all test sets available for evaluation. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `page_size` | number | No | Results per page (1-100, default: 50) | | `page_token` | string | No | Pagination token | | `order_by` | string | No | Sort order | | `filter` | string | No | Filter expression | ### get_test_set Get detailed information about a test set. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `test_set_id` | string | **Yes** | Test set ID from `list_test_sets` | ### create_test_set Create a new test set. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `display_name` | string | **Yes** | Test set name (1-100 chars) | | `slug` | string | No | URL-friendly ID (auto-generated if omitted) | | `description` | string | No | Test set description | | `test_set_type` | string | No | `DEFAULT`, `SCENARIO`, `TRANSCRIPT`, or `WORKFLOW` | | `test_set_metadata` | object | No | Configuration metadata | | `parameters` | object | No | Test parameterization | --- ## Test Case Management ### list_test_cases List test cases with optional filtering by test set. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `test_set_id` | string | No | Filter by test set ID | | `page_size` | number | No | Results per page (1-100, default: 50) | | `page_token` | string | No | Pagination token | | `order_by` | string | No | Sort order | | `filter` | string | No | Filter expression | ### get_test_case Get detailed information about a test case. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `test_case_id` | string | **Yes** | Test case ID from `list_test_cases` | ### create_test_case Create a new test case in a test set. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `test_set_id` | string | **Yes** | Test set to add the case to | | `display_name` | string | **Yes** | Test case name | | `description` | string | No | Test case description | | `input` | object | No | Input data for the test | | `expected_output` | object | No | Expected output for validation | | `metadata` | object | No | Custom metadata | ### update_test_case Update an existing test case. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `test_case_id` | string | **Yes** | Test case to update | | `display_name` | string | No | New name | | `description` | string | No | New description | | `input` | object | No | New input data | | `expected_output` | object | No | New expected output | | `metadata` | object | No | New metadata | --- ## Metrics ### list_metrics List available evaluation metrics. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `page_size` | number | No | Results per page (1-100, default: 50) | | `page_token` | string | No | Pagination token | | `order_by` | string | No | Sort order | | `filter` | string | No | Filter expression | | `include_builtin` | boolean | No | Include built-in metrics | ### get_metric Get detailed configuration for a specific metric. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `metric_id` | string | **Yes** | Metric ID from `list_metrics` | --- ## Personas ### list_personas List available simulated personas for testing. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `page_size` | number | No | Results per page (1-100, default: 50) | | `page_token` | string | No | Pagination token | | `order_by` | string | No | Sort order | | `filter` | string | No | Filter expression | ### get_persona Get detailed configuration for a specific persona. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `persona_id` | string | **Yes** | Persona ID from `list_personas` | Returns persona configuration including voice settings, language, and behavior. 
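As a recap of the `create_run` parameters documented under Run Management, the required IDs and numeric ranges can be checked before the tool is called. The sketch below is illustrative only: `build_create_run_args` is a hypothetical helper, not part of the Coval MCP server, and the nested `options` object is an assumption inferred from the dotted parameter names `options.iteration_count` and `options.concurrency`.

```python
from typing import Any, Dict, List, Optional


def build_create_run_args(
    agent_id: str,
    persona_id: str,
    test_set_id: str,
    metric_ids: Optional[List[str]] = None,
    iteration_count: int = 1,
    concurrency: int = 1,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble create_run arguments, enforcing the documented constraints."""
    # agent_id, persona_id and test_set_id are the required parameters.
    required = {
        "agent_id": agent_id,
        "persona_id": persona_id,
        "test_set_id": test_set_id,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} is required")

    # Documented ranges: iterations 1-10, parallel simulations 1-5.
    if not 1 <= iteration_count <= 10:
        raise ValueError("iteration_count must be between 1 and 10")
    if not 1 <= concurrency <= 5:
        raise ValueError("concurrency must be between 1 and 5")

    args: Dict[str, Any] = dict(required)
    # Assumption: dotted names map to a nested "options" object.
    args["options"] = {
        "iteration_count": iteration_count,
        "concurrency": concurrency,
    }
    # Optional parameters are included only when provided.
    if metric_ids:
        args["metric_ids"] = list(metric_ids)
    if metadata:
        args["metadata"] = dict(metadata)
    return args
```

Validating locally means an out-of-range value such as `concurrency=6` fails fast with a clear message instead of surfacing as a tool-call error.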
---

## Getting More Out of Coval with Claude (MCP Guide)

Source: https://docs.coval.ai/mcp/beginners-guide

A practical guide for Coval UI users who want to work faster using Claude and the MCP server

If you already use the Coval UI, you know how much you can do, but there's a faster way to work. Coval's MCP server connects Claude directly to your Coval workspace, so you can create and manage evaluations just by describing what you want.

No clicking through menus. No copy-pasting prompts. Just tell Claude what you need.

---

## What you can do

### Build test sets faster

Instead of manually entering test cases one by one, describe your scenarios to Claude: "Create a test set for a billing support bot — include cases for refund requests, subscription changes, and payment failures." Claude generates and adds them directly to your workspace.

### Create and refine metrics

Describe the behavior you want to evaluate in plain language. Claude can draft a Composite Evaluation criteria set, and you can iterate on it conversationally until it captures exactly what matters.

### Trigger simulation runs

Ask Claude to kick off a run against a specific agent and test set. No need to navigate to the UI: Claude handles it and can summarize results when it finishes.

### Check results and debug

Ask "which test cases are failing?" or "what's the pass rate on my escalation metric?" Claude pulls the data and explains what it finds.

---

## How to get set up

This setup requires Claude Desktop (the downloadable app); it doesn't run in the browser at claude.ai.

### Option 1: Via the app

1. Download Claude Desktop at [anthropic.com/download](https://anthropic.com/download)
2. Open **Settings → Developer → Edit Config**
3. Paste in the Coval MCP server config (copy from the [Installation guide](/mcp/installation))
4. Restart Claude Desktop

### Option 2: Via terminal

```bash
# 1. Install Claude Desktop (if you haven't already)
#    Download from anthropic.com/download and run the installer

# 2. Open the Claude Desktop config file
open ~/Library/Application\ Support/Claude/claude_desktop_config.json
```

Add the Coval MCP server entry into the `mcpServers` section:

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://mcp.coval.dev/mcp",
        "--header",
        "X-API-Key: ${COVAL_API_KEY}"
      ],
      "env": {
        "COVAL_API_KEY": "your-api-key-here"
      }
    }
  }
}
```

```bash
# 3. Restart Claude Desktop
```

Your Coval API key can be found in the Coval UI under **Settings → API Keys**.

> **Info:** See the [Installation guide](/mcp/installation) for other clients (Cursor, Claude Code), Windows/Linux config paths, and the local NPX setup option.

---

## Troubleshooting

Source: https://docs.coval.ai/mcp/troubleshooting

Common issues and solutions for the Coval MCP server

## Server Not Connecting

If your MCP client reports that it cannot connect to the Coval MCP server:

**Try the hosted server instead of local NPX**

This is the most common fix. The hosted server at `mcp.coval.dev` avoids local Node.js version issues and package installation problems entirely. Replace your config with the [hosted server setup](/mcp/installation#hosted-server-recommended) and restart your client.

**Check that Node.js 20+ is installed**

Both the hosted (`mcp-remote`) and local (`@covalai/mcp-server`) methods require Node.js to run `npx`.

```bash
node --version
```

If you see a version below 20 or `command not found`, install Node.js from [nodejs.org](https://nodejs.org).

**Windows users:** Make sure Node.js is on your system PATH. After installing, open a new terminal and verify `npx --version` works.
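The Node.js version check can also be scripted, for example as part of a setup script. The following is a minimal sketch under stated assumptions: `node_major_version`, `meets_requirement`, and `check_local_node` are hypothetical helper names, and the parser assumes the usual `v<major>.<minor>.<patch>` output of `node --version`.

```python
import re
import shutil
import subprocess

MIN_NODE_MAJOR = 20  # from the "Node.js 20+" requirement


def node_major_version(version_output: str) -> int:
    """Parse the major version from `node --version` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return int(match.group(1))


def meets_requirement(version_output: str) -> bool:
    """Return True when the reported version is at least MIN_NODE_MAJOR."""
    return node_major_version(version_output) >= MIN_NODE_MAJOR


def check_local_node() -> str:
    """Run `node --version` locally and describe the result."""
    node = shutil.which("node")
    if node is None:
        return "node not found on PATH"
    result = subprocess.run(
        [node, "--version"], capture_output=True, text=True, check=True
    )
    status = "OK" if meets_requirement(result.stdout) else "too old"
    return f"{result.stdout.strip()}: {status}"
```

Call `print(check_local_node())` from a terminal session that uses the same PATH as your MCP client to see which Node.js installation `npx` will pick up.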
**Verify your config file path**

The config file must be in the exact location your client expects:

| Client | macOS | Windows |
|--------|-------|---------|
| Claude Desktop | `~/Library/Application Support/Claude/claude_desktop_config.json` | `%APPDATA%\Claude\claude_desktop_config.json` |
| Cursor | `.cursor/mcp.json` (project root) | `.cursor\mcp.json` (project root) |
| Claude Code | `~/.claude/settings.json` or `.mcp.json` | `%USERPROFILE%\.claude\settings.json` or `.mcp.json` |

**Tip:** On macOS, the `Library` folder is hidden by default. In Finder, press `Cmd+Shift+G` and paste the path, or use the terminal:

```bash
open ~/Library/Application\ Support/Claude/
```

**Validate your JSON syntax**

A single syntax error (trailing comma, missing quote, extra bracket) will silently break the entire config. Paste your config file contents into [jsonlint.com](https://jsonlint.com) to check for errors.

Common mistakes:

- Trailing comma after the last item in an object or array
- Missing comma between `mcpServers` entries if you have multiple servers
- Curly quotes (`“` `”`) instead of straight quotes (`"`), which can happen when copying from some websites or chat apps

**Restart your client completely**

Closing the window is not enough, because the MCP server config is only loaded on startup.

- **Claude Desktop (macOS):** Right-click the dock icon → Quit, or use `Cmd+Q`
- **Claude Desktop (Windows):** Right-click the system tray icon → Exit
- **Cursor:** Close all windows and restart the application

## Authentication Errors

**Verify your API key**

1. Go to [Coval Dashboard → Settings → API Keys](https://app.coval.dev/settings)
2. Confirm the key is active (not revoked or expired)
3. Copy the key fresh; don't retype it manually

**Check for whitespace or formatting issues**

Make sure your API key in the config has:

- No leading or trailing spaces
- No newlines or line breaks
- No curly/smart quotes around it

The `COVAL_API_KEY` value should be the raw key string with no extra characters.

**Hosted server: check the header format**

If using the hosted server method, verify the header argument is exactly:

```
"X-API-Key: ${COVAL_API_KEY}"
```

The `${COVAL_API_KEY}` variable is resolved from the `env` block at runtime. Make sure:

- The key name in `env` matches exactly: `COVAL_API_KEY`
- The header string includes the space after the colon

## Tools Not Appearing

**Check MCP logs**

**Claude Desktop (macOS):**

```bash
# List MCP log files
ls ~/Library/Logs/Claude/mcp*.log

# View the Coval server log
cat ~/Library/Logs/Claude/mcp-server-coval.log
```

**Claude Desktop (Windows):**

```
%APPDATA%\Claude\logs\mcp*.log
```

Look for connection errors, authentication failures, or stack traces.

**Test the server manually**

Run the local server directly to see if it starts without errors:

```bash
COVAL_API_KEY=your_key_here npx -y @covalai/mcp-server
```

If this shows errors, the issue is with the server or your API key, not your client config.

**Verify API key has correct permissions**

Without a valid API key, the server may start but expose limited or no tools. Ensure your key is active at [app.coval.dev/settings](https://app.coval.dev/settings).
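Several of the config checks described above (JSON syntax, curly quotes, whitespace in the API key, and the hosted header format) can be run locally instead of pasting the file into a website. This is a rough sketch, not an official validator: `check_coval_config` is a hypothetical helper, and it only knows about the `mcpServers.coval` entry shape shown in the Installation examples.

```python
import json
from typing import List

CURLY_QUOTES = "\u201c\u201d\u2018\u2019"
EXPECTED_HEADER = "X-API-Key: ${COVAL_API_KEY}"


def check_coval_config(config_text: str) -> List[str]:
    """Return a list of problems found in an MCP client config (empty if none)."""
    problems: List[str] = []
    if any(quote in config_text for quote in CURLY_QUOTES):
        problems.append("config contains curly quotes; use straight quotes")

    # Syntax errors such as trailing commas are reported with their position.
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        problems.append(
            f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        )
        return problems

    servers = config.get("mcpServers", {}) if isinstance(config, dict) else {}
    entry = servers.get("coval") if isinstance(servers, dict) else None
    if not isinstance(entry, dict):
        problems.append('no "coval" entry under "mcpServers"')
        return problems

    key = entry.get("env", {}).get("COVAL_API_KEY", "")
    if not key:
        problems.append("COVAL_API_KEY is missing from the env block")
    elif any(ch.isspace() for ch in key):
        problems.append("COVAL_API_KEY contains whitespace or line breaks")

    # The header check only applies to the hosted (mcp-remote) setup.
    args = entry.get("args", [])
    if "mcp-remote" in args and EXPECTED_HEADER not in args:
        problems.append(f"hosted header should be exactly: {EXPECTED_HEADER}")
    return problems
```

To use it, read the config file for your client (see the path table above) and pass its contents as a string; an empty list means none of these particular problems were found.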
## Permission Denied Errors

**macOS: permission denied on config file**

If you see `zsh: permission denied` when trying to edit the config file:

```bash
# Check current permissions
ls -la ~/Library/Application\ Support/Claude/claude_desktop_config.json

# Fix permissions (make it readable/writable by you)
chmod 644 ~/Library/Application\ Support/Claude/claude_desktop_config.json
```

If the file doesn't exist yet, create it:

```bash
# Create the directory if needed
mkdir -p ~/Library/Application\ Support/Claude

# Create the config file
touch ~/Library/Application\ Support/Claude/claude_desktop_config.json
```

Then open it in your editor and paste the config from the [Installation guide](/mcp/installation).

**Windows: access denied on config file**

Try running your text editor as Administrator, or check that the file isn't marked as read-only:

1. Navigate to `%APPDATA%\Claude\`
2. Right-click `claude_desktop_config.json` → Properties
3. Uncheck "Read-only" if checked

**npx permission errors**

If `npx` fails with permission errors:

```bash
# Clear the npx cache
npx clear-npx-cache

# Or try running with explicit cache directory
npm config set cache /tmp/npm-cache --global
```

On macOS/Linux, avoid using `sudo` with npx; instead, fix the npm directory permissions:

```bash
mkdir -p ~/.npm-global
npm config set prefix '~/.npm-global'
export PATH=~/.npm-global/bin:$PATH
```

## Firewall or Proxy Issues

If you're on a corporate network or behind a proxy:

- The **hosted server** method needs outbound HTTPS access to `mcp.coval.dev`
- The **local NPX** method needs outbound HTTPS access to `registry.npmjs.org` (to download packages) and `api.coval.dev` (for API calls)

Check with your IT team if these domains are allowed through your firewall or proxy.

If you use an HTTP proxy:

```bash
# Set proxy for npm/npx
npm config set proxy http://your-proxy:port
npm config set https-proxy http://your-proxy:port
```

## Still Stuck?

If none of the above resolves your issue:

1. Gather your MCP logs (see [Check MCP logs](#check-mcp-logs) above)
2. Note which client and OS you're using
3. Reach out to [support@coval.dev](mailto:support@coval.dev) or open an issue on [GitHub](https://github.com/coval-ai/mcp-server/issues)
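Gathering logs for a support request can be sketched as a small script. The example below is an assumption-laden illustration rather than a Coval tool: `find_mcp_logs` and `tail` are hypothetical helpers, and the default directory is the macOS Claude Desktop log location mentioned under "Check MCP logs"; pass a different directory for other platforms.

```python
from pathlib import Path
from typing import List

# macOS Claude Desktop log directory from the "Check MCP logs" section.
MACOS_LOG_DIR = Path.home() / "Library" / "Logs" / "Claude"


def find_mcp_logs(log_dir: Path = MACOS_LOG_DIR) -> List[Path]:
    """Return mcp*.log files in log_dir, most recently modified first."""
    if not log_dir.is_dir():
        return []
    return sorted(
        log_dir.glob("mcp*.log"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


def tail(path: Path, lines: int = 50) -> str:
    """Return the last `lines` lines of a log file as one string."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return "\n".join(text.splitlines()[-lines:])
```

Printing `tail(log)` for each path returned by `find_mcp_logs()` yields a compact excerpt to attach to a support email or GitHub issue. Review the excerpt before sharing it, since logs may include request details.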