- Metric versioning with full change history
- Standardized LLM-judge metric templates
- AI-generated run summaries (beta)
- Project-scoped Human Review pages
Metrics now keep a version history, so you can see how a metric changed over time and review prior versions.
Each metric keeps a record of its prior states, newest first. You can review the history in the app or pull it through the v1 API.
A library of ready-made LLM judge metric templates is now available when you create a metric, and the metric gallery is reorganized into clearer categories.
Pick a standardized judge prompt as a starting point instead of writing one from scratch.
Runs now include an AI-generated summary of what happened, available in beta. Mark a summary helpful or not to help it improve.
The summary renders with clean formatting and a thumbs up or down control. As a beta feature, it should improve over time as we tune it.
Human Review now has project-scoped pages, so you can open a project’s overview and assignments on their own pages and share a direct link to them.
Each project gets its own overview and assignments view with breadcrumbs, so you can navigate to and share specific review work without losing context.
Coval now validates Pipecat, LiveKit, and WebSocket agent connections when you set them up, so configuration problems surface before you run.
You can now upload your own background sounds for simulations, so agents can be tested against the exact ambient conditions they will face in production.
Telephony call recordings in MP3 format now upload reliably, including low-sample-rate recordings that were previously rejected.
- New pitch variability metric
- New perceived loudness (LUFS) metric
A new metric that flags whether an agent sounds monotone or expressive across a call.
A new metric that measures perceived loudness across a call, so you can catch audio that is too quiet or too loud.
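To give a feel for level checks of this kind, here is a rough sketch. It computes plain RMS level in dBFS, which is a simpler stand-in and not LUFS: real LUFS measurement (ITU-R BS.1770) adds K-weighting and gating, both omitted here. The function names and the quiet/loud thresholds are hypothetical and are not the metric's actual configuration.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dBFS for samples normalized to [-1.0, 1.0].

    Simplified stand-in only: no K-weighting or gating, so this is not LUFS.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def classify_level(level_db: float,
                   quiet_below: float = -30.0,   # hypothetical threshold
                   loud_above: float = -10.0) -> str:  # hypothetical threshold
    """Bucket a level reading into too quiet / ok / too loud."""
    if level_db < quiet_below:
        return "too quiet"
    if level_db > loud_above:
        return "too loud"
    return "ok"


# One full cycle of a full-scale sine wave, sampled at 100 points.
sine = [math.sin(2 * math.pi * k / 100) for k in range(100)]
sine_level = rms_dbfs(sine)
```

A full-scale sine has a mean square of 0.5, so it reads roughly 3 dB below full scale and lands in the "too loud" bucket under these placeholder thresholds.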
- Create and delete dashboards via the API and CLI
- Broader chat-agent connectivity with SSE streaming
- Per-conversation metadata in metric prompts
Create and delete dashboards programmatically through the API and CLI.
Manage dashboards as part of your own workflows and scripts, without setting each one up by hand in the app. Useful for spinning up consistent dashboards per project or per environment.
Chat agents now support SSE streaming and a configurable response format.
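For readers wiring up an agent endpoint, the sketch below shows how a Server-Sent Events stream is framed on the wire, following the standard `event:` / `data:` line format. It is a generic parser and not Coval's implementation; `parse_sse` is a hypothetical helper, and the event names and payloads in the sample stream are assumptions.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event_type = "message"      # default event type in the SSE format
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the current event; events without data are dropped.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue            # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]   # a single leading space after the colon is stripped
        if name == "event":
            event_type = value
        elif name == "data":
            data.append(value)  # multiple data lines are joined with newlines


# Hypothetical stream; real agents may use other event names and payloads.
sample_stream = [
    "event: token\n",
    "data: Hel\n",
    "\n",
    ": keep-alive\n",
    "data: lo\n",
    "data: !\n",
    "\n",
]
events = list(parse_sse(sample_stream))
```

Each yielded dict carries the event type and the joined data payload, so a client can append text chunks as they arrive rather than waiting for the full response.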
Dynamic metrics can now reference per-conversation metadata directly in the prompt template.
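As an illustration of the idea, the sketch below fills a judge prompt from a conversation's metadata using Python's `string.Template`. The `$name` placeholder syntax, the `render_prompt` helper, and the metadata keys are all assumptions made for this example; check the docs for the placeholder syntax Coval actually uses.

```python
from string import Template
from typing import Mapping

# Hypothetical prompt; the placeholder syntax is an assumption for this sketch.
prompt_template = Template(
    "The caller's plan is $plan_tier and their locale is $locale. "
    "Did the agent only offer options that are valid for that plan?"
)


def render_prompt(template: Template, metadata: Mapping[str, str]) -> str:
    """Fill the template from one conversation's metadata.

    substitute() raises KeyError when the template references a key the
    conversation does not carry, which surfaces gaps early.
    """
    return template.substitute(metadata)


# Hypothetical metadata attached to a single conversation.
conversation_metadata = {"plan_tier": "premium", "locale": "en-GB"}
prompt = render_prompt(prompt_template, conversation_metadata)
```

Failing loudly on a missing key is a design choice in this sketch; a more forgiving variant could use `safe_substitute()` and leave unknown placeholders in place.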
When your organization runs evaluations beyond its concurrency limit, we now store the data and mark those evaluations as errored. You can rerun them later, once you are back within the limit.
Metric descriptions auto-populate and the name auto-fills when you create a metric.
- More reliable categorical audio metrics
Categorical audio metrics now raise a clear, descriptive error when no categories are configured, so misconfigurations surface right away.
- Visual IVR Tree Builder
- IVR Flow Adherence metric
- Custom trace aggregations
- Tags for metrics, templates, and test sets
A built-in metric that checks whether a call follows the intended IVR navigation path, with no custom scoring logic required.
The metric scores each call against the IVR path you define, so you can see where calls deviate from the intended route. Results are reported per call and roll up across the run.
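The per-call check and run-level rollup can be pictured with a small sketch. This is not the metric's actual scoring logic: it assumes strict step-by-step matching, treats extra steps after the expected path as a deviation, and uses hypothetical names (`check_adherence`, `adherence_rate`) and step labels.

```python
from typing import Iterable, NamedTuple, Optional, Sequence


class AdherenceResult(NamedTuple):
    adherent: bool
    deviation_index: Optional[int]  # first step where the call left the path


def check_adherence(expected: Sequence[str],
                    actual: Sequence[str]) -> AdherenceResult:
    """Compare one call's navigation steps with the intended path."""
    for i, step in enumerate(expected):
        # Deviation: the call ended early or took a different step here.
        if i >= len(actual) or actual[i] != step:
            return AdherenceResult(False, i)
    if len(actual) > len(expected):
        # Assumption: continuing past the intended path counts as a deviation.
        return AdherenceResult(False, len(expected))
    return AdherenceResult(True, None)


def adherence_rate(results: Iterable[AdherenceResult]) -> float:
    """Share of calls in a run that followed the intended path."""
    results = list(results)
    if not results:
        return 0.0
    return sum(r.adherent for r in results) / len(results)


# Hypothetical intended path and two simulated calls.
expected_path = ["main_menu", "billing", "pay_bill"]
call_results = [
    check_adherence(expected_path, ["main_menu", "billing", "pay_bill"]),
    check_adherence(expected_path, ["main_menu", "support"]),
]
run_rate = adherence_rate(call_results)
```

The deviation index points at the first step where a call left the route, which is the kind of detail that makes a per-call result actionable.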
Define and simulate branching IVR call flows visually, directly in Coval.
Lay out the call paths a caller can take, then run simulations against the whole tree to see where agents take the wrong branch. No external diagramming or scripting needed.
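Conceptually, a branching IVR flow is a tree whose edges are caller inputs. The sketch below models one and lists every root-to-leaf path, which is the set of routes a simulation over the whole tree would need to cover. `IvrNode`, `iter_paths`, and the menu contents are hypothetical and do not reflect how the builder stores flows internally.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class IvrNode:
    prompt: str
    # Maps a caller input (for example a keypad digit) to the next node.
    options: Dict[str, "IvrNode"] = field(default_factory=dict)


def iter_paths(node: IvrNode,
               prefix: Tuple[str, ...] = ()) -> Iterator[List[str]]:
    """Yield every root-to-leaf sequence of caller inputs."""
    if not node.options:
        yield list(prefix)      # leaf: the path is complete
        return
    for key, child in node.options.items():
        yield from iter_paths(child, prefix + (key,))


# Hypothetical menu: 1 -> Billing (1 -> Pay bill, 2 -> Balance), 2 -> Support.
tree = IvrNode("Main menu", {
    "1": IvrNode("Billing", {
        "1": IvrNode("Pay bill"),
        "2": IvrNode("Balance"),
    }),
    "2": IvrNode("Support"),
})
all_paths = list(iter_paths(tree))
```

For this menu the enumeration produces three paths, one per leaf, in the order the options were defined.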
Choose how per-turn scores roll up across a multi-turn trace, for example worst-case or first-occurrence.
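The two rollups named above can be sketched as follows. This is an illustration under assumptions, not the product's implementation: it assumes higher scores are better, that turns with no score are recorded as `None`, and that "first-occurrence" means the first turn that produced a score. The function and key names are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Scores = Sequence[Optional[float]]


def worst_case(scores: Scores) -> Optional[float]:
    """Lowest score among scored turns (assumes higher is better)."""
    scored = [s for s in scores if s is not None]
    return min(scored) if scored else None


def first_occurrence(scores: Scores) -> Optional[float]:
    """Score of the first turn that produced a score."""
    for s in scores:
        if s is not None:
            return s
    return None


# Hypothetical registry of aggregation modes.
AGGREGATIONS: Dict[str, Callable[[Scores], Optional[float]]] = {
    "worst_case": worst_case,
    "first_occurrence": first_occurrence,
}


def aggregate_trace(scores: Scores, how: str) -> Optional[float]:
    """Roll per-turn scores up into one trace-level score."""
    return AGGREGATIONS[how](scores)


# Four turns; the first turn was not scored.
turn_scores = [None, 0.9, 0.4, 0.7]
worst = aggregate_trace(turn_scores, "worst_case")
first = aggregate_trace(turn_scores, "first_occurrence")
```

The same trace yields different results depending on the mode: worst-case surfaces the weakest turn, while first-occurrence reflects how the trace started.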
Apply custom tags to metrics, run templates, and test sets to organize and filter your library.
New voices are available in the persona picker, ready to use with no setup.
Custom sign-in now works across all configured authentication methods.
Password-protected shared runs now open reliably across different saved password formats.
- Faster, more reliable reports
- Easier metric selection
- More resilient metric scoring
Large reports that used to struggle to load now open reliably even at scale, thanks to a change in how we load them.
The metric picker now uses grouped, nested categories, so the right metric is faster to find and apply.
There is no longer a 1000-character limit on metadata for uploaded conversations.
Auto-generated metrics are now reliably tagged LLM Judge, so AI-scored and rule-based metrics are easy to tell apart.
Speech anomaly, volume variance, and audio sentiment metrics now scope to the channel and time range you configure.
Metrics now handle a wider variety of metadata inputs more reliably.