Metrics - Coval Documentation

A metric turns a conversation into a measurable signal — a score, a yes/no, a category, a latency number — so you can tell whether your agent did its job and track how it performs over time. Coval includes built-in metrics, and you can author your own. Every metric works on both simulated conversations and live-monitored production calls. Pick the path that matches what you’re here to do:

Add your first metric

New here? Create a metric and attach it to a run in a few steps.

Choose a metric

Not sure what to measure? Find the right metric by goal.

Metric Library

Browse all metric types, organized by how each one evaluates.

Write judge prompts

Author LLM-judge metrics that score reliably.

What is a metric?

Each metric assesses your agent in a different way. Audio metrics use recordings, simulated or live, to detect interruptions, measure speech tempo, assess latency, and more. LLM Judge metrics answer specific questions about your transcripts, so you can check your exact success criteria. Other metrics include sentiment analysis, regex matching, and trace-based checks on tool calls. Some metrics are built-in and ready to use; others are configurable, meaning you supply a prompt, a pattern, or a threshold. Either way, you attach metrics to a run and Coval scores every simulation against them.
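To make "a conversation turned into a signal" concrete, here is a minimal conceptual sketch of two configurable checks: one driven by a pattern and one by a threshold. It is not Coval's implementation or API. The `Message` type, both function names, and the choice to average words per message are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "agent" or "user" (hypothetical transcript shape)
    text: str


def transcript_regex_match(transcript, pattern, role="agent"):
    """Yes/no signal: does any message from `role` match `pattern`?"""
    compiled = re.compile(pattern, re.IGNORECASE)
    return any(compiled.search(m.text) for m in transcript if m.role == role)


def words_per_message_within(transcript, max_words, role="agent"):
    """Threshold signal: is the mean word count per message <= `max_words`?"""
    counts = [len(m.text.split()) for m in transcript if m.role == role]
    if not counts:
        return False  # nothing to score; treated as a failure in this sketch
    return sum(counts) / len(counts) <= max_words


transcript = [
    Message("user", "I need to reset my password."),
    Message("agent", "Sure, I can help. This call may be recorded for quality."),
    Message("agent", "What email is on the account?"),
]

# Pattern-configured check: did the agent give the recording disclosure?
disclosure_given = transcript_regex_match(transcript, r"call may be recorded")

# Threshold-configured check: are the agent's replies reasonably short?
concise = words_per_message_within(transcript, max_words=20)
```

Both functions take the same transcript and return a single value, which is the property that lets one run be scored against many metrics at once.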

The metric library

Every metric belongs to one of five groups, organized by how it evaluates a conversation. Browse a group to see the metrics it contains and how to configure each one.

Group	How it evaluates	Metrics
Deterministic	Rule-based pattern matching, field lookups, and configured comparisons — no model inference	Agent Fails to Respond · Agent Needs Reprompting · API State · End Reason · Match Expected Output · Metadata Field · Music Detection · Transcript Regex Match · Words per Message (Threshold)
Statistical	Deterministic timing, signal, and acoustic analysis of the call	Audio Duration · Interruption Rate · Latency · Speaking Time % · Time to First Audio · Words per Message · Abrupt Pitch Changes · Audio Frequency · Background Noise · Clipping / Codec / Dropout Artifact · Loop Detection · Non-Expressive Pauses · Pause Analysis · Phoneme Stretch · Pitch Variability · Spectrogram Pitch · Speech Artifact Anomaly · Speech Tempo · Syllable Rate · Vocal Fry · Voice Quality · Volume Variance · Volume-Pitch Misalignment · Agent Repeats Itself
ML Model	Purpose-built machine-learning models	Audio Sentiment · Timbre Drift · Transcript Sentiment · Transcription Error
LLM Judge	A language model evaluates against your prompt	Binary · Categorical · Numerical · Audio Binary · Audio Categorical · Audio Numerical · Composite Evaluation
Trace	Computed from your agent’s OpenTelemetry spans	Custom Trace · LLM / STT / TTS Time to First Byte · LLM Token Usage · Tool Call Count · STT Word Error Rate (+ Audio Upload)

This table is organized by mechanism. If you’d rather find a metric by what you want to measure — task resolution, latency, sentiment, compliance, voice quality, tool use — start with Choose a metric.

Build your own

Beyond the built-ins, you can author your own metrics — LLM judge prompts, regex checks, tool-call rules, metadata fields, and custom trace extractions.

Write judge prompts

Prompt structure, few-shot examples, and the techniques that make LLM-judge scores consistent.

Configure metrics

Template variables, transcript scope, trace context, and thresholds.

For conditional evaluation logic — running a follow-up metric only when a trigger metric fires — see Metric Chaining.
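The idea behind chaining can be sketched in a few lines: a follow-up metric is evaluated only when the trigger metric fires, and is otherwise skipped. This is a conceptual illustration only. `run_chain`, the two example metrics, and the result shape are hypothetical and do not reflect how Coval configures or stores chained metrics.

```python
from typing import Callable, Optional

# A metric here is any function from a transcript string to a yes/no result.
Metric = Callable[[str], bool]


def run_chain(transcript: str, trigger: Metric, follow_up: Metric) -> dict:
    """Evaluate `follow_up` only when `trigger` fires; otherwise mark it skipped."""
    fired = trigger(transcript)
    result: Optional[bool] = follow_up(transcript) if fired else None
    return {"trigger": fired, "follow_up": result}


def mentions_refund(transcript: str) -> bool:
    return "refund" in transcript.lower()


def quotes_refund_window(transcript: str) -> bool:
    return "30 days" in transcript


refund_call = run_chain(
    "User: I want a refund. Agent: Refunds are accepted within 30 days.",
    mentions_refund,
    quotes_refund_window,
)
hours_call = run_chain(
    "User: What are your store hours? Agent: We open at 9am.",
    mentions_refund,
    quotes_refund_window,
)
```

Returning `None` for a skipped follow-up keeps "not evaluated" distinct from "evaluated and failed", which matters when aggregating scores across many conversations.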

Version history

Every config-changing save of a metric is recorded in its version history, so you can see how a metric’s scoring configuration changed over time and tell which version a run scored against. See Versioning for how copy-on-save works and how to pull the history through the v1 API.
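As a rough mental model of why version history lets you tell which configuration a run scored against, the sketch below records a new immutable version only when a save changes the configuration. It is an assumption-laden illustration, not Coval's data model or the v1 API. `MetricVersion`, `MetricHistory`, and the rule that an identical config creates no new version are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetricVersion:
    number: int
    config: dict


@dataclass
class MetricHistory:
    versions: list = field(default_factory=list)

    def save(self, config: dict) -> MetricVersion:
        """Append a new version only if the config differs from the latest one."""
        if self.versions and self.versions[-1].config == config:
            return self.versions[-1]
        version = MetricVersion(len(self.versions) + 1, dict(config))
        self.versions.append(version)
        return version


history = MetricHistory()
v1 = history.save({"prompt": "Did the agent verify identity?"})
run_scored_against = v1.number  # a run remembers the version it used

v2 = history.save({"prompt": "Did the agent verify identity before sharing data?"})
unchanged = history.save({"prompt": "Did the agent verify identity before sharing data?"})

# The older run can still be traced back to the exact config it was scored with.
original_config = history.versions[run_scored_against - 1].config
```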

If you need a metric Coval doesn’t have, contact us and we can build it for you.
