Choose a metric - Coval Documentation

The Metric Library is organized by how each metric works. This page flips that around: start from what you want to measure and jump straight to the metrics that measure it.

Just starting out? A solid baseline for most agents is Latency, End Reason, and one custom Binary LLM Judge for your specific success criteria. Add more as you learn where your agent struggles.

Did the agent do its job? (task resolution & correctness)

Metric	What it tells you
Binary LLM Judge	A yes/no answer to your exact success question
Composite Evaluation	Several pass/fail criteria scored together
End Reason	How the conversation ended (resolved, transferred, dropped)
Match Expected Output	Whether the outcome matches a known-correct value
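
To make "several pass/fail criteria scored together" concrete, here is a minimal sketch that reports the fraction of criteria that passed. The function name, the criterion names, and the equal-weight averaging are all assumptions for illustration; the platform's Composite Evaluation may weight or combine criteria differently.

```python
def composite_score(results: dict[str, bool]) -> float:
    """Return the fraction of pass/fail criteria that passed.

    Hypothetical aggregation: every criterion carries equal weight.
    """
    if not results:
        raise ValueError("at least one criterion is required")
    return sum(results.values()) / len(results)


# Hypothetical criteria for an appointment-booking agent.
results = {
    "verified_identity": True,
    "booked_appointment": True,
    "confirmed_time": False,
}
score = composite_score(results)  # two of three criteria passed
```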

Was it fast and responsive? (latency & reliability)

Metric	What it tells you
Latency	How long the agent takes to respond
Time to First Audio	Delay before the agent starts speaking
Agent Fails to Respond	Turns where the agent went silent
Agent Needs Reprompting	Times the user had to repeat themselves
Agent Repeats Itself	Looping or repeated agent responses
LLM / STT / TTS Time to First Byte	Where latency comes from in your pipeline
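
As a rough illustration of what a response-latency measurement captures, the sketch below computes the gap between the end of each user turn and the start of the agent turn that follows it. The `Turn` structure and its field names are hypothetical, and this is not necessarily how the Latency metric is computed internally.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user" or "agent" (hypothetical labels)
    start: float  # seconds from the start of the call
    end: float


def response_latencies(turns: list[Turn]) -> list[float]:
    """Gap in seconds between each user turn and the agent turn right after it."""
    latencies = []
    for prev, curr in zip(turns, turns[1:]):
        if prev.speaker == "user" and curr.speaker == "agent":
            latencies.append(round(curr.start - prev.end, 3))
    return latencies


turns = [
    Turn("user", 0.0, 2.0),
    Turn("agent", 2.8, 5.0),
    Turn("user", 5.5, 7.0),
    Turn("agent", 8.5, 10.0),
]
latencies = response_latencies(turns)
slowest = max(latencies)
```

A single slow turn can hide behind a healthy average, so it is often worth looking at the worst case as well as the mean.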

Does it sound natural? (voice quality — voice agents)

Metric	What it tells you
Interruption Rate	How often the agent talks over the user
Speech Tempo	Whether the agent speaks too fast or too slow
Pitch Variability	Monotone vs. natural intonation
Volume-Pitch Misalignment	Unnatural volume/pitch mismatches
Voice Quality	Overall acoustic quality of the agent’s speech

See the Statistical metrics page for the full set of acoustic and prosody checks (background noise, artifacts, vocal fry, and more).

How did the customer feel? (sentiment & experience)

Metric	What it tells you
Transcript Sentiment	Sentiment inferred from the text
Audio Sentiment	Sentiment inferred from the audio (tone, not just words)
Composite Evaluation	A blended experience score across criteria you define

Did it follow the rules? (compliance & scripting)

Metric	What it tells you
Transcript Regex Match	Whether required phrases appear — or use absent-match to enforce “never say X”
Binary / Categorical LLM Judge	Policy, tone, and disclosure checks
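
The difference between a required-phrase check and an absent-match check can be sketched with Python's standard `re` module. The helper name, the `absent` flag, and the example patterns are assumptions for illustration, not the metric's actual configuration fields.

```python
import re


def regex_check(transcript: str, pattern: str, absent: bool = False) -> bool:
    """Pass if the pattern is found, or if it is NOT found when absent=True."""
    found = re.search(pattern, transcript, flags=re.IGNORECASE) is not None
    return not found if absent else found


transcript = (
    "Agent: This call may be recorded for quality purposes. "
    "How can I help you today?"
)

# Required phrase: the recording disclosure must appear.
disclosure_ok = regex_check(transcript, r"this call may be recorded")

# Absent-match: the agent must never promise guaranteed returns.
no_promise_ok = regex_check(transcript, r"\bguaranteed returns?\b", absent=True)
```

Both checks pass for this transcript: the disclosure is present and the forbidden phrase is not.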

If your agent follows a defined call flow, add Workflow Verification to catch off-path behavior. Generate the workflow in the Agent creation flow, and the metric re-traces it against the transcript.

Did it do the right things behind the scenes? (tools & traces)

Metric	What it tells you
Tool Call Count	Whether the agent called the tools it should have
Custom Trace	Any value extracted from your OpenTelemetry spans
API State	State your backend reports for the call
LLM Judge with Trace context	Verify tool usage and order that isn’t visible in the transcript
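
To show the idea behind checking tool calls, the sketch below counts calls per tool from a list of span-like dictionaries and reports any tool that was called fewer times than expected. The `kind` and `tool_name` keys are hypothetical stand-ins; real OpenTelemetry spans will carry whatever attributes your instrumentation emits.

```python
from collections import Counter


def tool_call_counts(spans: list[dict]) -> Counter:
    """Count tool-call spans per tool name (span keys are hypothetical)."""
    return Counter(
        span["tool_name"] for span in spans if span.get("kind") == "tool_call"
    )


def missing_calls(spans: list[dict], expected: dict[str, int]) -> dict[str, int]:
    """Map each under-called tool to the number of calls still missing."""
    counts = tool_call_counts(spans)
    return {
        name: needed - counts[name]
        for name, needed in expected.items()
        if counts[name] < needed
    }


spans = [
    {"kind": "llm", "model": "example-model"},
    {"kind": "tool_call", "tool_name": "lookup_customer"},
    {"kind": "tool_call", "tool_name": "lookup_customer"},
]
expected = {"lookup_customer": 1, "create_ticket": 1}
missing = missing_calls(spans, expected)
```

Here `create_ticket` is reported as missing because the agent never called it, which is exactly the kind of gap a transcript alone would not reveal.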

Was the audio transcribed accurately? (STT accuracy)

Metric	What it tells you
STT Word Error Rate	How accurately speech was transcribed
Transcription Error	Likely transcription mistakes in the conversation
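
Word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below implements that standard definition; the lowercasing and whitespace tokenization are simplifying assumptions, and the platform's metric may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "please book a table for two"
hypothesis = "please book table for too"
wer = word_error_rate(reference, hypothesis)  # 1 deletion + 1 substitution over 6 words
```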

What did it cost? (usage)

Metric	What it tells you
LLM Token Usage	Tokens consumed per conversation

Once you know which metrics you want, add them to a run. For anything custom, see Write judge prompts and Configure metrics.
