The Metric Library is organized by how each metric works. This page flips that around: start from what you want to measure and jump straight to the metrics that measure it.
Just starting out? A solid baseline for most agents is Latency, End Reason, and one custom Binary LLM Judge for your specific success criteria. Add more as you learn where your agent struggles.

Did the agent do its job? (task resolution & correctness)

| Metric | What it tells you |
| --- | --- |
| Binary LLM Judge | A yes/no answer to your exact success question |
| Composite Evaluation | Several pass/fail criteria scored together |
| End Reason | How the conversation ended (resolved, transferred, dropped) |
| Match Expected Output | Whether the outcome matches a known-correct value |
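To make "several pass/fail criteria scored together" concrete, here is a minimal sketch that scores a conversation as the fraction of criteria that passed. The function name and the criteria are hypothetical, and the unweighted average is an assumption; the actual Composite Evaluation metric may combine criteria differently.

```python
def composite_score(results: dict[str, bool]) -> float:
    """Return the fraction of criteria that passed, from 0.0 to 1.0."""
    if not results:
        raise ValueError("at least one criterion is required")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(results.values()) / len(results)


# Hypothetical criteria for an appointment-booking agent.
results = {
    "verified_identity": True,
    "booked_appointment": True,
    "confirmed_details": False,
}
score = composite_score(results)  # two of three criteria passed
```

A single blended number is easy to trend over time, but keep the per-criterion results around so you can see which criterion caused a drop.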

Was it fast and responsive? (latency & reliability)

| Metric | What it tells you |
| --- | --- |
| Latency | How long the agent takes to respond |
| Time to First Audio | Delay before the agent starts speaking |
| Agent Fails to Respond | Turns where the agent went silent |
| Agent Needs Reprompting | Times the user had to repeat themselves |
| Agent Repeats Itself | Looping or repeated agent responses |
| LLM / STT / TTS Time to First Byte | Where latency comes from in your pipeline |
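The per-stage time-to-first-byte metrics are most useful when compared side by side. The sketch below takes made-up per-turn samples and finds the stage with the highest median. The data layout and function name are assumptions for illustration, not the format the platform exports.

```python
from statistics import median

# Hypothetical per-turn time-to-first-byte samples, in milliseconds.
turns = [
    {"stt": 120, "llm": 640, "tts": 180},
    {"stt": 110, "llm": 910, "tts": 170},
    {"stt": 130, "llm": 580, "tts": 210},
]


def median_by_stage(turns: list[dict[str, float]]) -> dict[str, float]:
    """Return the median time to first byte for each pipeline stage."""
    stages = turns[0].keys()
    return {stage: median(t[stage] for t in turns) for stage in stages}


medians = median_by_stage(turns)
# The stage with the largest median is the first place to look for savings.
slowest = max(medians, key=medians.get)
```

Medians are used here instead of means so that one unusually slow turn does not hide the typical behavior.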

Does it sound natural? (voice quality — voice agents)

| Metric | What it tells you |
| --- | --- |
| Interruption Rate | How often the agent talks over the user |
| Speech Tempo | Whether the agent speaks too fast or too slow |
| Pitch Variability | Monotone vs. natural intonation |
| Volume-Pitch Misalignment | Unnatural volume/pitch mismatches |
| Voice Quality | Overall acoustic quality of the agent’s speech |
See the Statistical metrics page for the full set of acoustic and prosody checks (background noise, artifacts, vocal fry, and more).

How did the customer feel? (sentiment & experience)

| Metric | What it tells you |
| --- | --- |
| Transcript Sentiment | Sentiment inferred from the text |
| Audio Sentiment | Sentiment inferred from the audio (tone, not just words) |
| Composite Evaluation | A blended experience score across criteria you define |

Did it follow the rules? (compliance & scripting)

| Metric | What it tells you |
| --- | --- |
| Transcript Regex Match | Whether required phrases appear, or (with absent-match) whether a forbidden phrase was avoided, to enforce “never say X” |
| Binary / Categorical LLM Judge | Policy, tone, and disclosure checks |
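The two regex modes are mirror images: a required-phrase check passes when the pattern is found, and an absent-match check passes when it is not. The sketch below uses Python's standard `re` module; the helper names, patterns, and transcript are hypothetical, and the metric's own matching options (case sensitivity, speaker filtering) may differ.

```python
import re


def phrase_present(transcript: str, pattern: str) -> bool:
    """Pass when the pattern appears anywhere in the transcript."""
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None


def phrase_absent(transcript: str, pattern: str) -> bool:
    """Pass when the pattern never appears (the "never say X" case)."""
    return not phrase_present(transcript, pattern)


# Made-up transcript for illustration.
transcript = (
    "Agent: This call may be recorded for quality purposes.\n"
    "User: I'd like to check my balance.\n"
    "Agent: Sure, I can help with that."
)

# Required disclosure must appear.
disclosure_ok = phrase_present(transcript, r"\bthis call may be recorded\b")
# Forbidden promise must not appear.
no_guarantee = phrase_absent(transcript, r"\bguaranteed? returns?\b")
```

Regex checks are exact and cheap, so they suit fixed wording such as legal disclosures. For paraphrased or tone-dependent rules, an LLM judge is usually the better fit.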
If your agent follows a defined call flow, add Workflow Verification to catch off-path behavior. Generate the workflow in the Agent creation flow; the metric then re-traces it against the transcript.

Did it do the right things behind the scenes? (tools & traces)

| Metric | What it tells you |
| --- | --- |
| Tool Call Count | Whether the agent called the tools it should have |
| Custom Trace | Any value extracted from your OpenTelemetry spans |
| API State | State your backend reports for the call |
| LLM Judge with Trace context | Verify tool usage and order that isn’t visible in the transcript |
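As a rough illustration of what a tool-call check looks at, the sketch below counts tool calls and compares their order against an expected sequence. The span records are simplified, hypothetical dictionaries; the span names, the `tool.name` attribute key, and the tool names are assumptions, and real OpenTelemetry spans carry many more fields.

```python
from collections import Counter

# Hypothetical, simplified span records in the order they were emitted.
spans = [
    {"name": "llm.completion", "attributes": {}},
    {"name": "tool.call", "attributes": {"tool.name": "lookup_account"}},
    {"name": "tool.call", "attributes": {"tool.name": "check_balance"}},
    {"name": "llm.completion", "attributes": {}},
]


def tool_calls(spans: list[dict]) -> list[str]:
    """Return the names of called tools, preserving call order."""
    return [
        s["attributes"]["tool.name"] for s in spans if s["name"] == "tool.call"
    ]


calls = tool_calls(spans)
counts = Counter(calls)  # how many times each tool was called

# Hypothetical expectation: look up the account before checking the balance.
expected_order = ["lookup_account", "check_balance"]
in_order = calls == expected_order
```

Counts catch missing or duplicated calls; the order comparison catches cases where the right tools ran in the wrong sequence.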

Was the audio transcribed accurately? (STT accuracy)

| Metric | What it tells you |
| --- | --- |
| STT Word Error Rate | How accurately speech was transcribed |
| Transcription Error | Likely transcription mistakes in the conversation |
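Word error rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It lowercases and splits on whitespace only, which is an assumption; the STT Word Error Rate metric may normalize punctuation and casing differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word was dropped
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution, or a match if cost is 0
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("account" -> "count") and one insertion ("today")
# against a five-word reference.
wer = word_error_rate(
    "please check my account balance",
    "please check my count balance today",
)
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.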

What did it cost? (usage)

| Metric | What it tells you |
| --- | --- |
| LLM Token Usage | Tokens consumed per conversation |

Once you know which metrics you want, add them to a run. For anything custom, see Write judge prompts and Configure metrics.