Did the agent do its job? (task resolution & correctness)

| Metric | What it tells you |
|---|---|
| Binary LLM Judge | A yes/no answer to your exact success question |
| Composite Evaluation | Several pass/fail criteria scored together |
| End Reason | How the conversation ended (resolved, transferred, dropped) |
| Match Expected Output | Whether the outcome matches a known-correct value |

Was it fast and responsive? (latency & reliability)

| Metric | What it tells you |
|---|---|
| Latency | How long the agent takes to respond |
| Time to First Audio | Delay before the agent starts speaking |
| Agent Fails to Respond | Turns where the agent went silent |
| Agent Needs Reprompting | Times the user had to repeat themselves |
| Agent Repeats Itself | Looping or repeated agent responses |
| LLM / STT / TTS Time to First Byte | Where latency comes from in your pipeline |

Does it sound natural? (voice quality, voice agents)

| Metric | What it tells you |
|---|---|
| Interruption Rate | How often the agent talks over the user |
| Speech Tempo | Whether the agent speaks too fast or too slow |
| Pitch Variability | Monotone vs. natural intonation |
| Volume-Pitch Misalignment | Unnatural volume/pitch mismatches |
| Voice Quality | Overall acoustic quality of the agent’s speech |

How did the customer feel? (sentiment & experience)

| Metric | What it tells you |
|---|---|
| Transcript Sentiment | Sentiment inferred from the text |
| Audio Sentiment | Sentiment inferred from the audio (tone, not just words) |
| Composite Evaluation | A blended experience score across criteria you define |

Did it follow the rules? (compliance & scripting)

| Metric | What it tells you |
|---|---|
| Transcript Regex Match | Whether required phrases appear; use absent-match to enforce “never say X” rules |
| Binary / Categorical LLM Judge | Policy, tone, and disclosure checks |
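
The two regex modes in the table above can be sketched in plain Python. This is a rough illustration of the logic only, not the platform's configuration format: the function names, the sample transcript, and both patterns are hypothetical, and case-insensitive matching is an assumption.

```python
import re


def required_phrase_present(transcript: str, pattern: str) -> bool:
    """Match mode: pass when the pattern appears somewhere in the transcript."""
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None


def forbidden_phrase_absent(transcript: str, pattern: str) -> bool:
    """Absent-match mode: pass only when the pattern never appears."""
    return re.search(pattern, transcript, flags=re.IGNORECASE) is None


# Hypothetical transcript and patterns, for illustration only.
transcript = "Agent: This call may be recorded for quality purposes. How can I help?"

# Required disclosure: the agent must mention that the call is recorded.
disclosure_ok = required_phrase_present(transcript, r"\bcall (may|will) be recorded\b")

# "Never say X": the agent must not promise guaranteed returns.
no_guarantee = forbidden_phrase_absent(transcript, r"\bguaranteed? returns?\b")
```

Word boundaries (`\b`) keep a pattern from matching inside a longer word, which matters more for absent-match checks, where a false hit counts as a violation.
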

Did it do the right things behind the scenes? (tools & traces)

| Metric | What it tells you |
|---|---|
| Tool Call Count | Whether the agent called the tools it should have |
| Custom Trace | Any value extracted from your OpenTelemetry spans |
| API State | State your backend reports for the call |
| LLM Judge with Trace context | Whether the agent used the right tools in the right order, including steps that aren’t visible in the transcript |

Was the audio transcribed accurately? (STT accuracy)

| Metric | What it tells you |
|---|---|
| STT Word Error Rate | How accurately speech was transcribed |
| Transcription Error | Likely transcription mistakes in the conversation |
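
For intuition about what a word error rate measures, here is a minimal sketch of the standard definition: substitutions, deletions, and insertions divided by the number of words in the reference transcript. The page does not say how the platform normalizes text or computes its own score, so the lowercasing, the whitespace tokenization, and the function name `word_error_rate` are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("book a table for two", "book a table for you")
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.
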

What did it cost? (usage)

| Metric | What it tells you |
|---|---|
| LLM Token Usage | Tokens consumed per conversation |

Once you know which metrics you want, add them to a run. For anything custom, see Write judge prompts and Configure metrics.