What is a metric?

Metrics give you quantitative insight into your agent’s performance, so you can spot red flags early and understand overall trends. Each metric assesses your agent in a different way:
  • Audio metrics use recordings, either simulated or live, to detect interruptions, measure phonemes per second, assess latency, and more.
  • LLM Judge metrics answer specific questions you have about your transcripts, so you can evaluate against your own criteria. They can optionally include Trace Context: when enabled, the judge automatically receives a summary of the agent’s OpenTelemetry spans alongside the transcript, which lets it evaluate tool usage, execution order, and behavior that isn’t visible in the transcript alone.
  • Other offerings include Sentiment Analysis, Regex Matching, and many more.
Coval provides built-in metrics out of the box, and you can also create custom metrics tailored to your specific needs. All metrics can be applied to Simulated Conversations as well as Live-Monitoring Conversations.

Metric groups

Every metric belongs to one of five groups, organized by how it evaluates a conversation. Browse a group to see the metrics it contains and how to configure them.
| Group | How it evaluates | Metrics |
| --- | --- | --- |
| Deterministic | Rule-based pattern matching, field lookups, and configured comparisons; no model inference | Agent Fails to Respond · Agent Needs Reprompting · API State · End Reason · Match Expected Output · Metadata Field · Music Detection · Transcript Regex Match · Words per Message (Threshold) |
| Statistical | Deterministic timing, signal, and acoustic analysis of the call | Audio Duration · Interruption Rate · Latency · Speaking Time % · Time to First Audio · Words per Message · Abrupt Pitch Changes · Audio Frequency · Background Noise · Clipping / Codec / Dropout Artifact · Loop Detection · Non-Expressive Pauses · Pause Analysis · Phoneme Stretch · Pitch Variability · Spectrogram Pitch · Speech Artifact Anomaly · Speech Tempo · Syllable Rate · Vocal Fry · Voice Quality · Volume Variance · Volume-Pitch Misalignment · Agent Repeats Itself |
| ML Model | Purpose-built machine-learning models | Audio Sentiment · Timbre Drift · Transcript Sentiment · Transcription Error |
| LLM Judge | A language model evaluates against your prompt | Binary · Categorical · Numerical · Audio Binary · Audio Categorical · Audio Numerical · Composite Evaluation |
| Trace | Computed from your agent’s OpenTelemetry spans | Custom Trace · LLM / STT / TTS Time to First Byte · LLM Token Usage · Tool Call Count · STT Word Error Rate (+ Audio Upload) |
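To make the Deterministic group concrete, here is a minimal sketch of a threshold check in the spirit of Words per Message (Threshold). It is an illustration only, not Coval's implementation: the `Message` class, the role labels, the whitespace-based word count, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "agent" or "user" (hypothetical labels for this sketch)
    text: str


def words_per_message(messages, role="agent"):
    """Mean whitespace-delimited word count across messages from `role`."""
    counts = [len(m.text.split()) for m in messages if m.role == role]
    return sum(counts) / len(counts) if counts else 0.0


def within_threshold(messages, max_words, role="agent"):
    """Rule-based pass/fail: no model inference, so the same input always gives the same result."""
    return words_per_message(messages, role) <= max_words


transcript = [
    Message("agent", "Hi, how can I help you today?"),
    Message("user", "I need to reset my password."),
    Message("agent", "Sure, I can help with that."),
]

print(words_per_message(transcript))               # mean over the two agent messages
print(within_threshold(transcript, max_words=10))  # True when the mean is at most 10
```

Because the check is a plain comparison, it needs no prompt tuning; the only thing to configure is the threshold itself.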
Several metrics are configurable; see the Metric Prompting Guide for shared authoring techniques.
Not sure where to start? Pick metrics by what you’re trying to measure:
  • Voice quality & naturalness (voice agents) — Latency, Interruption Rate, Speech Tempo, Volume-Pitch Misalignment, Pitch Variability.
  • Task resolution & correctness — Composite Evaluation, End Reason, and a custom Binary LLM Judge for your specific success criteria.
  • Responsiveness & reliability — Latency, Agent Fails to Respond, Agent Repeats Itself.
  • Compliance & scripting — Transcript Regex Match (use absent-match to enforce “never say X”), plus a Binary or Categorical LLM Judge for policy and tone checks.
  • Customer experience — Transcript Sentiment, Audio Sentiment, Composite Evaluation.
If your agent follows a defined call flow, add Workflow Verification to catch off-path behavior. Generate the workflow in the Agent creation flow, and the metric re-traces it against the transcript.
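The absent-match idea from the Compliance & scripting item above can be sketched with Python's standard `re` module. This is a hedged illustration of the general technique, not the Transcript Regex Match configuration itself: the `regex_metric` function, its `expect_match` flag, and the sample phrases are hypothetical.

```python
import re


def regex_metric(agent_turns, pattern, expect_match=True):
    """Return True when the check passes.

    expect_match=True  -> pass if the pattern appears in at least one turn
    expect_match=False -> absent-match: pass only if it appears in no turn
    """
    compiled = re.compile(pattern, re.IGNORECASE)
    found = any(compiled.search(turn) for turn in agent_turns)
    return found if expect_match else not found


agent_turns = [
    "Thanks for calling. This call may be recorded.",
    "I can't promise a refund, but I can open a ticket for you.",
]

# A required disclosure must appear somewhere in the agent's turns.
print(regex_metric(agent_turns, r"\bcall may be recorded\b"))

# A forbidden phrase must never appear ("never say X").
print(regex_metric(agent_turns, r"\bguaranteed? refund\b", expect_match=False))
```

Word boundaries (`\b`) keep the pattern from matching inside longer words, which tends to matter more for absent-match checks, where a false hit fails the call.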

Creating your own metrics

Beyond the built-in metrics, you can author your own: LLM Judge prompts, regex checks, tool-call rules, metadata fields, and custom trace extractions. See the Metric Prompting Guide for prompt-writing technique, template variables, transcript scope, and trace context, and see the Metric Types pages for what each metric does.

Improving your metrics

To refine a metric, open it from the metrics list and click “Improve Metric.” Select a test set, which must be a transcript (tip: copy and paste a simulated transcript into a new test set). You can then iterate on the metric’s wording and see how often it returns YES versus NO. Comparing those rates across versions helps reduce noise and non-determinism in LLM Judge metrics.
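If you copy verdicts out of repeated runs, a small script can summarize how stable each version of a metric is. The sketch below is an assumption-laden illustration: the verdict lists are invented sample data, and `yes_rate` and `agreement` are hypothetical helpers, not part of any Coval API.

```python
from collections import Counter


def yes_rate(verdicts):
    """Fraction of verdicts equal to "YES" (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v == "YES" for v in verdicts) / len(verdicts)


def agreement(verdicts):
    """Share of runs that returned the most common verdict; 1.0 means fully consistent."""
    if not verdicts:
        return 0.0
    return Counter(verdicts).most_common(1)[0][1] / len(verdicts)


# Hypothetical verdicts from re-running two versions of a metric on the same transcript.
runs = {
    "prompt_v1": ["YES", "NO", "YES", "NO", "YES"],
    "prompt_v2": ["YES", "YES", "YES", "YES", "YES"],
}

for name, verdicts in runs.items():
    print(name, yes_rate(verdicts), agreement(verdicts))
```

A version whose agreement stays near 1.0 across repeated runs on the same transcript is behaving consistently; a version that flips between YES and NO usually needs a more specific formulation.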
Need custom metrics tailored to your needs? Contact us, and we’ll create them for you.