ML Model Metrics

ML Model metrics run a trained model over the call to produce a score or label. Unlike LLM Judge metrics, they use specialized self-hosted models rather than a configurable prompt, so their output categories are fixed.

Metric	Tells you	Unit	Scope
Audio Sentiment	Emotional tone of the speech	category	—
Timbre Drift	Voice consistency over the call	drift score	Audio
Transcript Sentiment	Agent tone from the transcript	category	—
Transcription Error	Transcript accuracy (WER)	WER 0–1	Audio

Audio Sentiment

Classifies the emotional tone of speech for each speaking segment using an audio-based model. Because it listens to vocal tone rather than the words, it catches emotion that a transcript misses.

What it measures: The most frequently detected emotion across the call (Neutral, Happy, Angry, or Sad), classified per speaking segment from vocal tone alone rather than spoken content.

When to use: Monitoring the overall tone of support calls, tracking whether a caller’s emotion escalates over a conversation, or verifying that a voice agent sounds appropriately calm or upbeat for its use case.

How it works: A self-hosted audio sentiment model labels each speaking segment, and the primary value is the most common emotion over the call. Per-segment labels let you follow the sentiment trend turn by turn.

How to interpret: Review the frequency of each emotion and the sentiment trend across the conversation; there is no single “good” value, since the right tone depends on the use case. A shift toward Angry or Sad on the caller’s side over the call is a common signal of mounting frustration worth investigating.

Configuration: A configurable variant lets you define which emotions count as success sentiments, choose which speaker to evaluate (agent or persona), and set the minimum percentage of speaking segments that must match. It returns YES when the threshold is met, NO otherwise.

Requires: Audio · Unit: sentiment category (Neutral, Happy, Angry, Sad)
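The aggregation described above can be sketched in a few lines of Python. This is an illustration only, not Coval's implementation: the function names, the tie-breaking rule (first label seen wins), and the NO result for a call with no segments are all assumptions. It takes per-segment labels as input and shows both the primary value and the YES/NO result of the configurable variant.

```python
from collections import Counter


def dominant_emotion(segment_labels):
    """Return the most frequent emotion label across speaking segments.

    Hypothetical helper. On a tie, Counter keeps first-seen order,
    so the label that appeared first wins (an assumption).
    """
    if not segment_labels:
        return None
    return Counter(segment_labels).most_common(1)[0][0]


def meets_sentiment_threshold(segment_labels, success_sentiments, min_percent):
    """Return "YES" when at least min_percent of segments match a success sentiment."""
    if not segment_labels:
        return "NO"  # assumption: a call with no segments cannot pass
    matches = sum(1 for label in segment_labels if label in success_sentiments)
    percent = 100.0 * matches / len(segment_labels)
    return "YES" if percent >= min_percent else "NO"


# Per-segment labels for one speaker, in call order (made-up data).
labels = ["Neutral", "Happy", "Neutral", "Angry", "Neutral"]

print(dominant_emotion(labels))                                    # Neutral
print(meets_sentiment_threshold(labels, {"Neutral", "Happy"}, 75))  # YES (4 of 5 = 80%)
```

Keeping the per-segment list, rather than only the dominant label, is what makes the turn-by-turn trend available for review.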

Timbre Drift

Detects shifts in the agent’s voice timbre relative to the start of the call: mid-call changes in voice identity that make the agent sound like a different person.

What it measures: How far the agent’s voice identity drifts from the start of the call, measured as the maximum embedding distance from an anchor reference, with pitch (F0) drift tracked as an additional signal.

When to use: When callers report that the agent’s voice changed during a call, when catching mid-call voice swaps after a TTS failover, or when you suspect your TTS provider is switching voice models or applying inconsistent conditioning mid-session.

How it works: A self-hosted speaker-embedding model compares voice embeddings throughout the call against both an anchor reference from the call’s opening seconds and a rolling recent-window baseline. Drift is flagged when either distance exceeds the threshold or pitch deviates substantially from the anchor mean, with both signals contributing to severity.

How to interpret: Lower is better; 0 means the voice stayed consistent. Anchor drift catches slow degradation (the voice getting “tired”), while rolling-window drift catches sudden voice swaps. Small drift values (below ~0.2) are within normal TTS session variation; values above 0.3–0.4 are perceivable to most listeners.

Configuration: Optional preset (strict, normal, or loose, each setting a tighter or looser drift threshold) and detector-tuning overrides via metric metadata (window sizes, drift threshold, minimum speech requirements).

Requires: Audio · Unit: drift score (0 = consistent, higher = more drift) · Scope: Audio
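To make the anchor and rolling-window comparison concrete, here is a minimal sketch. It rests on several assumptions: the distance is cosine distance (the text above only says "embedding distance"), embeddings arrive as one non-zero vector per analysis window, the parameter names and default threshold of 0.3 are hypothetical, and the pitch (F0) signal is left out entirely.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def timbre_drift(embeddings, anchor_windows=2, rolling_windows=2, threshold=0.3):
    """Return (max distance from the anchor, indices of flagged windows).

    anchor_windows, rolling_windows, and threshold are illustrative names
    and values, not documented settings.
    """
    anchor = mean_vector(embeddings[:anchor_windows])
    max_anchor_distance = 0.0
    flagged = []
    for i in range(anchor_windows, len(embeddings)):
        current = embeddings[i]
        # Baseline built from the windows immediately before the current one.
        recent = mean_vector(embeddings[max(0, i - rolling_windows):i])
        d_anchor = cosine_distance(current, anchor)
        d_rolling = cosine_distance(current, recent)
        max_anchor_distance = max(max_anchor_distance, d_anchor)
        if d_anchor > threshold or d_rolling > threshold:
            flagged.append(i)
    return max_anchor_distance, flagged


# Toy 2-D "embeddings": the last window points in a different direction,
# standing in for a sudden voice swap.
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
score, flagged = timbre_drift(embeddings)
print(flagged)  # [3]
```

Comparing against two baselines is the point of the design: the fixed anchor accumulates slow drift that never looks large from one window to the next, while the rolling baseline reacts to an abrupt change even if the anchor itself was noisy.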

Transcript Sentiment

Classifies the agent’s overall tone across the conversation from the transcript, returning the dominant tone category.

What it measures: The dominant tone category expressed by the agent over the full conversation: Rude, Polite, Encouraging, or Professional. It always evaluates the agent’s side of the transcript.

When to use: Auditing how the agent’s manner reads to customers, checking that a prompt or persona change hasn’t made the agent come across as curt or rude, or getting a quick tone read on text-only conversations where no audio is available.

How it works: An ML model scores the agent’s transcript text against each tone category and returns the category with the highest overall score.

How to interpret: Use the dominant category to gauge how the agent’s tone reads; Polite, Encouraging, and Professional generally read positively, while Rude flags a tone issue. A Rude result is worth reviewing alongside the transcript to find the offending turns and adjust the agent’s prompt.

Configuration: None (built-in).

Requires: Transcript · Unit: tone category (Rude, Polite, Encouraging, Professional)
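The final selection step amounts to picking the highest-scoring category. The sketch below assumes the model's output is a mapping from category name to score; the function name and the score values are made up for illustration, and how the model produces the scores is not shown.

```python
TONE_CATEGORIES = ("Rude", "Polite", "Encouraging", "Professional")


def dominant_tone(scores):
    """Return the tone category with the highest score.

    `scores` is a hypothetical dict of category -> score covering
    every category in TONE_CATEGORIES.
    """
    missing = [name for name in TONE_CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return max(TONE_CATEGORIES, key=lambda name: scores[name])


# Made-up scores for the agent's side of one conversation.
scores = {"Rude": 0.05, "Polite": 0.30, "Encouraging": 0.10, "Professional": 0.55}
print(dominant_tone(scores))  # Professional
```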

Transcription Error

Measures transcription accuracy by comparing the conversation transcript to an independent reference re-transcription that Coval generates from the call audio. The result is reported as Word Error Rate (WER).

What it measures: Word Error Rate (WER) between the existing transcript and a reference transcription, counting substitutions, deletions, and insertions against the reference word count.

When to use: Validating transcript quality before trusting transcript-based metrics, comparing speech-to-text providers for the agent or persona side, or detecting transcription regressions across runs and audio configurations.

How it works: A self-hosted speech-to-text model re-transcribes the call audio, and the result is compared word by word to the existing transcript, surfacing each error as a word-level transcript highlight (substitutions, deletions, and insertions are color-coded). Speaker channels are detected automatically, so it works on both Coval-recorded simulations and uploaded conversations.

How to interpret: Lower is better; 0 means a perfect match to the reference re-transcription. WER below 0.10 is excellent, 0.10–0.30 is acceptable for most conversational agents and noisy audio, and above 0.30 may significantly impact understanding of the call. The headline WER reflects errors after configured filtering, so the number always matches the highlighted errors in the transcript view.

Configuration: Adjustable via metric metadata: role (agent, persona, or both), min_reference_confidence to drop low-confidence reference errors, min_substitute_similarity to drop near-identical substitutions like spelling variants, and stt_provider_model to choose the reference STT engine.

Requires: Audio · Unit: Word Error Rate (0 = no errors) · Scope: Audio
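For readers who want to see the arithmetic, here is a minimal WER sketch using a standard word-level edit-distance alignment. It is not Coval's pipeline: the lowercase-and-split normalization is an assumption, the confidence and similarity filters described above are not applied, and the function names and the "review" label for the top band are hypothetical.

```python
def word_error_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for hypothesis vs. reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = (total errors, subs, dels, ins) aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # words match, no new error
                continue
            t, s, d, n = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)][1:]


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / number of reference words."""
    ref_word_count = len(reference.split())
    if ref_word_count == 0:
        raise ValueError("reference must contain at least one word")
    return sum(word_error_counts(reference, hypothesis)) / ref_word_count


def wer_band(wer):
    """Map a WER value onto the interpretation bands described above."""
    if wer < 0.10:
        return "excellent"
    if wer <= 0.30:
        return "acceptable"
    return "review"  # hypothetical label for WER above 0.30


reference = "please confirm my appointment for tuesday"    # re-transcription
existing = "please confirm appointment for thursday now"   # transcript under test

print(word_error_counts(reference, existing))  # (1, 1, 1)
wer = word_error_rate(reference, existing)
print(wer, wer_band(wer))                      # 0.5 review
```

In the example, "my" is a deletion, "tuesday" to "thursday" is a substitution, and "now" is an insertion, giving 3 errors over 6 reference words.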
