Statistical metrics quantify timing, signal, and acoustic properties of a conversation through deterministic analysis. Nearly all run without model inference, so results are reproducible for the same input; the exception is Agent Repeats Itself, whose default mode uses an LLM judge. Most operate on audio and report a numeric value with a defined unit. Audio metrics you create yourself can take an audio scope to restrict analysis to a specific speaker or time range (see scoping audio metrics). Built-in metrics ship preconfigured, with no scope settings.
| Metric | Tells you | Unit | Scope |
| --- | --- | --- | --- |
| Abrupt Pitch Changes | Sudden, unnatural pitch jumps | per minute | Audio |
| Agent Repeats Itself | Problematic repetition | YES / NO | Transcript |
| Audio Duration | Total call length | seconds | |
| Audio Frequency | Voiced speech above a pitch floor | percent | Audio |
| Background Noise | Audio cleanliness (SNR) | dB | |
| Clipping Artifact | Saturated, distorted samples | fraction | Audio |
| Codec Artifact | Encoding distortion | severity | Audio |
| Dropout Artifact | Signal cut-outs mid-speech | per minute | Audio |
| Interruption Rate | Barge-ins over the user | per minute | |
| Latency | Agent response delay | seconds | |
| Loop Detection | Repeated TTS loops | count | Audio |
| Non-Expressive Pauses | Flat, robotic pauses | per minute | Audio |
| Pause Analysis | Pause frequency and anomalous pauses | per minute | Audio |
| Perceived Loudness | Overall call loudness | LUFS | Audio |
| Phoneme Stretch | Unnaturally held sounds (TTS glitch) | seconds | Audio |
| Pitch Variability | Monotone vs. expressive delivery | semitones | Audio |
| Speaking Time Percentage | Share of the call a speaker talks | percent | Audio |
| Spectrogram Pitch Analysis | Natural upper-frequency content | YES / NO | Audio |
| Speech Artifact Anomaly | Overall speech delivery quality | 0–1 | Audio |
| Speech Tempo | Speech rate | phonemes/s | Audio |
| Syllable Rate | Speaking pace vs. the normal range | syllables/s | Audio |
| Time to First Audio | Delay before the first sound | milliseconds | |
| Vocal Fry | Creaky-voice duration | seconds | Audio |
| Voice Quality | Acoustic naturalness | 0–1 | Audio |
| Volume Variance | Volume steadiness | dB | Audio |
| Volume-Pitch Misalignment | Unnatural prosody | severity | Audio |
| Words per Message | Agent verbosity | words | |

Conversation & timing

Audio Duration

Total length of the call audio. A simple descriptive measure that anchors per-minute and percentage metrics on the same call.

What it measures: The total duration of the call audio in seconds.

When to use: Catching calls that run far longer than the scenario should take or that end almost immediately; both often signal a stuck dialogue, an early hangup, or a failed connection.

How it works: Reads the length of the audio directly from the recording.

How to interpret: A descriptive measure of call length with no inherent good direction; compare against the expected length for your scenario. Sudden shifts across runs usually trace back to a dialogue or configuration change rather than an audio problem.

Configuration: None (built-in).

Requires Audio · Unit seconds

Interruption Rate

How often the agent starts speaking while the user is still talking, reported as interruptions per minute. Only agent-over-user interruptions count; times the user interrupts the agent are excluded.

What it measures: Counts agent-onset-during-user-speech events across the call and normalizes them to a per-minute rate.

When to use: Conversation flow analysis, e.g., verifying that a barge-in or endpointing configuration change didn’t make the agent talk over callers, or identifying turn-taking problems reported by users.

How it works: Compares the agent and user speaking segments from diarized (speaker-separated) audio and flags points where the agent’s speech begins before the user’s turn ends.

How to interpret: Lower is better. A low rate means the agent waits for the user to finish before responding; an elevated rate points to a barge-in or turn-taking problem worth tracing back to your endpointing settings.

Configuration: None (built-in).

Requires Audio · Unit interruptions per minute
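The counting step can be illustrated with a minimal sketch. It assumes diarization has already produced `(start, end)` segments in seconds for each speaker; the function name `interruption_rate` and the strict-overlap rule are illustrative assumptions, not the product's implementation.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s), hypothetical segment shape


def interruption_rate(agent: List[Segment], user: List[Segment],
                      call_duration_s: float) -> float:
    """Agent-over-user interruptions per minute of call audio."""
    if call_duration_s <= 0:
        return 0.0
    count = 0
    for agent_start, _ in agent:
        # Assumed rule: an interruption is an agent onset that falls
        # strictly inside a user segment. User-over-agent overlap is ignored.
        if any(u_start < agent_start < u_end for u_start, u_end in user):
            count += 1
    return count / (call_duration_s / 60.0)


agent = [(0.0, 2.0), (4.5, 6.0), (9.0, 11.0)]
user = [(2.5, 5.0), (6.5, 8.5)]
rate = interruption_rate(agent, user, call_duration_s=120.0)
print(rate)
```

Here only the agent onset at 4.5 s lands inside a user turn, so the two-minute call yields one interruption in total.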

Latency

Average silence gap between the user finishing a turn and the agent starting to respond.

What it measures: The average silence gap in seconds between the user finishing a turn and the agent starting to respond, counting gaps of at least 0.5 seconds.

When to use: Performance evaluation across model, provider, or pipeline changes, e.g., confirming a new LLM or TTS provider didn’t slow down responses, or identifying slow turns that make conversations feel unresponsive.

How it works: Reads the speaking segments detected for the call (voice activity detection runs upstream) and measures the silence between the end of each user segment and the start of the agent’s next one.

How to interpret: Lower is better: lower values indicate a faster, more responsive agent. Higher values may indicate performance issues or processing bottlenecks; track across runs to catch regressions when you change models or providers.

Configuration: None (built-in).

Requires Audio · Unit seconds
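A minimal sketch of the gap measurement follows, assuming speaking segments are available as `(start, end)` pairs in seconds. The name `average_latency` and the choice to return `None` when no gap qualifies are assumptions made for illustration; only the 0.5-second minimum comes from the description above.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_s, end_s), hypothetical segment shape


def average_latency(user: List[Segment], agent: List[Segment],
                    min_gap_s: float = 0.5) -> Optional[float]:
    """Mean silence between a user segment ending and the next agent onset."""
    agent_starts = sorted(start for start, _ in agent)
    gaps = []
    for _, user_end in user:
        # First agent onset at or after the end of this user segment.
        next_start = next((s for s in agent_starts if s >= user_end), None)
        if next_start is None:
            continue
        gap = next_start - user_end
        if gap >= min_gap_s:  # shorter gaps are not counted
            gaps.append(gap)
    return sum(gaps) / len(gaps) if gaps else None


user = [(0.0, 2.0), (5.0, 7.0), (10.0, 12.0)]
agent = [(3.0, 4.5), (7.2, 9.0), (13.5, 15.0)]
latency = average_latency(user, agent)
print(latency)
```

The middle turn has a gap of about 0.2 s, which falls under the minimum and is left out of the average.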

Speaking Time Percentage

Share of the call the configured speaker was actively talking. The roles are complementary, so multiple instances give a full picture of call composition.

What it measures: The percentage of total call duration that the configured speaker (default: agent) was actively speaking.

When to use: Analyzing call composition, e.g., checking whether the agent dominates the conversation, identifying excessive hold music, or measuring dead air.

How it works: Sums the configured role’s speaking segments and divides by total call duration; roles (agent, persona, silence, music) are complementary and sum to 100%.

How to interpret: Higher values mean the speaker held a greater share of the conversation; interpret relative to expected call composition. Create one instance per role for a full breakdown, and use the highlighted waveform regions to see where each segment falls.

Configuration: Speaker scope: agent, persona, or music (silence appears in the breakdown but is not selectable); default agent.

Requires Audio · Unit percent · Scope Audio
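The arithmetic behind the breakdown can be sketched as below. The function `speaking_time_breakdown` is hypothetical, and it assumes segments for different roles never overlap, so that silence is simply the remainder.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s), hypothetical segment shape


def speaking_time_breakdown(segments_by_role: Dict[str, List[Segment]],
                            call_duration_s: float) -> Dict[str, float]:
    """Percentage of the call attributed to each role, plus silence."""
    if call_duration_s <= 0:
        raise ValueError("call_duration_s must be positive")
    breakdown = {
        role: 100.0 * sum(end - start for start, end in segments) / call_duration_s
        for role, segments in segments_by_role.items()
    }
    # Assumption: roles do not overlap, so silence is whatever is left over.
    breakdown["silence"] = max(0.0, 100.0 - sum(breakdown.values()))
    return breakdown


breakdown = speaking_time_breakdown(
    {
        "agent": [(0.0, 25.0), (60.0, 75.0)],
        "persona": [(25.0, 60.0)],
        "music": [(80.0, 90.0)],
    },
    call_duration_s=100.0,
)
print(breakdown)
```

In this 100-second example the agent holds 40% of the call, the persona 35%, music 10%, and the remaining 15% is silence.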

Time to First Audio

How quickly the conversation produces its first audible sound.

What it measures: The time in milliseconds from the start of the recording until the first audible sound is detected.

When to use: Evaluating system or agent latency before any speech begins, e.g., measuring how long callers wait before a telephony agent picks up and greets them.

How it works: Detects the first audio frame with RMS energy (signal level) above a threshold and returns its timestamp.

How to interpret: Lower is better: lower values indicate a faster start. Under 1000 ms is responsive, 1000–3000 ms is acceptable, and above 3000 ms is noticeable lag. A value of -1 ms means no audio was detected, which usually indicates a technical failure or a silent recording.

Configuration: None (built-in).

Requires Audio · Unit milliseconds
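The detection step might look like the following sketch. The 20 ms frame length and the `rms_threshold` of 0.01 (on samples normalized to the range -1 to 1) are assumed values chosen for illustration; only the -1 sentinel for "no audio detected" comes from the description above.

```python
import math
from typing import Sequence


def time_to_first_audio_ms(samples: Sequence[float], sample_rate: int,
                           frame_ms: int = 20,
                           rms_threshold: float = 0.01) -> float:
    """Timestamp (ms) of the first frame whose RMS exceeds the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms > rms_threshold:
            return start * 1000.0 / sample_rate
    return -1.0  # no audible frame found


sample_rate = 8000
silence = [0.0] * 4000  # half a second of digital silence
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
ttfa = time_to_first_audio_ms(silence + tone, sample_rate)
print(ttfa)
```

Because the timestamp is the start of the first audible frame, resolution is limited to the frame length.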

Words per Message

Average response verbosity of the agent.

What it measures: The average number of words the agent used per message across the conversation.

When to use: Enforcing brevity guidelines, e.g., verifying a prompt change didn’t make the agent ramble, or comparing verbosity across prompt versions.

How it works: Counts words in each agent message and averages across the conversation.

How to interpret: Use this to gauge response verbosity; there is no inherent good direction, so compare against your length guidelines. Track across runs to catch verbosity drift after prompt changes.

Configuration: None (built-in).

Requires Transcript · Unit words per message

Audio signal & prosody

Abrupt Pitch Changes

How often the agent’s pitch jumps abruptly during speech: sudden, jittery transitions that make synthesized voices sound unnatural.

What it measures: How often pitch changes abruptly between frames during speech, reported as events per minute.

When to use: Identifying voice models with unstable or jittery pitch, or comparing voice configurations for smoothness before rolling one out.

How it works: Compares pitch frame-by-frame, flags frames exceeding a change threshold, and groups consecutive flagged frames into segments.

How to interpret: Lower is better: lower values indicate smoother, more natural-sounding delivery; higher values suggest jittery or unstable pitch.

Configuration: Pitch-change threshold in Hz (significant_changes_threshold_hz, default 200).

Requires Audio · Unit per minute · Scope Audio
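The flag-and-group logic can be sketched as follows. It assumes a pitch track is already available as one F0 value per frame, with 0 marking unvoiced frames, and a 10 ms hop between frames. Both are assumptions, as is normalizing by the total analyzed duration. Only the parameter name `significant_changes_threshold_hz` and its default come from the configuration above.

```python
from typing import Sequence


def abrupt_pitch_changes_per_minute(
        f0_hz: Sequence[float], hop_s: float = 0.01,
        significant_changes_threshold_hz: float = 200.0) -> float:
    """Count runs of consecutive large frame-to-frame pitch jumps per minute."""
    events = 0
    in_event = False
    for prev, cur in zip(f0_hz, f0_hz[1:]):
        voiced = prev > 0 and cur > 0  # skip transitions touching unvoiced frames
        flagged = voiced and abs(cur - prev) > significant_changes_threshold_hz
        if flagged and not in_event:
            events += 1  # a new segment of flagged frames begins
        in_event = flagged
    minutes = len(f0_hz) * hop_s / 60.0
    return events / minutes if minutes > 0 else 0.0


# 30 seconds of steady 120 Hz pitch with two isolated spikes.
f0 = [120.0] * 100 + [400.0] + [120.0] * 100 + [420.0] + [120.0] * 2798
rate = abrupt_pitch_changes_per_minute(f0)
print(rate)
```

Each spike produces two flagged transitions (up and back down), which are grouped into a single event, so the example contains two events in half a minute.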

Audio Frequency

Share of voiced speech sitting above a pitch floor, a fingerprint of a voice model’s frequency distribution.

What it measures: The percentage of voiced speech frames with a fundamental frequency above the configured threshold.

When to use: Comparing voice models or speech synthesis output on frequency distribution, e.g., checking that two TTS providers render the same voice with a similar pitch profile.

How it works: Detects voiced frames and computes the fraction whose fundamental frequency clears the threshold (300 Hz by default).

How to interpret: Use it to check that a voice model’s frequency profile stays consistent, e.g., after a provider switch or a telephony codec change shifts the spectrum. There is no universal good direction; compare against a known-good baseline for the same voice.

Configuration: Frequency threshold in Hz (default 300). The configurable variant also sets the direction (above or below).

Requires Audio · Unit percent · Scope Audio

Background Noise

Cleanliness of the call audio relative to background noise.

What it measures: The signal-to-noise ratio (SNR) of the call audio in decibels, i.e., how loud the speech is relative to the background noise floor.

When to use: Audio quality assessment and identifying poor recording conditions, e.g., validating telephony line quality before low SNR starts degrading transcription accuracy.

How it works: Compares the overall speech level against the quietest audio windows to estimate the ratio.

How to interpret: Higher values mean cleaner audio. Above 20 dB is excellent, 10–20 dB is acceptable for most applications, and below 10 dB may significantly impair speech recognition and comprehension.

Configuration: None (built-in).

Requires Audio · Unit decibels

Clipping Artifact

Audio clipping where the signal saturates and distorts.

What it measures: The fraction of samples driven into saturation, causing distortion.

When to use: When agents sound distorted or “crackly,” or when you suspect TTS output levels are misconfigured; this is common in pipelines where gain staging is not controlled.

How it works: Scans the raw audio for runs of consecutive samples above the amplitude threshold and reports the clipped sample fraction; a minimum run length filters out isolated spikes.

How to interpret: Lower is better: a higher fraction indicates more severe or frequent clipping. Above ~1% typically produces audible distortion. High clipping with low dropout severity points to a gain staging issue upstream of your TTS provider.

Configuration: None (built-in).

Requires Audio · Unit fraction · Scope Audio
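The run-based scan can be sketched as below. The amplitude `threshold` of 0.99 (on samples normalized to the range -1 to 1) and the `min_run` of 3 samples are assumed values for illustration; the actual detector settings are not documented here.

```python
from typing import Sequence


def clipped_fraction(samples: Sequence[float], threshold: float = 0.99,
                     min_run: int = 3) -> float:
    """Fraction of samples that sit in saturated runs of at least min_run."""
    if not samples:
        return 0.0
    clipped = 0
    run = 0
    for x in samples:
        if abs(x) >= threshold:
            run += 1
        else:
            if run >= min_run:  # runs shorter than min_run are isolated spikes
                clipped += run
            run = 0
    if run >= min_run:  # account for a run that reaches the end of the audio
        clipped += run
    return clipped / len(samples)


samples = [0.1, 1.0, 0.1] + [1.0] * 4 + [0.1] * 13
fraction = clipped_fraction(samples)
print(fraction)
```

The single full-scale sample near the start is ignored as a spike, while the run of four saturated samples counts toward the fraction.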

Codec Artifact

Codec-induced spectral distortion from audio encoding: the “muddy,” over-compressed quality left by low-bitrate codecs or repeated encode/decode cycles.

What it measures: The mean severity of frames with anomalous spectral characteristics relative to the call’s own median.

When to use: When audio sounds digitally degraded in a way clipping or dropout doesn’t explain; this is common when audio passes through narrow-band telephony codecs (e.g., G.711) or multiple encode/decode cycles.

How it works: Extracts codec-sensitive spectral features and applies MAD-based (median absolute deviation) outlier detection across segments, normalizing the resulting z-score to a bounded severity.

How to interpret: Lower is better: a higher mean severity indicates more pronounced encoding distortion. High codec severity with low clipping and dropout suggests the artifact is encoding-induced, not a gain or signal issue. Narrow-band telephony codecs (e.g., G.711) carry a naturally elevated baseline.

Configuration: Optional preset and detector-tuning overrides via metric metadata.

Requires Audio · Unit severity · Scope Audio

Dropout Artifact

Brief audio dropouts where the signal cuts out during speech.

What it measures: The density of dropout events per minute: brief windows where the signal abruptly falls to near-silence with a steep onset edge.

When to use: When callers report choppy audio or cut-out moments, when you see unexplained gaps in waveform visualizations, or to detect network-induced packet loss on telephony integrations.

How it works: Computes a rolling energy baseline and flags short windows where RMS drops sharply; the steep-edge requirement distinguishes dropouts from natural pauses.

How to interpret: Lower is better. Isolated events (under 0.5/min) are often benign; rates above 1–2/min produce noticeably choppy audio. Clustered dropouts suggest a network or buffering issue; distributed ones suggest a TTS rendering problem.

Configuration: None (built-in).

Requires Audio · Unit per minute · Scope Audio

Loop Detection

Repeated agent utterances indicating a TTS or dialogue loop.

What it measures: The count of distinct loop clusters where the agent repeats an utterance in a way that signals a TTS or dialogue glitch.

When to use: When agents occasionally repeat themselves verbatim in a way that doesn’t match the conversation script, or when TTS audio sounds like it stuttered and replayed a segment.

How it works: Computes MFCC fingerprints (compact acoustic signatures) across the audio and detects recurring segments, requiring high acoustic similarity between the repeats to distinguish true TTS loops from natural lexical repetition.

How to interpret: Lower is better: more distinct loop clusters mean the agent is repeating itself more often. Any confirmed loop is worth investigating, as true loops are rarely intentional.

Configuration: Optional preset and detector-tuning overrides via metric metadata (minimum loop duration, repetition count, acoustic-similarity threshold).

Requires Audio · Unit count · Scope Audio

Non-Expressive Pauses

Pauses that arrive without preceding pitch movement, which can make the agent sound flat or monotone.

What it measures: How often the agent pauses without any preceding pitch movement, reported as events per minute.

When to use: Evaluating whether a voice sounds expressive and natural, e.g., detecting monotone delivery in synthesized speech or comparing voice configurations for expressiveness.

How it works: Detects pauses above a minimum duration and examines the pitch trajectory in the frames just before each pause, flagging those with minimal variation.

How to interpret: Lower is better: lower values mean pauses are accompanied by natural inflection; higher values suggest a flat, robotic cadence where pauses arrive without natural pitch cues.

Configuration: Minimum pause duration (default 0.6 s) and pre-pause window (default 5 frames).

Requires Audio · Unit per minute · Scope Audio

Pause Analysis

How often the agent pauses mid-speech, with anomalous pauses flagged.

What it measures: How frequently the agent pauses during speech (pauses per minute) and flags pauses that are unusually long relative to the speaker’s own baseline.

When to use: Identifying unnatural or excessive hesitations in agent speech, detecting processing delays that manifest as in-speech pauses, or evaluating fluency across different voice configurations.

How it works: Identifies silent gaps within agent turns and computes a MAD-based z-score (how unusual a pause is relative to the speaker’s own baseline) for each, flagging statistical outliers; persona pauses and inter-turn gaps are excluded.

How to interpret: Lower values indicate more fluent speech. Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts. The detail view shows each pause with its timestamp and duration.

Configuration: A preset (strict, normal, loose) or an explicit minimum pause duration is required; optional speaker role (default agent) and anomaly z-score threshold.

Requires Audio · Unit pauses per minute · Scope Audio
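For readers unfamiliar with MAD-based z-scores, the sketch below shows one common formulation: deviation from the median, divided by the median absolute deviation scaled by 1.4826 so that it is comparable to a standard deviation for normally distributed data. The scaling constant, the threshold of 3.5, and the one-sided "unusually long" test are conventional choices assumed here, not documented settings of this metric.

```python
import statistics
from typing import List, Sequence


def mad_z_scores(values: Sequence[float]) -> List[float]:
    """Robust z-scores: distance from the median in scaled-MAD units."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0.0 for _ in values]  # no spread, nothing can be an outlier
    return [(v - median) / (1.4826 * mad) for v in values]


def anomalous_pauses(pause_durations_s: Sequence[float],
                     z_threshold: float = 3.5) -> List[float]:
    """Pauses that are unusually long relative to the speaker's own baseline."""
    scores = mad_z_scores(pause_durations_s)
    return [d for d, z in zip(pause_durations_s, scores) if z > z_threshold]


pauses = [0.30, 0.35, 0.32, 0.28, 0.31, 1.80]
flagged = anomalous_pauses(pauses)
print(flagged)
```

Using the median and MAD instead of the mean and standard deviation keeps the baseline from being dragged upward by the very outliers the metric is trying to flag.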

Perceived Loudness

Overall call loudness, measured as integrated loudness in LUFS.

What it measures: The integrated loudness of the call audio in LUFS (ITU-R BS.1770). LUFS is an absolute loudness unit, so it is better suited than relative decibels for comparing perceived volume across runs.

When to use: Catching volume regressions after changing a TTS provider, voice, codec, normalization step, or telephony path; for example, when an agent becomes noticeably quieter or louder even though the conversation logic is unchanged.

How it works: For whole-call metrics, Coval reads the loudness measured during audio normalization, so the metric does not re-download or re-process the audio. For speaker-scoped metrics, Coval re-measures loudness on that speaker’s audio only. The detail view can mark loudness outlier regions on the waveform when short windows differ sharply from the clip’s integrated loudness.

How to interpret: Values closer to 0 LUFS are louder; more negative values are quieter. Track against your expected baseline for the same channel and voice. Large shifts usually indicate a gain, encoding, TTS, or normalization change. Speaker-scoped results are useful when one side of the call is too quiet or too loud relative to the other.

Configuration: Optional outlier sensitivity in LU and target loudness for display context.

Requires Audio · Unit LUFS · Scope Audio

Phoneme Stretch

Unnaturally sustained phonemes that signal a TTS glitch.

What it measures: The total duration of phonemes held well beyond natural duration, which signal a TTS synthesis glitch.

When to use: When TTS audio sounds like it is “freezing” on a syllable or vowel; this is common when synthesis models hit unusual input such as long numbers, rare proper nouns, or SSML edge cases.

How it works: Flags regions that simultaneously satisfy voiced detection, low pitch jitter, stable fundamental frequency, and minimal spectral movement; all four must hold to exclude natural expressive lengthening.

How to interpret: Lower is better. Events under 0.5 s may be expressive prosody, while events above ~1.5 s are almost certainly synthesis defects; higher total stretch duration indicates more severe TTS artifacts.

Configuration: Optional preset and detector-tuning overrides via metric metadata (minimum stretch duration, voicing and jitter thresholds).

Requires Audio · Unit seconds · Scope Audio

Pitch Variability

How much the agent’s pitch moves over a call: flat and monotone versus lively and expressive.

What it measures: The variation in the agent’s pitch (fundamental frequency), reported as the standard deviation of F0 in semitones. Because it is measured in semitones, the score is independent of the speaker’s vocal register, so higher- and lower-voiced agents are directly comparable.

When to use: Catching robotic, monotone delivery, e.g., comparing expressiveness across TTS voices, or verifying a voice settings change didn’t flatten the agent’s intonation.

How it works: Tracks the agent’s pitch across all voiced speech and measures how widely it ranges; the deep dive shows the pitch contour over time with monotone and expressive stretches highlighted, alongside the agent’s mean pitch and range in Hz for context.

How to interpret: Higher values indicate more expressive, dynamic intonation; lower values indicate flatter, more monotone delivery. A very low value can signal robotic or disengaged-sounding speech; a natural conversational voice typically shows moderate variation.

Configuration: None (built-in).

Requires Audio · Unit semitones (st) · Scope Audio
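The register-independence claim follows from the semitone conversion, 12 · log2(f / f_ref): scaling every frequency by a constant only shifts the semitone values, which leaves their standard deviation unchanged. The sketch below demonstrates this under a few assumptions: the F0 track is given per frame with 0 for unvoiced frames, the reference is the speaker's median pitch, and the population standard deviation is used.

```python
import math
import statistics
from typing import Sequence


def pitch_variability_semitones(f0_hz: Sequence[float]) -> float:
    """Standard deviation of voiced F0, expressed in semitones."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks unvoiced frames
    if len(voiced) < 2:
        return 0.0
    reference = statistics.median(voiced)
    # The choice of reference shifts every value equally, so it does not
    # affect the standard deviation.
    semitones = [12.0 * math.log2(f / reference) for f in voiced]
    return statistics.pstdev(semitones)


low_voice = [100.0, 110.0, 125.0, 0.0, 105.0, 95.0]
high_voice = [2.0 * f for f in low_voice]  # same contour, one octave higher
low_score = pitch_variability_semitones(low_voice)
high_score = pitch_variability_semitones(high_voice)
print(low_score, high_score)
```

The two contours differ by an octave in Hz but produce the same score, whereas a standard deviation computed directly in Hz would double for the higher voice.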

Spectrogram Pitch Analysis

Whether the audio carries natural upper-frequency content; synthetic or bandwidth-limited audio often lacks energy in higher frequency ranges.

What it measures: Whether the agent’s audio contains natural upper-frequency content across the speech spectrum.

When to use: Detecting bandwidth-limited or muffled synthesized speech, e.g., comparing voice model configurations for spectral richness, or catching voices that lack harmonic upper-frequency energy.

How it works: Splits the audio into windows, measures the fraction of upper-frequency bins above a noise floor in each, and passes when the average fill ratio meets the naturalness threshold.

How to interpret: Returns YES when the audio passes the naturalness threshold, NO when it appears bandwidth-limited or synthetic. The detail view shows the fill ratio per window across the recording timeline.

Configuration: Optional analysis overrides via metric metadata (segment length, naturalness threshold, noise floor).

Requires Audio · Unit YES/NO · Scope Audio

Speech Artifact Anomaly

Composite delivery-quality score across all artifact analyzers: a single number for tracking overall artifact health.

What it measures: Overall speech delivery quality combining clipping, dropout, loop, stretch, syllable-rate, voice quality, timbre, and pause artifact checks.

When to use: As your primary artifact health signal when you want a single number to track across runs or set an alert threshold on, drilling into the breakdown only when it degrades.

How it works: Collects a per-analyzer severity and combines them via weighted average, redistributing weight when an analyzer is skipped so the score stays meaningful.

How to interpret: Scores range from 0 (severe issues) to 1 (clean), so higher is better: 0.8 and above is good, 0.6–0.8 fair, 0.4–0.6 poor, and below 0.4 severe. Drill into the per-analyzer breakdown to see which failure mode drives the score.

Configuration: None (built-in).

Requires Audio · Unit score 0–1 · Scope Audio
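Weight redistribution in a weighted average amounts to dividing by the sum of the weights that were actually used. The sketch below illustrates that idea only. The weights, the assumption that each severity lies in the range 0 to 1, and the mapping `score = 1 - weighted severity` are all hypothetical, since the actual weighting scheme is not documented here.

```python
from typing import Dict, Optional


def composite_score(severities: Dict[str, Optional[float]],
                    weights: Dict[str, float]) -> Optional[float]:
    """Combine per-analyzer severities (0 = clean, 1 = severe) into a 0-1 score."""
    # Skipped analyzers report None and drop out of both sums, which
    # implicitly redistributes their weight across the remaining analyzers.
    active = {name: sev for name, sev in severities.items()
              if sev is not None and name in weights}
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0:
        return None  # nothing ran, so no score can be reported
    mean_severity = sum(weights[name] * sev for name, sev in active.items()) / total_weight
    return 1.0 - mean_severity


weights = {"clipping": 2.0, "dropout": 2.0, "loop": 1.0}  # hypothetical weights
severities = {"clipping": 0.1, "dropout": 0.3, "loop": None}  # loop analyzer skipped
score = composite_score(severities, weights)
print(score)
```

Without redistribution, the skipped analyzer would count as a perfect result and inflate the score; dividing by the active weights avoids that.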

Speech Tempo

Agent speech rate measured in phonemes.

What it measures: The agent’s speech rate as phonemes produced per second of speaking time.

When to use: Tuning TTS speaking rate, e.g., verifying a provider’s speed setting lands in a comfortable range, or comparing pacing across voice models.

How it works: Counts phonemes per interval over the duration of each speech segment.

How to interpret: Higher values indicate faster delivery. Roughly 10–15 phonemes per second (pps) is a comfortable target; 15–20 pps is fast but can remain comprehensible, above 20 pps is hard to follow, and below 10 pps is too slow.

Configuration: None (built-in).

Requires Audio · Unit phonemes per second · Scope Audio

Syllable Rate

Agent speaking rate in syllables, with out-of-range pacing flagged.

What it measures: The agent’s speaking rate in syllables per second, flagging segments where the pace falls outside the normal range for the configured language.

When to use: When agent speech sounds robotic or rushed, or to validate that a new TTS model produces natural prosodic rhythm before deploying it.

How it works: Measures absolute syllable rate alongside rhythm diagnostics (nPVI, inter-syllable variation) and penalizes rates outside the natural range.

How to interpret: Values near 4–5 syl/s are typical for English, and the natural range spans roughly 2.5–6.0 syl/s: above it suggests rushed synthesis, below it sluggish or over-paused delivery. Examine the rate together with rhythm, since uniformly low variability can indicate robotic cadence.

Configuration: Language (default English), a preset, and explicit low/high rate-threshold overrides.

Requires Audio · Unit syllables per second · Scope Audio

Vocal Fry

Low, creaky vocal quality from irregular vocal cord vibration, typically at the end of phrases.

What it measures: The total duration of vocal fry in the agent’s speech.

When to use: Evaluating whether a voice has creaky or rough-sounding artifacts, e.g., monitoring vocal quality across voice configurations, or identifying voices where fry degrades the listener experience.

How it works: Identifies frames with simultaneously low pitch, high acoustic roughness, and irregular vocal cord vibration, then groups consecutive flagged frames into fry segments.

How to interpret: Lower is better: fewer seconds of fry indicate cleaner, more professional-sounding audio. Occasional brief fry is normal in natural speech; sustained or frequent fry reduces perceived quality.

Configuration: Optional detector-tuning overrides via metric metadata (pitch floor and ceiling, jitter and harmonics-to-noise thresholds).

Requires Audio · Unit seconds · Scope Audio

Voice Quality

Acoustic naturalness of the agent’s synthesized voice, combining four acoustic dimensions into one score.

What it measures: A composite naturalness score from clarity (CPPS, a voice-clarity measure), pitch perturbation (jitter), amplitude perturbation (shimmer), and pitch variability.

When to use: When onboarding a new TTS provider or voice model, monitoring for voice degradation over time, or investigating reports of the agent sounding robotic, breathy, or unsteady.

How it works: Computes CPPS on LUFS-normalized audio and jitter/shimmer on the raw signal over sliding windows, then combines the four dimensions into a weighted score.

How to interpret: Higher scores mean more natural: 0.8 and above is excellent, 0.6–0.8 good, 0.4–0.6 fair, below 0.4 poor. High jitter or shimmer without CPPS degradation often reflects expressive prosody rather than a synthesis defect; check the sub-scores in the breakdown before drawing conclusions.

Configuration: None (built-in).

Requires Audio · Unit score 0–1 · Scope Audio

Volume Variance

Consistency of the agent’s volume across the call.

What it measures: How consistently the agent maintains volume, reported as the standard deviation of loudness in decibels.

When to use: Identifying erratic loudness changes in agent speech, e.g., ensuring consistent audio quality across a call, or comparing voice model configurations for volume stability.

How it works: Divides agent speech into fixed-length intervals, measures each interval’s loudness, and reports the standard deviation across intervals.

How to interpret: Lower is better: lower values indicate steadier, more even audio output. The detail view shows only the problematic intervals (too loud or too soft) with their timestamps and dB values.

Configuration: Loud/soft threshold preset (strict, normal, lenient) or individual dB and interval overrides.

Requires Audio · Unit decibels · Scope Audio
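The interval-and-deviation calculation can be sketched as below. It assumes loudness is approximated by RMS level in dB relative to full scale, a one-second interval, and the population standard deviation; the actual loudness measure and interval length are not specified above.

```python
import math
import statistics
from typing import List, Sequence


def volume_std_db(samples: Sequence[float], sample_rate: int,
                  interval_s: float = 1.0) -> float:
    """Standard deviation of per-interval RMS level, in dB."""
    interval_len = int(sample_rate * interval_s)
    levels_db: List[float] = []
    # Only full intervals are measured; a trailing partial interval is dropped.
    for start in range(0, len(samples) - interval_len + 1, interval_len):
        chunk = samples[start:start + interval_len]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        if rms > 0:  # skip digitally silent intervals, where dB is undefined
            levels_db.append(20.0 * math.log10(rms))
    return statistics.pstdev(levels_db) if len(levels_db) >= 2 else 0.0


sample_rate = 100  # deliberately tiny rate to keep the example small
steady = [0.5] * 300
uneven = [0.1] * 100 + [1.0] * 100
steady_std = volume_std_db(steady, sample_rate)
uneven_std = volume_std_db(uneven, sample_rate)
print(steady_std, uneven_std)
```

The uneven clip has one interval near -20 dB and one near 0 dB, giving a standard deviation of about 10 dB, while the steady clip scores 0.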

Volume-Pitch Misalignment

Moments where pitch and volume move in opposite directions; for example, a voice getting louder while its pitch unexpectedly drops.

What it measures: How often pitch and volume move in opposite directions during speech, reported as mean misalignment severity.

When to use: Evaluating text-to-speech engine quality, e.g., detecting prosody issues that sound “off” to listeners, or comparing voice model configurations.

How it works: Analyzes frame-by-frame pitch and volume changes, flags divergent frames, and scores each by how unusual the divergence is relative to the clip’s own baseline (z-scored).

How to interpret: Lower is better: lower values indicate better prosodic consistency. Severity around 0–1 is unremarkable, 1–2 means one or both signals sit about a standard deviation above the clip mean, and 2+ marks genuinely unusual frames. Because severity is relative to the clip, scores are comparable across speakers and recordings.

Configuration: Minimum intensity change to flag an event (min_volume_change_for_pitch_misalignment, default 7 dB).

Requires Audio · Unit severity · Scope Audio

Agent behavior

Agent Repeats Itself

Whether the agent repeated the same phrases or questions.

What it measures: Whether the agent repeated the same phrases or questions multiple times in a conversation.

When to use: Evaluating naturalness and language diversity, e.g., catching an agent that re-asks a question it already asked, or falls back on the same stock phrase when it gets stuck.

How it works: By default an LLM judge reviews the transcript for problematic repetition; a regex mode is available for fully deterministic pattern-based detection.

How to interpret: Returns YES if problematic repetition was found, NO otherwise. NO is the good direction: diverse, non-repetitive language sounds more natural.

Configuration: Detection mode: llm_judge (default; optional custom prompt and judge model) or regex (pattern, similarity threshold, minimum repetition count, case sensitivity).

Requires Transcript · Unit YES/NO · Scope Transcript