Trace Metrics

Trace metrics read your agent’s OpenTelemetry (OTel) spans rather than the transcript or audio, surfacing internal behavior like model latency, token consumption, and tool calls. They appear in the Tracing view alongside the spans they’re computed from. Use Custom Trace to extract any numeric signal your agent emits. All metrics on this page require your agent to send OTel traces to Coval — see the OpenTelemetry Traces guide for setup and the full span naming reference; if traces are missing, these metrics report an error at execution time.

Metric	Tells you	Unit
Custom Trace	Any numeric signal from your spans	depends
LLM Time to First Byte	Language-model responsiveness	seconds
LLM Token Usage	Token consumption (cost proxy)	tokens
STT Time to First Byte	Speech-to-text responsiveness	seconds
STT Word Error Rate	Your STT accuracy vs. Coval’s reference	WER 0–1
STT Word Error Rate (Audio Upload)	Your STT accuracy vs. your ground truth	WER 0–1
Tool Call Count	Tool usage, with per-tool breakdown	count
TTS Time to First Byte	Text-to-speech responsiveness	seconds

Custom Trace

Measures any numeric attribute, count, error rate, or success rate from a named OTel trace span, letting you turn any signal your agent already instruments into a trackable metric. What it measures — Any signal your agent already captures in its spans — latency, confidence scores, token counts, retries — extracted from a chosen attribute and aggregated across all matching spans in a simulation. When to use — When a signal you care about is already in your traces but has no built-in metric: tracking document-retrieval latency across runs to catch regressions after index or embedding changes, monitoring tail latency (p90) of an external API your agent depends on, or extracting per-call confidence scores from llm spans. How it works — Queries the named OTel span (any built-in name like llm, tts, stt, llm_tool_call, or a custom name like document_retrieval), reads the chosen attribute, and applies the selected aggregation method across every matching span in the simulation. How to interpret — Depends on the attribute and aggregation you pick; the good direction follows the underlying signal (e.g. lower is better for latency). Pick the aggregation to match the question: average or median for typical-case behavior, p90/p95/p99 or max for tail and worst-case performance, sum for accumulated counters, and compare values across runs to catch regressions after changes to your agent. Configuration — span_name and aggregation_method are required; metric_attribute is required for numeric aggregations and optional for count, error_rate, and success_rate. An optional trace_filter scopes which spans are included, and an optional display unit labels the result.

Field	Description
Span Name	The name of the OTel span to query (e.g. `llm`, `tts`, `stt`, `llm_tool_call`, or any custom span name you emit).
Metric Attribute	The span attribute to extract the value from (e.g. `retrieval_latency_ms`, `confidence_score`, or another custom numeric attribute key).
Aggregation Method	How to aggregate the extracted values across all matching spans in the simulation.

Aggregation methods:

Method	Description
Average	Mean value across all matching spans. Best for typical-case latency or scores.
Median	Median value across all matching spans. More robust to outliers than average.
p90	90th-percentile value. Best for understanding worst-case performance at scale.
p95	95th-percentile value. Useful for tail latency on larger samples.
p99	99th-percentile value. Useful for rare but severe latency spikes.
Max	Maximum value observed across all matching spans. Useful for worst-case detection.
Min	Minimum value observed across all matching spans.
Sum	Total value across all matching spans. Useful for token counts, cost-like counters, and accumulated work.
Count	Number of matching spans. Useful for tool calls, retries, fallbacks, handoffs, and critical events.
Error Rate	Percentage of matching spans with an error status.
Success Rate	Percentage of matching spans with a successful status.

For count, error_rate, and success_rate, the metric aggregates matching spans directly. For numeric aggregations such as average, p95, or sum, choose a numeric span attribute. Requires OTel traces with your configured span · Unit depends on aggregation

LLM Time to First Byte

How quickly the language model returns its first token — a key driver of how responsive your agent feels. What it measures — The average time in seconds from when the LLM received a request to when it returned its first token, averaged across all turns. When to use — Identifying slow LLM providers, comparing latencies between candidate models, or checking whether prompt-length changes actually made responses faster. Pair it with STT and TTS Time to First Byte to isolate which pipeline stage drives end-to-end latency. How it works — Reads the metrics.ttfb attribute on OTel llm spans and averages the request-to-first-token interval across every turn. How to interpret — Lower is better; lower values indicate a more responsive language model. Compare across runs to catch latency regressions after a model or prompt change, and compare against the other Time to First Byte metrics to see whether the LLM is the bottleneck. Configuration — None (built-in). Requires OTel llm spans with metrics.ttfb · Unit seconds

LLM Token Usage

Total LLM tokens consumed across the conversation — your primary cost proxy. What it measures — Total LLM tokens consumed (input + output) across all turns in the conversation. When to use — Cost monitoring, tracking token growth across runs, comparing prompt strategies for efficiency, or flagging individual conversations that consume excessive tokens. How it works — Sums the input and output token counts from the gen_ai.usage.* attributes on OTel llm spans across every turn. How to interpret — Token counts vary by model and use case, so track this over time to establish a baseline for your agent. Sudden spikes can indicate prompt injection, runaway tool loops, or excessively long conversations; combine with turn count to compute average tokens per turn. Configuration — None (built-in). Requires OTel llm spans with gen_ai.usage.* token attributes · Unit tokens

STT Time to First Byte

How quickly the speech-to-text service returns its first transcription result. What it measures — The average time in seconds from when audio was sent to the STT service to when it returned the first transcription result, averaged across all turns. When to use — Evaluating or comparing STT providers on speed, or diagnosing why the agent is slow to start processing user input — STT is the first stage of the pipeline, so its delay pushes everything downstream. How it works — Reads the metrics.ttfb attribute on OTel stt spans and averages the audio-sent-to-first-result interval across every turn. How to interpret — Lower is better; it signals faster STT responsiveness. Compare against LLM and TTS Time to First Byte to determine whether speech recognition is the slow stage in your pipeline. Configuration — None (built-in). Requires OTel stt spans with metrics.ttfb · Unit seconds

STT Word Error Rate

Accuracy of your agent’s speech-to-text against Coval’s reference transcript — no manual ground truth or test data needed. What it measures — The word error rate of your agent’s speech-to-text, comparing its STT output from trace spans against Coval’s reference transcript. When to use — Comparing STT providers (e.g. Deepgram vs. Whisper vs. Google), regression-testing an STT provider swap, diagnosing why your agent misunderstands users, or tracking STT quality over time. How it works — Reads the transcript attribute on OTel stt spans (the hypothesis) and compares it against Coval’s auto-generated transcription of the persona’s speech (the reference). The older stt.transcription alias is accepted, but new integrations should emit transcript. How to interpret — Lower is better. A value of 0.0 means perfect accuracy; 1.0 means fully incorrect. Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally; for the WER formula, see Transcription Error. Configuration — None (built-in). Requires the transcript attribute on each stt span — see Instrumenting STT Spans. If your provider exposes utterance confidence, also send stt.confidence to make low-confidence turns easier to inspect. Requires OTel stt spans with transcript · Unit WER (0–1)

STT Word Error Rate (Audio Upload)

Accuracy of your agent’s speech-to-text against a transcript you provide. What it measures — The word error rate of your agent’s speech-to-text against a known-correct transcript you provide in the test case’s expected output. When to use — With audio upload test cases where you know exactly what was said — regression-testing STT quality against a canonical script, or benchmarking providers on a fixed recording. How it works — Reads the transcript attribute on OTel stt spans (the hypothesis) and compares it against the ground_truth_transcript in the test case’s expected output (the reference). Ground truth can be plain text, labeled text with timestamps, or JSON with a messages array; when role labels are present, only persona/user lines are used. How to interpret — Lower is better, same scale as STT Word Error Rate. Because the reference is your own known-correct text rather than Coval’s transcription, this metric isolates your STT provider’s accuracy without any variance from the reference side. Configuration — None (built-in). Requires the transcript attribute on each stt span and a ground_truth_transcript key in the test case’s expected output. Requires OTel stt spans with transcript + a ground-truth transcript on the test case · Unit WER (0–1)

Tool Call Count

Total tool calls made by the agent, with a per-tool breakdown. What it measures — The total number of tool calls made by the agent during the conversation, with a per-tool breakdown. When to use — Verifying the agent uses tools as expected for a scenario — confirming a lookup tool actually fires on account questions, or catching conversations with excessive or missing tool usage after a prompt change. How it works — Counts llm_tool_call span invocations detected in the OTel trace, broken down by tool name. How to interpret — Compare the count against the tool usage you expect for the scenario. Unexpectedly high counts suggest over-reliance on tools or unnecessary retries and round-trips; zero calls on a tool-dependent scenario flags a failure — the agent never invoked the tool it needed. Configuration — None (built-in). Requires OTel llm_tool_call spans · Unit count

TTS Time to First Byte

How quickly the text-to-speech service returns its first audio byte. What it measures — The average time in seconds from when text was sent to the TTS service to when it returned the first audio byte, averaged across all turns. When to use — Evaluating or comparing TTS providers, or identifying bottlenecks in the audio generation stage — TTS is the last hop before the user hears anything, so its delay lands directly on perceived response time. How it works — Reads the metrics.ttfb attribute on OTel tts spans and averages the text-sent-to-first-byte interval across every turn. How to interpret — Lower is better; it signals faster speech synthesis. Compare against LLM and STT Time to First Byte to determine whether speech synthesis is the slow stage in your pipeline. Configuration — None (built-in). Requires OTel tts spans with metrics.ttfb · Unit seconds

Simulate

Evaluate

Observe

Review

Cookbooks

Sofia

Reference

Custom Trace

LLM Time to First Byte

LLM Token Usage

STT Time to First Byte

STT Word Error Rate

STT Word Error Rate (Audio Upload)

Tool Call Count

TTS Time to First Byte

​Custom Trace

​LLM Time to First Byte

​LLM Token Usage

​STT Time to First Byte

​STT Word Error Rate

​STT Word Error Rate (Audio Upload)

​Tool Call Count

​TTS Time to First Byte

Custom Trace

LLM Time to First Byte

LLM Token Usage

STT Time to First Byte

STT Word Error Rate

STT Word Error Rate (Audio Upload)

Tool Call Count

TTS Time to First Byte