| Metric | Tells you | Unit |
|---|---|---|
| Custom Trace | Any numeric signal from your spans | depends |
| LLM Time to First Byte | Language-model responsiveness | seconds |
| LLM Token Usage | Token consumption (cost proxy) | tokens |
| STT Time to First Byte | Speech-to-text responsiveness | seconds |
| STT Word Error Rate | Your STT accuracy vs. Coval’s reference | WER 0–1 |
| STT Word Error Rate (Audio Upload) | Your STT accuracy vs. your ground truth | WER 0–1 |
| Tool Call Count | Tool usage, with per-tool breakdown | count |
| TTS Time to First Byte | Text-to-speech responsiveness | seconds |
Custom Trace
Measures any numeric attribute, count, error rate, or success rate from a named OTel trace span, letting you turn any signal your agent already instruments into a trackable metric. What it measures — Any signal your agent already captures in its spans — latency, confidence scores, token counts, retries — extracted from a chosen attribute and aggregated across all matching spans in a simulation. When to use — When a signal you care about is already in your traces but has no built-in metric: tracking document-retrieval latency across runs to catch regressions after index or embedding changes, monitoring tail latency (p90) of an external API your agent depends on, or extracting per-call confidence scores fromllm spans.
How it works — Queries the named OTel span (any built-in name like llm, tts, stt, llm_tool_call, or a custom name like document_retrieval), reads the chosen attribute, and applies the selected aggregation method across every matching span in the simulation.
How to interpret — Depends on the attribute and aggregation you pick; the good direction follows the underlying signal (e.g. lower is better for latency). Pick the aggregation to match the question: average or median for typical-case behavior, p90/p95/p99 or max for tail and worst-case performance, sum for accumulated counters, and compare values across runs to catch regressions after changes to your agent.
Configuration — span_name and aggregation_method are required; metric_attribute is required for numeric aggregations and optional for count, error_rate, and success_rate. An optional trace_filter scopes which spans are included, and an optional display unit labels the result.
| Field | Description |
|---|---|
| Span Name | The name of the OTel span to query (e.g. llm, tts, stt, llm_tool_call, or any custom span name you emit). |
| Metric Attribute | The span attribute to extract the value from (e.g. retrieval_latency_ms, confidence_score, or another custom numeric attribute key). |
| Aggregation Method | How to aggregate the extracted values across all matching spans in the simulation. |
| Method | Description |
|---|---|
| Average | Mean value across all matching spans. Best for typical-case latency or scores. |
| Median | Median value across all matching spans. More robust to outliers than average. |
| p90 | 90th-percentile value. Best for understanding worst-case performance at scale. |
| p95 | 95th-percentile value. Useful for tail latency on larger samples. |
| p99 | 99th-percentile value. Useful for rare but severe latency spikes. |
| Max | Maximum value observed across all matching spans. Useful for worst-case detection. |
| Min | Minimum value observed across all matching spans. |
| Sum | Total value across all matching spans. Useful for token counts, cost-like counters, and accumulated work. |
| Count | Number of matching spans. Useful for tool calls, retries, fallbacks, handoffs, and critical events. |
| Error Rate | Percentage of matching spans with an error status. |
| Success Rate | Percentage of matching spans with a successful status. |
count, error_rate, and success_rate, the metric aggregates matching spans directly. For numeric aggregations such as average, p95, or sum, choose a numeric span attribute.
Requires OTel traces with your configured span · Unit depends on aggregation
LLM Time to First Byte
How quickly the language model returns its first token — a key driver of how responsive your agent feels. What it measures — The average time in seconds from when the LLM received a request to when it returned its first token, averaged across all turns. When to use — Identifying slow LLM providers, comparing latencies between candidate models, or checking whether prompt-length changes actually made responses faster. Pair it with STT and TTS Time to First Byte to isolate which pipeline stage drives end-to-end latency. How it works — Reads themetrics.ttfb attribute on OTel llm spans and averages the request-to-first-token interval across every turn.
How to interpret — Lower is better; lower values indicate a more responsive language model. Compare across runs to catch latency regressions after a model or prompt change, and compare against the other Time to First Byte metrics to see whether the LLM is the bottleneck.
Configuration — None (built-in).
Requires OTel llm spans with metrics.ttfb · Unit seconds
LLM Token Usage
Total LLM tokens consumed across the conversation — your primary cost proxy. What it measures — Total LLM tokens consumed (input + output) across all turns in the conversation. When to use — Cost monitoring, tracking token growth across runs, comparing prompt strategies for efficiency, or flagging individual conversations that consume excessive tokens. How it works — Sums the input and output token counts from thegen_ai.usage.* attributes on OTel llm spans across every turn.
How to interpret — Token counts vary by model and use case, so track this over time to establish a baseline for your agent. Sudden spikes can indicate prompt injection, runaway tool loops, or excessively long conversations; combine with turn count to compute average tokens per turn.
Configuration — None (built-in).
Requires OTel llm spans with gen_ai.usage.* token attributes · Unit tokens
STT Time to First Byte
How quickly the speech-to-text service returns its first transcription result. What it measures — The average time in seconds from when audio was sent to the STT service to when it returned the first transcription result, averaged across all turns. When to use — Evaluating or comparing STT providers on speed, or diagnosing why the agent is slow to start processing user input — STT is the first stage of the pipeline, so its delay pushes everything downstream. How it works — Reads themetrics.ttfb attribute on OTel stt spans and averages the audio-sent-to-first-result interval across every turn.
How to interpret — Lower is better; it signals faster STT responsiveness. Compare against LLM and TTS Time to First Byte to determine whether speech recognition is the slow stage in your pipeline.
Configuration — None (built-in).
Requires OTel stt spans with metrics.ttfb · Unit seconds
STT Word Error Rate
Accuracy of your agent’s speech-to-text against Coval’s reference transcript — no manual ground truth or test data needed. What it measures — The word error rate of your agent’s speech-to-text, comparing its STT output from trace spans against Coval’s reference transcript. When to use — Comparing STT providers (e.g. Deepgram vs. Whisper vs. Google), regression-testing an STT provider swap, diagnosing why your agent misunderstands users, or tracking STT quality over time. How it works — Reads thetranscript attribute on OTel stt spans (the hypothesis) and compares it against Coval’s auto-generated transcription of the persona’s speech (the reference). The older stt.transcription alias is accepted, but new integrations should emit transcript.
How to interpret — Lower is better. A value of 0.0 means perfect accuracy; 1.0 means fully incorrect. Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally; for the WER formula, see Transcription Error.
Configuration — None (built-in). Requires the transcript attribute on each stt span — see Instrumenting STT Spans. If your provider exposes utterance confidence, also send stt.confidence to make low-confidence turns easier to inspect.
Requires OTel stt spans with transcript · Unit WER (0–1)
STT Word Error Rate (Audio Upload)
Accuracy of your agent’s speech-to-text against a transcript you provide. What it measures — The word error rate of your agent’s speech-to-text against a known-correct transcript you provide in the test case’s expected output. When to use — With audio upload test cases where you know exactly what was said — regression-testing STT quality against a canonical script, or benchmarking providers on a fixed recording. How it works — Reads thetranscript attribute on OTel stt spans (the hypothesis) and compares it against the ground_truth_transcript in the test case’s expected output (the reference). Ground truth can be plain text, labeled text with timestamps, or JSON with a messages array; when role labels are present, only persona/user lines are used.
How to interpret — Lower is better, same scale as STT Word Error Rate. Because the reference is your own known-correct text rather than Coval’s transcription, this metric isolates your STT provider’s accuracy without any variance from the reference side.
Configuration — None (built-in). Requires the transcript attribute on each stt span and a ground_truth_transcript key in the test case’s expected output.
Requires OTel stt spans with transcript + a ground-truth transcript on the test case · Unit WER (0–1)
Tool Call Count
Total tool calls made by the agent, with a per-tool breakdown. What it measures — The total number of tool calls made by the agent during the conversation, with a per-tool breakdown. When to use — Verifying the agent uses tools as expected for a scenario — confirming a lookup tool actually fires on account questions, or catching conversations with excessive or missing tool usage after a prompt change. How it works — Countsllm_tool_call span invocations detected in the OTel trace, broken down by tool name.
How to interpret — Compare the count against the tool usage you expect for the scenario. Unexpectedly high counts suggest over-reliance on tools or unnecessary retries and round-trips; zero calls on a tool-dependent scenario flags a failure — the agent never invoked the tool it needed.
Configuration — None (built-in).
Requires OTel llm_tool_call spans · Unit count
TTS Time to First Byte
How quickly the text-to-speech service returns its first audio byte. What it measures — The average time in seconds from when text was sent to the TTS service to when it returned the first audio byte, averaged across all turns. When to use — Evaluating or comparing TTS providers, or identifying bottlenecks in the audio generation stage — TTS is the last hop before the user hears anything, so its delay lands directly on perceived response time. How it works — Reads themetrics.ttfb attribute on OTel tts spans and averages the text-sent-to-first-byte interval across every turn.
How to interpret — Lower is better; it signals faster speech synthesis. Compare against LLM and STT Time to First Byte to determine whether speech synthesis is the slow stage in your pipeline.
Configuration — None (built-in).
Requires OTel tts spans with metrics.ttfb · Unit seconds
