Custom Trace Metrics - Coval Documentation

Walkthrough

Overview

Custom Trace Metrics let you extract a specific numerical value from your agent’s OpenTelemetry spans and aggregate it across all turns in a simulation. Use Custom Trace Metrics when you have a signal already captured in your traces — latency measurements, confidence scores, token counts, retry attempts — that you want to track and trend across runs.

Prerequisites

Your agent must be instrumented with OpenTelemetry and sending spans to Coval. See the OpenTelemetry Traces guide for setup instructions. If traces are not present for a simulation, the metric will report an error at execution time.

Configuration

When creating a Custom Trace Metric, configure three fields:

Field	Description
Span Name	The name of the OTel span to query (e.g. `llm`, `tts`, `stt`, `llm_tool_call`, or any custom span name you emit).
Metric Attribute	The span attribute to extract the value from (e.g. `retrieval_latency_ms`, `confidence_score`, or another custom numeric attribute key).
Aggregation Method	How to aggregate the extracted values across all matching spans in the simulation.
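
Conceptually, the three fields act as a filter, an extractor, and a reducer. The sketch below models that flow in plain Python. The `Span` dataclass and the `evaluate_trace_metric` function are hypothetical names used for illustration only; they are not Coval's implementation or API, and skipping spans that lack the attribute is an assumption.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Span:
    """Hypothetical, simplified stand-in for an exported OTel span."""
    name: str
    attributes: dict = field(default_factory=dict)


def evaluate_trace_metric(
    spans: Sequence[Span],
    span_name: str,
    metric_attribute: str,
    aggregate: Callable[[Sequence[float]], float],
) -> float:
    # 1. Span Name: keep only the spans whose name matches.
    matching = [s for s in spans if s.name == span_name]
    # 2. Metric Attribute: read the value from each matching span.
    #    Spans without the attribute are skipped (an assumption of this sketch).
    values = [
        s.attributes[metric_attribute]
        for s in matching
        if metric_attribute in s.attributes
    ]
    if not values:
        raise ValueError(f"no '{metric_attribute}' values on '{span_name}' spans")
    # 3. Aggregation Method: reduce the values to a single number.
    return aggregate(values)


spans = [
    Span("document_retrieval", {"retrieval_latency_ms": 120.0}),
    Span("llm", {"confidence_score": 0.9}),
    Span("document_retrieval", {"retrieval_latency_ms": 180.0}),
]
result = evaluate_trace_metric(
    spans, "document_retrieval", "retrieval_latency_ms", mean
)
print(result)  # 150.0
```

Only the two `document_retrieval` spans contribute to the result; the `llm` span is filtered out by name before any value is read.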

Aggregation Methods

Method	Description
Average	Mean value across all matching spans. Best for typical-case latency or scores.
Median	Median value across all matching spans. More robust to outliers than average.
p90	90th-percentile value. Useful for tail performance without letting a single extreme outlier dominate the result.
p95	95th-percentile value. Useful for tail latency on larger samples.
p99	99th-percentile value. Useful for rare but severe latency spikes.
Max	Maximum value observed across all matching spans. Useful for worst-case detection.
Min	Minimum value observed across all matching spans.
Sum	Total value across all matching spans. Useful for token counts, cost-like counters, and accumulated work.
Count	Number of matching spans. Useful for tool calls, retries, fallbacks, handoffs, and critical events.
Error Rate	Percentage of matching spans with an error status.
Success Rate	Percentage of matching spans with a successful status.

For `count`, `error_rate`, and `success_rate`, the metric can aggregate the matching spans directly, without reading a numeric attribute value. For numeric aggregations such as `average`, `p95`, or `sum`, choose a span attribute whose values are numeric.
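
To make the difference between the methods concrete, here is a small sketch that computes several of them over the same ten hypothetical span durations. It is not Coval's implementation. Percentile definitions differ between tools, and the nearest-rank rule used here is an assumption, as is the representation of span status as the strings `"OK"` and `"ERROR"`.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(statuses):
    """Percentage of spans whose status is "ERROR"."""
    errors = sum(1 for status in statuses if status == "ERROR")
    return 100.0 * errors / len(statuses)


# Hypothetical values extracted from ten matching spans.
durations_ms = [80.0, 95.0, 100.0, 110.0, 120.0, 130.0, 150.0, 200.0, 400.0, 1500.0]
statuses = ["OK"] * 8 + ["ERROR"] * 2

summary = {
    "average": mean(durations_ms),            # 288.5
    "median": median(durations_ms),           # 125.0
    "p90": percentile(durations_ms, 90),      # 400.0
    "p99": percentile(durations_ms, 99),      # 1500.0
    "max": max(durations_ms),                 # 1500.0
    "min": min(durations_ms),                 # 80.0
    "sum": sum(durations_ms),                 # 2885.0
    "count": len(durations_ms),               # 10
    "error_rate": error_rate(statuses),       # 20.0
}
```

The single 1500 ms span pulls the average to more than twice the median, which is why the median and the percentiles are often easier to interpret for latency-like attributes.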

Span Names

Any span name your agent emits can be queried. The following well-known span names map to Coval’s built-in trace components:

Span Name	Component
`llm`	Language model invocations
`tts`	Speech synthesis
`stt`	Speech recognition
`llm_tool_call`	Individual tool/function calls
`turn`	A single conversation turn

Custom span names (e.g. `document_retrieval`, `database_lookup`) work as well; use whatever names your agent emits.

How to Create

1. Open the Metrics page. Navigate to the Metrics section in the Coval dashboard.

2. Click Create Metric. Select Custom Trace Metrics from the metric type group.

3. Configure the metric. Fill in Span Name, Metric Attribute, and Aggregation Method for your use case.

4. Name and save. Give the metric a descriptive name and save. It is now available to add to any run.

Use Cases

Custom Latency Tracking

Extract average document retrieval latency from your custom retrieval spans:

Field	Value
Span Name	`document_retrieval`
Metric Attribute	`retrieval_latency_ms`
Aggregation Method	Average

This gives you the average retrieval latency across all turns in the simulation. Compare it across runs to catch regressions after changes to your index, embeddings, or chunking strategy.
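
For this configuration to produce values, each `document_retrieval` span needs a numeric `retrieval_latency_ms` attribute. One way to record it is sketched below. The `timed_span` helper is a hypothetical name, not part of OpenTelemetry or Coval; it assumes a tracer object that exposes `start_as_current_span` and spans that expose `set_attribute`, as the OpenTelemetry Python API does.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_span(tracer, span_name, latency_attribute):
    """Open a span and record its wall-clock duration as a numeric attribute."""
    with tracer.start_as_current_span(span_name) as span:
        start = time.perf_counter()
        try:
            yield span
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # Set before the span ends, and stored as a float so that
            # numeric aggregations such as Average can consume it.
            span.set_attribute(latency_attribute, elapsed_ms)
```

With the OpenTelemetry Python API installed, usage might look like `with timed_span(trace.get_tracer(__name__), "document_retrieval", "retrieval_latency_ms"): docs = retrieve(query)`, where `trace` is imported from `opentelemetry` and `retrieve` is a placeholder for your own retrieval call. The `finally` clause records the latency even when the retrieval raises, so failed lookups still contribute a value.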

p90 External API Latency

Track tail latency for an external service your agent depends on:

Field	Value
Span Name	`weather_api`
Metric Attribute	`duration_ms`
Aggregation Method	p90

Use p90 rather than average when you care about tail performance more than typical performance, especially for services that occasionally spike.
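
A short worked example shows why the choice matters. The values below are invented `duration_ms` readings from 20 `weather_api` spans, and the p90 is computed with Python's `statistics.quantiles`, whose interpolation rule may differ from the one Coval applies.

```python
from statistics import mean, quantiles

# Hypothetical duration_ms values: 17 typical calls and 3 slow ones.
durations_ms = [100.0] * 17 + [900.0] * 3

average = mean(durations_ms)
# quantiles(..., n=10) returns the 9 decile cut points; the last one is p90.
p90 = quantiles(durations_ms, n=10, method="inclusive")[-1]

print(f"average={average:.0f} ms, p90={p90:.0f} ms")  # average=220 ms, p90=900 ms
```

The average of 220 ms describes none of the calls, since every call took either 100 ms or 900 ms. The p90 of 900 ms exposes the slow calls directly.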

Tool Call Duration Monitoring

If your agent emits custom spans for specific tool calls and records a duration attribute on them, average that attribute to monitor how long each tool takes:

Field	Value
Span Name	`database_lookup`
Metric Attribute	`duration_ms`
Aggregation Method	Average

Confidence Score Extraction

If your agent records a confidence score on each language model span, average it to track model confidence across a simulation:

Field	Value
Span Name	`llm`
Metric Attribute	`confidence_score`
Aggregation Method	Average
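
Numeric aggregations need a numeric attribute, but OpenTelemetry attributes can also hold strings and booleans, so a score recorded as the string `"0.87"` is not a number. This page does not say whether numeric strings are coerced, so it may be safer to normalize the value at instrumentation time. The `numeric_attribute` helper below is a hypothetical sketch of such a guard.

```python
import math


def numeric_attribute(value):
    """Coerce a value to a finite float suitable for a numeric span attribute."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("booleans are not treated as numeric scores here")
    number = float(value)  # raises ValueError or TypeError for non-numeric input
    if not math.isfinite(number):
        raise ValueError(f"attribute value must be finite, got {value!r}")
    return number
```

A call such as `span.set_attribute("confidence_score", numeric_attribute(raw_score))` then fails loudly in your agent's code when `raw_score` is something like `"high"` or `None`, instead of producing a span that the metric cannot aggregate.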

Custom Trace Metrics complement built-in trace metrics like LLM Time to First Byte and TTS Time to First Byte. Use the built-in metrics for standard pipeline components and Custom Trace Metrics for signals specific to your agent’s instrumentation.

Want an AI-assisted setup? Use Tracing Skills to have your coding agent inspect real traces, recommend 3-6 useful metrics, and create only metrics backed by span data that exists.
