Walkthrough

Overview

Custom Trace Metrics let you extract a specific numeric value from your agent’s OpenTelemetry spans and aggregate it across every matching span in a simulation. Use them when a signal is already captured in your traces (latency measurements, confidence scores, token counts, retry attempts) and you want to track and trend it across runs.

Prerequisites

Your agent must be instrumented with OpenTelemetry and sending spans to Coval. See the OpenTelemetry Traces guide for setup instructions. If traces are not present for a simulation, the metric will report an error at execution time.
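
As a rough illustration of what such instrumentation might look like, the sketch below wraps a retrieval step in a span named document_retrieval and records a retrieval_latency_ms attribute, using the start_as_current_span and set_attribute calls from the OpenTelemetry Python API. The function name retrieve_documents, the placeholder retrieval logic, and the RecordingTracer stand-in are hypothetical; the stand-in exists only so the example runs without OpenTelemetry installed. Exporter configuration for sending spans to Coval is not shown here; follow the OpenTelemetry Traces guide for that.

```python
import time
from contextlib import contextmanager


def retrieve_documents(tracer, query):
    """Run a placeholder retrieval step inside a 'document_retrieval' span."""
    with tracer.start_as_current_span("document_retrieval") as span:
        start = time.perf_counter()
        documents = [f"result for {query}"]  # placeholder for real retrieval work
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Record the numeric attribute a Custom Trace Metric could later extract.
        span.set_attribute("retrieval_latency_ms", elapsed_ms)
        return documents


class RecordingSpan:
    """Hypothetical stand-in for an OpenTelemetry span; stores attributes in a dict."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


class RecordingTracer:
    """Hypothetical stand-in for a tracer so the sketch runs without the SDK."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def start_as_current_span(self, name):
        span = RecordingSpan(name)
        self.spans.append(span)
        yield span


# In a real agent, the tracer would come from the OpenTelemetry API instead:
#   from opentelemetry import trace
#   tracer = trace.get_tracer(__name__)
tracer = RecordingTracer()
documents = retrieve_documents(tracer, "refund policy")
```

Passing the tracer in as an argument keeps the instrumented function independent of how tracing is configured, which also makes it easy to check locally that the span name and attribute key match what you plan to enter in the metric configuration.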

Configuration

When creating a Custom Trace Metric, configure three fields:

| Field | Description |
| --- | --- |
| Span Name | The name of the OTel span to query (e.g. llm, tts, stt, llm_tool_call, or any custom span name you emit). |
| Metric Attribute | The span attribute to extract the value from (e.g. retrieval_latency_ms, confidence_score, or another custom numeric attribute key). |
| Aggregation Method | How to aggregate the extracted values across all matching spans in the simulation. |
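
To make the roles of the three fields concrete, here is a minimal sketch of how a span name and a metric attribute could select values from a list of spans. It is an illustration only, not Coval's implementation: the TraceMetricConfig class, the extract_values function, and the dictionary layout of each span are all hypothetical, and skipping spans that lack the attribute is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TraceMetricConfig:
    """Hypothetical container for the three fields; not a Coval API type."""
    span_name: str
    metric_attribute: str
    aggregation_method: str


def extract_values(spans: list[dict[str, Any]], config: TraceMetricConfig) -> list[float]:
    """Collect the numeric attribute from every span whose name matches."""
    values = []
    for span in spans:
        if span["name"] != config.span_name:
            continue
        value = span.get("attributes", {}).get(config.metric_attribute)
        # Assumption: spans without the attribute, or with a non-numeric value, are skipped.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    return values


# Hypothetical spans from one simulation.
spans = [
    {"name": "document_retrieval", "attributes": {"retrieval_latency_ms": 120.0}},
    {"name": "llm", "attributes": {"confidence_score": 0.9}},
    {"name": "document_retrieval", "attributes": {"retrieval_latency_ms": 80.0}},
    {"name": "document_retrieval", "attributes": {}},
]
config = TraceMetricConfig("document_retrieval", "retrieval_latency_ms", "average")
values = extract_values(spans, config)
```

With this sample data, only the two document_retrieval spans that carry retrieval_latency_ms contribute values; the aggregation method is then applied to that list.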

Aggregation Methods

| Method | Description |
| --- | --- |
| Average | Mean value across all matching spans. Best for typical-case latency or scores. |
| Median | Median value across all matching spans. More robust to outliers than average. |
| p90 | 90th-percentile value. Best for understanding near-worst-case (tail) performance at scale. |
| p95 | 95th-percentile value. Useful for tail latency on larger samples. |
| p99 | 99th-percentile value. Useful for rare but severe latency spikes. |
| Max | Maximum value observed across all matching spans. Useful for worst-case detection. |
| Min | Minimum value observed across all matching spans. |
| Sum | Total value across all matching spans. Useful for token counts, cost-like counters, and accumulated work. |
| Count | Number of matching spans. Useful for tool calls, retries, fallbacks, handoffs, and critical events. |
| Error Rate | Percentage of matching spans with an error status. |
| Success Rate | Percentage of matching spans with a successful status. |

For count, error_rate, and success_rate, the metric can aggregate matching spans directly. For numeric aggregations such as average, p95, or sum, choose a numeric span attribute.
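
The difference between span-level methods (count, error_rate, success_rate) and attribute-level methods (average, p95, sum, and so on) can be sketched as follows. This is an illustrative approximation, not Coval's implementation: the aggregate function, the span dictionary layout, and the "ok"/"error" status strings are hypothetical, the nearest-rank percentile definition is an assumption (other percentile definitions interpolate), and how spans with an unset status are counted is not covered.

```python
import math
import statistics


def nearest_rank_percentile(values, pct):
    """Return the smallest value with at least pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(spans, method, attribute=None):
    """Aggregate already-matched spans with one of the methods listed above."""
    # Span-level methods look only at the spans themselves, not at an attribute.
    if method == "count":
        return len(spans)
    if method in ("error_rate", "success_rate"):
        if not spans:
            raise ValueError("no matching spans")
        wanted = "error" if method == "error_rate" else "ok"
        hits = sum(1 for span in spans if span["status"] == wanted)
        return 100.0 * hits / len(spans)

    # Numeric methods need a numeric attribute to read from each span.
    values = [span["attributes"][attribute] for span in spans
              if attribute in span["attributes"]]
    if not values:
        raise ValueError("no numeric values found for the chosen attribute")
    numeric_methods = {
        "average": statistics.mean,
        "median": statistics.median,
        "max": max,
        "min": min,
        "sum": sum,
        "p90": lambda v: nearest_rank_percentile(v, 90),
        "p95": lambda v: nearest_rank_percentile(v, 95),
        "p99": lambda v: nearest_rank_percentile(v, 99),
    }
    return numeric_methods[method](values)


# Hypothetical matching spans: four tool calls, one of which failed.
spans = [
    {"status": "ok", "attributes": {"duration_ms": 10.0}},
    {"status": "ok", "attributes": {"duration_ms": 20.0}},
    {"status": "ok", "attributes": {"duration_ms": 30.0}},
    {"status": "error", "attributes": {"duration_ms": 40.0}},
]
average_ms = aggregate(spans, "average", "duration_ms")
error_rate = aggregate(spans, "error_rate")
```

Note that error_rate is computed without naming an attribute, while average fails if no attribute is supplied, which mirrors the distinction drawn above.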

Span Names

Any span name your agent emits can be queried. The following well-known span names map to Coval’s built-in trace components:

| Span Name | Component |
| --- | --- |
| llm | Language model invocations |
| tts | Speech synthesis |
| stt | Speech recognition |
| llm_tool_call | Individual tool/function calls |
| turn | A single conversation turn |

Custom span names (e.g. document_retrieval, database_lookup) work as well; use whatever names your agent emits.

How to Create

1. Open the Metrics page. Navigate to the Metrics section in the Coval dashboard.
2. Click Create Metric. Select Custom Trace Metrics from the metric type group.
3. Configure the metric. Fill in Span Name, Metric Attribute, and Aggregation Method for your use case.
4. Name and save. Give the metric a descriptive name and save. It is now available to add to any run.

Use Cases

Custom Latency Tracking

Extract average document retrieval latency from your custom retrieval spans:

| Field | Value |
| --- | --- |
| Span Name | document_retrieval |
| Metric Attribute | retrieval_latency_ms |
| Aggregation Method | Average |

This gives you the average retrieval latency across all document_retrieval spans in the simulation. Compare it across runs to catch regressions after changes to your index, embeddings, or chunking strategy.

p90 External API Latency

Track tail latency for an external service your agent depends on:

| Field | Value |
| --- | --- |
| Span Name | weather_api |
| Metric Attribute | duration_ms |
| Aggregation Method | p90 |

Use p90 rather than average when you care about tail performance more than typical performance, especially for services that can occasionally spike.
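
A small worked example shows why the choice matters. The duration_ms values below are made up for illustration, and the percentile is computed with Python's statistics.quantiles using its default method, which may not match the exact percentile definition Coval applies.

```python
import statistics

# Hypothetical duration_ms values from ten weather_api spans:
# eight fast calls and two slow spikes.
durations_ms = [100.0] * 8 + [1500.0] * 2

average = statistics.mean(durations_ms)
median = statistics.median(durations_ms)
# quantiles(n=10) returns the nine decile cut points; the last one is the 90th percentile.
p90 = statistics.quantiles(durations_ms, n=10)[-1]

print(f"average={average} ms, median={median} ms, p90={p90} ms")
```

Here the median is 100 ms and the average is 380 ms, a value no call actually took, while p90 is 1500 ms and reports the spike latency directly.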

Tool Call Duration Monitoring

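If your agent emits custom spans for specific tool calls with a duration attribute, average that attribute to monitor how long the tool takes:

| Field | Value |
| --- | --- |
| Span Name | database_lookup |
| Metric Attribute | duration_ms |
| Aggregation Method | Average |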

Confidence Score Extraction

If your agent records a confidence score on each language model span, average it across the simulation:

| Field | Value |
| --- | --- |
| Span Name | llm |
| Metric Attribute | confidence_score |
| Aggregation Method | Average |

Custom Trace Metrics complement built-in trace metrics like LLM Time to First Byte and TTS Time to First Byte. Use the built-in metrics for standard pipeline components and Custom Trace Metrics for signals specific to your agent’s instrumentation.

Want an AI-assisted setup? Use Tracing Skills to have your coding agent inspect real traces, recommend 3-6 useful metrics, and create only metrics backed by span data that exists.