claims-bot — from instrumentation all the way to a custom LLM judge that catches it making up data.
The Example
A caller dials claims-bot after locking their keys in the car. The agent needs to:
- Verify the caller’s identity (DOB + last 4 of policy number)
- Look up their roadside-assistance coverage
- Dispatch a locksmith and quote an ETA
The agent has three tools: verify_caller, lookup_policy, and dispatch_roadside. We want every factual claim the agent makes on the call (the policy tier, the dispatch ETA, the claim ID) to be backed by a real tool result, not invented because a tool errored.
Step One: Instrument claims-bot with OpenTelemetry
Send traces from your agent to Coval using the OpenTelemetry SDK. This captures detailed span data — tool calls, LLM invocations, and other operations — and exports it directly to Coval alongside your simulation results.
Follow the setup guide: OpenTelemetry Traces.
When you instrument the LLM, emit a tool_call span (or llm_tool_call, depending on your convention) per tool invocation with the tool’s arguments, result, and any error. See Instrumenting LLM Spans for the exact shape Coval expects.
For our claims-bot example, a single turn that calls lookup_policy looks like this:
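The exact span shape is defined in Instrumenting LLM Spans. As a rough illustration only, the sketch below builds the attributes for one lookup_policy tool_call span. The attribute keys (tool.name, tool.arguments, tool.result, tool.error), the helper name, and the lookup_policy argument and result values are assumptions made for this example, not Coval's documented schema.

```python
import json
from typing import Any, Optional


def tool_call_attributes(
    name: str,
    arguments: dict[str, Any],
    result: Optional[dict[str, Any]] = None,
    error: Optional[str] = None,
) -> dict[str, str]:
    """Flatten one tool invocation into string-valued span attributes.

    OpenTelemetry attribute values must be primitives (or lists of them),
    so nested arguments and results are JSON-encoded here.
    The key names are hypothetical, not Coval's documented schema.
    """
    attrs = {
        "tool.name": name,
        "tool.arguments": json.dumps(arguments, sort_keys=True),
    }
    if result is not None:
        attrs["tool.result"] = json.dumps(result, sort_keys=True)
    if error is not None:
        attrs["tool.error"] = error
    return attrs


# Illustrative values; the policy tier below is invented for this example.
attrs = tool_call_attributes(
    "lookup_policy",
    arguments={"policy_id": "P-48213"},
    result={"tier": "plus", "roadside_assistance": True},
)
print(attrs)
```

With the OpenTelemetry Python SDK, a dictionary like this can be passed as `attributes=` to `tracer.start_as_current_span("tool_call", ...)` while the LLM turn's span is active, so the tool span is recorded as its child.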
Each tool invocation appears as a tool_call span under the LLM turn that produced it: verify_caller, lookup_policy, dispatch_roadside.
Without tool_call spans, Coval can only show you the surrounding conversation. The span is what unlocks every step below.
Step Two: Inspect the Trace for the Locksmith Call
Run a simulation of the locksmith scenario, then open its detail page and click the Traces card. You’ll see each tool call in order with everything you need to debug it:
| What you see in the trace | Example value from the locksmith call |
|---|---|
| Tool name | dispatch_roadside |
| Arguments | {"policy_id": "P-48213", "service": "locksmith", "location": "37.7749,-122.4194"} |
| Result | {"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."} |
| Latency | 1.8s |
| Error (if any) | null |
| Span metadata | model="gpt-4o", turn=4, parent_span=llm_turn |
Step Three: Search Tool Calls Across All claims-bot Simulations
To look across many simulations at once, use Trace Search. Try a query like "tool calls in last week" to pull every tool call span from the past 7 days:
Open Trace Search →
For claims-bot last week, this surfaces a pattern: dispatch_roadside returned SERVICE_UNAVAILABLE on 12 of 340 calls (≈3.5%). From the search results you can:
- Drill into specific simulations — open any of those 12 calls and see how the agent responded after the error. Did it tell the caller dispatch was unavailable, or did it confidently quote a fake claim ID?
- View the failure matrix — see whether errors cluster on a specific region, policy tier, or time of day.
- Refine the query: narrow to dispatch_roadside SERVICE_UNAVAILABLE to look only at the failure cohort, or to a single agent if you run multiple.
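The failure rate quoted above is simple arithmetic over the matching spans. The sketch below reproduces it on synthetic data, assuming each span has been reduced to a hypothetical (tool_name, error) pair and treating all 340 calls as dispatch_roadside invocations; it does not use any Coval API.

```python
from collections import Counter
from typing import Iterable, Optional


def error_rates(spans: Iterable[tuple[str, Optional[str]]]) -> dict[str, float]:
    """Return the fraction of calls that errored, per tool.

    Each span is a (tool_name, error) pair; error is None on success.
    """
    calls: Counter = Counter()
    errors: Counter = Counter()
    for tool, error in spans:
        calls[tool] += 1
        if error is not None:
            errors[tool] += 1
    return {tool: errors[tool] / calls[tool] for tool in calls}


# Synthetic data matching the counts in the text: 340 calls, 12 of them failed.
spans = [("dispatch_roadside", "SERVICE_UNAVAILABLE")] * 12
spans += [("dispatch_roadside", None)] * 328

rate = error_rates(spans)["dispatch_roadside"]
print(f"{rate:.1%}")  # 12 / 340, which rounds to 3.5%
```

Grouping by an extra key such as region or policy tier instead of tool name would give a rough equivalent of the failure matrix view.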
Step Four: Catch Fabrication With a Custom Trace Metric
After drilling into one of the 12 failure cases above, we want to score automatically, across every future simulation, whether the agent fabricated data after a tool error. We do this with a custom LLM judge metric that runs over the trace, not just the transcript. Create the metric in the dashboard:
- Name: Tool-Grounded Claim Integrity
- Metric Type: Text LLM Judge
- Output Type: Binary (YES / NO)
- includeTraces=True ← this is the critical setting
Run the metric on your claims-bot simulations. The metric should:
- Return YES on the clean locksmith run from Step Two (every claim is grounded in a real span).
- Return NO on calls where dispatch_roadside errored but the agent still quoted an ETA, exactly the failure mode Trace Search surfaced in Step Three.
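The two expected outcomes above can be approximated with a deterministic check, which is useful as a sanity test for the judge. This is a heuristic sketch, not the LLM judge and not Coval's implementation: the function name, the regular expressions, and the sample utterances are all assumptions, and it only covers the dispatch_roadside error case.

```python
import re
from typing import Optional


def grounded_after_dispatch(agent_text: str, dispatch_error: Optional[str]) -> str:
    """Return "YES" if the agent's claims are consistent with the dispatch span.

    Heuristic only: when dispatch_roadside errored, any quoted claim ID
    (for example RA-90412) or ETA in minutes is treated as fabricated.
    """
    if dispatch_error is None:
        # The tool succeeded, so this narrow check has nothing to flag.
        return "YES"
    quoted_claim_id = re.search(r"\bRA-\d+\b", agent_text)
    quoted_eta = re.search(r"\b\d+\s*minutes?\b", agent_text, re.IGNORECASE)
    return "NO" if (quoted_claim_id or quoted_eta) else "YES"


clean = grounded_after_dispatch(
    "Your claim RA-90412 is open and the locksmith arrives in 35 minutes.", None
)
fabricated = grounded_after_dispatch(
    "A locksmith will be there in 35 minutes.", "SERVICE_UNAVAILABLE"
)
honest = grounded_after_dispatch(
    "Dispatch is unavailable right now, so I can't quote an ETA.", "SERVICE_UNAVAILABLE"
)
print(clean, fabricated, honest)
```

Unlike a judge that reads the full trace, this check does not verify that values quoted after a successful call actually match the tool result.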
Coming Soon: Latency Metrics on Tool Calls
We’re rolling out custom trace metrics that evaluate tool call timing directly:
- “How long did tool calls take to return (avg / p50 / p95)?”
- “How long did the dispatch_roadside tool specifically take (avg / p95)?”
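Until those metrics are available, the same statistics can be computed from exported span latencies. The sketch below is a minimal example that assumes latencies are already collected as a list of seconds; the sample values are made up, and the nearest-rank percentile definition is one common choice, not necessarily the one Coval will use.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Summarize tool call latencies (in seconds) as avg, p50, and p95."""
    return {
        "avg": mean(latencies_s),
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
    }


# Made-up latencies in seconds for ten tool calls.
samples = [1.2, 1.8, 0.9, 2.4, 1.5, 6.0, 1.1, 1.7, 1.3, 2.0]
summary = latency_summary(samples)
print(summary)
```

Filtering the input list to a single tool name before calling latency_summary gives the per-tool version of the second question.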

