Evaluating Tool Calls - Coval Documentation

The transcript tells you what the agent said. It doesn’t tell you whether the tool call actually fired, what arguments it received, how long it took, or what it returned. To validate tool calls, instrument your agent with tracing so you can see everything that happened underneath the conversation. This cookbook follows one running example — an auto-insurance voice agent named claims-bot — from instrumentation all the way to a custom LLM judge that catches it making up data.

The Example

A caller dials claims-bot after locking their keys in the car. The agent needs to:

Verify the caller’s identity (DOB + last 4 of policy number)
Look up their roadside-assistance coverage
Dispatch a locksmith and quote an ETA

A clean run uses three tools: verify_caller, lookup_policy, dispatch_roadside. We want every factual claim the agent makes on the call (the policy tier, the dispatch ETA, the claim ID) to be backed by a real tool result — not invented because a tool errored.

Step One: Instrument `claims-bot` with OpenTelemetry

Send traces from your agent to Coval using the OpenTelemetry SDK. This captures detailed span data (tool calls, LLM invocations, and other operations) and exports it directly to Coval alongside your simulation results. Follow the setup guide: OpenTelemetry Traces.

When you instrument the LLM, emit a tool_call span (or llm_tool_call, depending on your convention) per tool invocation with the tool’s arguments, result, and any error. See Instrumenting LLM Spans for the exact shape Coval expects. For our claims-bot example, a single turn that calls lookup_policy looks like this:

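import json

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("claims-bot")

# lookup_policy and ToolError are defined in your agent's own code.
with tracer.start_as_current_span("tool_call") as span:
    span.set_attribute("tool.name", "lookup_policy")
    span.set_attribute("tool.arguments", json.dumps({"policy_id": "P-48213"}))
    try:
        result = lookup_policy(policy_id="P-48213")
        span.set_attribute("tool.result", json.dumps(result))
    except ToolError as e:
        # Record the failure on the span, then let the agent handle it.
        span.set_attribute("tool.error", str(e))
        span.set_status(Status(StatusCode.ERROR))
        raise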

After a simulation runs, the trace for our locksmith call should contain three sibling tool_call spans under the LLM turn that produced them: verify_caller, lookup_policy, dispatch_roadside.
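
One way to sanity-check this expectation is to assert on an exported copy of the trace. The sketch below assumes the trace is available as a list of span dictionaries with `name`, `attributes`, and `start_time` keys. That shape is hypothetical and chosen for illustration; it is not Coval's documented export format.

```python
EXPECTED_TOOLS = ["verify_caller", "lookup_policy", "dispatch_roadside"]


def tool_call_names(spans):
    """Return tool names from tool_call spans, ordered by start time."""
    tool_spans = [s for s in spans if s["name"] == "tool_call"]
    tool_spans.sort(key=lambda s: s["start_time"])
    return [s["attributes"]["tool.name"] for s in tool_spans]


def missing_tools(spans, expected=EXPECTED_TOOLS):
    """Return the expected tools that never appear in the trace."""
    seen = set(tool_call_names(spans))
    return [name for name in expected if name not in seen]


# Hypothetical export of the locksmith call; start_time is seconds into the call.
locksmith_trace = [
    {"name": "llm_turn", "attributes": {}, "start_time": 0.0},
    {"name": "tool_call", "attributes": {"tool.name": "lookup_policy"}, "start_time": 9.1},
    {"name": "tool_call", "attributes": {"tool.name": "verify_caller"}, "start_time": 4.2},
    {"name": "tool_call", "attributes": {"tool.name": "dispatch_roadside"}, "start_time": 15.7},
]

print(tool_call_names(locksmith_trace))  # ['verify_caller', 'lookup_policy', 'dispatch_roadside']
print(missing_tools(locksmith_trace))    # []
```

A non-empty result from `missing_tools` means a tool never fired, or fired without being wrapped in a span.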

Without tool_call spans, Coval can only show you the surrounding conversation. The span is what unlocks every step below.

Step Two: Inspect the Trace for the Locksmith Call

Run a simulation of the locksmith scenario, then open its detail page and click the Traces card. You’ll see each tool call in order with everything you need to debug it:

| What you see in the trace | Example value from the locksmith call |
| --- | --- |
| Tool name | `dispatch_roadside` |
| Arguments | `{"policy_id": "P-48213", "service": "locksmith", "location": "37.7749,-122.4194"}` |
| Result | `{"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."}` |
| Latency | `1.8s` |
| Error (if any) | `null` |
| Span metadata | `model="gpt-4o", turn=4, parent_span=llm_turn` |

In the transcript, the agent says “I’ve dispatched a locksmith — claim RA-90412, ETA about 35 minutes.” The trace lets you confirm those numbers came from the tool and weren’t hallucinated. This is the fastest way to debug a single failing conversation — you can see exactly which tool returned bad data, timed out, or never fired at all, instead of guessing from the transcript.
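
The same comparison can be roughed out in code when you want to check a single utterance against a single tool result. This is a crude heuristic sketch, not part of Coval: the function name and the regular expression (claim-style IDs and bare numbers) are assumptions, and it will miss values that are spoken in words.

```python
import json
import re


def ungrounded_values(utterance, tool_result_json):
    """Return IDs and numbers quoted in the utterance that the tool result lacks."""
    result = json.loads(tool_result_json)
    # Compare as strings so 35 (int) in the result matches "35" in speech.
    grounded = {str(v) for v in result.values()}
    # Claim-style IDs (letters, dash, digits) or bare numbers.
    quoted = re.findall(r"[A-Z]+-\d+|\d+", utterance)
    return [q for q in quoted if q not in grounded]


result_json = '{"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."}'

print(ungrounded_values("Claim RA-90412, ETA about 35 minutes.", result_json))  # []
print(ungrounded_values("Claim RA-77120, ETA about 20 minutes.", result_json))  # ['RA-77120', '20']
```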

Step Three: Search Tool Calls Across All `claims-bot` Simulations

To look across many simulations at once, open Trace Search. Try a query like “tool calls in last week” to pull every tool call span from the past 7 days.

For claims-bot last week, this surfaces a pattern: dispatch_roadside returned SERVICE_UNAVAILABLE on 12 of 340 calls (≈3.5%). From the search results you can:

Drill into specific simulations — open any of those 12 calls and see how the agent responded after the error. Did it tell the caller dispatch was unavailable, or did it confidently quote a fake claim ID?
View the failure matrix — see whether errors cluster on a specific region, policy tier, or time of day.
Refine the query — narrow to dispatch_roadside SERVICE_UNAVAILABLE to look only at the failure cohort, or to a single agent if you run multiple.

This is where systemic issues surface — “this tool fails on West Coast traffic” or “the agent fabricates ETAs whenever dispatch errors” — that you’d never catch one transcript at a time.
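
For intuition, the aggregation behind a pattern like “12 of 340” can be sketched offline. The example assumes tool call spans flattened into dictionaries with `tool.name`, an optional `tool.error`, and a hypothetical `region` attribute. The data is synthetic and sized to match the numbers above; the region split is invented purely to exercise the grouping.

```python
from collections import Counter


def error_rate_by_tool(spans):
    """Map each tool name to (error_count, total_calls, error_rate)."""
    totals, errors = Counter(), Counter()
    for span in spans:
        name = span["tool.name"]
        totals[name] += 1
        if span.get("tool.error"):
            errors[name] += 1
    return {n: (errors[n], totals[n], errors[n] / totals[n]) for n in totals}


def errors_by(spans, key):
    """Count errored spans grouped by an arbitrary attribute, e.g. region."""
    return Counter(s.get(key, "unknown") for s in spans if s.get("tool.error"))


def make_span(region, error=None):
    span = {"tool.name": "dispatch_roadside", "region": region}
    if error:
        span["tool.error"] = error
    return span


# Synthetic data: 328 successful calls plus 12 errors (9 west, 3 east).
spans = [make_span("west") for _ in range(328)]
spans += [make_span("west", "SERVICE_UNAVAILABLE") for _ in range(9)]
spans += [make_span("east", "SERVICE_UNAVAILABLE") for _ in range(3)]

errs, total, rate = error_rate_by_tool(spans)["dispatch_roadside"]
print(f"{errs}/{total} = {rate:.1%}")  # 12/340 = 3.5%
print(errors_by(spans, "region"))      # Counter({'west': 9, 'east': 3})
```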

Step Four: Catch Fabrication With a Custom Trace Metric

After drilling into the 12 failure cases above, we want to score automatically, across every future simulation, whether the agent fabricated data after a tool error. We do this with a custom LLM judge metric that runs over the trace, not just the transcript. Create the metric in the dashboard:

Name: Tool-Grounded Claim Integrity
Metric Type: Text LLM Judge
Output Type: Binary (YES / NO)
includeTraces=True ← this is the critical setting

You must toggle includeTraces=True when creating the metric. Without it the judge only sees the transcript — it has no visibility into your tool spans, so it can’t tell a real claim ID from a fabricated one. With it on, the full span tree (tool names, arguments, results, errors) is passed in as TRACE CONTEXT alongside the transcript.

For the metric prompt, paste:

You are evaluating a voice assistant conversation. You have access to the TRANSCRIPT of the call and the TRACE CONTEXT, which includes `tool_call` spans showing every tool the assistant invoked along with arguments and results.

Return YES if ALL of the following are true:
- Every specific factual claim the assistant made (appointment times, prescription status, account balances, claim IDs, policy details, roadside dispatch ETAs, etc.) was preceded by a corresponding tool_call span that retrieved that data
- When a tool returned an error (for example SERVICE_UNAVAILABLE), the assistant correctly told the caller the information was unavailable and gave alternate guidance — it did NOT fabricate values
- Tool call ordering was logical (e.g. looking up a policy before filing a claim under it, verifying identity before disclosing account details)

Return NO if ANY of the following are true:
- The assistant stated specific data without a preceding tool_call span to retrieve it
- The assistant invented plausible-looking values after a tool returned an error
- The assistant skipped a required verification or lookup step

Be strict. General conversational statements (greetings, empathy, closing) do not require tool calls — only specific factual claims do.

Run this against the past week of claims-bot simulations. The metric should:

Return YES on the clean locksmith run from Step Two (every claim is grounded in a real span).
Return NO on calls where dispatch_roadside errored but the agent still quoted an ETA — exactly the failure mode Trace Search surfaced in Step Three.
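
Before trusting the judge on a full week of data, it can help to pin its behavior on a handful of simulations you have labelled by hand. The sketch below assumes you copy the judge's verdicts out of the metric results yourself. The simulation IDs and the `judge_report` helper are hypothetical, and the verdicts shown are made up to illustrate a miss.

```python
def judge_report(expected, judged):
    """Compare judge verdicts against hand-assigned labels.

    Both arguments map a simulation ID to "YES" or "NO".
    """
    missed = sorted(
        sim for sim, label in expected.items()
        if label == "NO" and judged.get(sim) == "YES"
    )
    false_alarms = sorted(
        sim for sim, label in expected.items()
        if label == "YES" and judged.get(sim) == "NO"
    )
    return {"missed_fabrications": missed, "false_alarms": false_alarms}


# Hypothetical simulation IDs with the verdict a careful reviewer would give.
expected = {
    "sim-clean-locksmith": "YES",        # every claim backed by a tool result
    "sim-dispatch-error-eta": "NO",      # dispatch errored, agent quoted an ETA anyway
    "sim-dispatch-error-honest": "YES",  # dispatch errored, agent said so
}

# Verdicts copied by hand from the metric results (made up here).
judged = {
    "sim-clean-locksmith": "YES",
    "sim-dispatch-error-eta": "YES",
    "sim-dispatch-error-honest": "YES",
}

print(judge_report(expected, judged))
# {'missed_fabrications': ['sim-dispatch-error-eta'], 'false_alarms': []}
```

A missed fabrication usually points to a prompt that is too lenient, or to the judge running without trace context. A false alarm suggests the prompt treats ordinary conversational statements as factual claims.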

You can write this kind of judge as specific or as general as you like. The model sees the full span tree, so anything visible in the trace is fair game — argument shapes, result fields, error codes, span ordering, latency, and so on.

Coming Soon: Latency Metrics on Tool Calls

We’re rolling out custom trace metrics that evaluate tool call timing directly:

“How long did tool calls take to return — avg / p50 / p95?”
“How long did the dispatch_roadside tool specifically take — avg / p95?”

Useful for catching regressions in tool performance over time, or for flagging individual simulations where a tool exceeded an SLO. Stay tuned — coming shortly.
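
In the meantime, if you have span durations available outside Coval (for example from your own OpenTelemetry collector), the same summaries can be approximated offline. This sketch assumes spans flattened into dictionaries with hypothetical `tool.name` and `duration_s` keys, and uses a simple nearest-rank percentile. The definition Coval will use may differ.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(durations_s):
    """Return avg, p50, and p95 for a non-empty list of durations in seconds."""
    return {
        "avg": sum(durations_s) / len(durations_s),
        "p50": percentile(durations_s, 50),
        "p95": percentile(durations_s, 95),
    }


def slo_breaches(spans, slo_s):
    """Return spans whose duration exceeds the SLO, in seconds."""
    return [s for s in spans if s["duration_s"] > slo_s]


# Synthetic durations for illustration only.
spans = [
    {"tool.name": "dispatch_roadside", "duration_s": d}
    for d in (1.0, 2.0, 3.0, 4.0)
]

print(latency_summary([s["duration_s"] for s in spans]))
# {'avg': 2.5, 'p50': 2.0, 'p95': 4.0}
print(len(slo_breaches(spans, slo_s=2.5)))  # 2
```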
