Many metric types are configurable: you supply a prompt, a pattern, or a threshold and Coval runs it like any built-in metric. This page covers the techniques shared across custom metrics. For the metric types themselves, see the Metric Types overview.

Writing effective LLM judge prompts

These techniques apply to any LLM judge metric — Binary, Categorical, Numerical, Composite Evaluation, and Multimodal/Audio. The goal is a prompt that produces reliable, deterministic results across repeated evaluations.

Core principles

  • Specificity over generality. Define exact evaluation criteria rather than subjective assessments. Use concrete, measurable behaviors instead of abstract concepts, and provide clear boundary conditions for edge cases.
  • Role consistency. Always refer to the AI agent as “the assistant” and to human participants as “the user” or “the customer.” Keep terminology consistent throughout the prompt.
  • Deterministic design. Structure prompts to minimize LLM variance. Provide explicit decision trees where possible and define what constitutes partial vs. complete success.

Best practices

  • Be objective. “Did the assistant acknowledge the user’s concern within their first two responses?” is verifiable; “Did the assistant provide good customer service?” is not.
  • Single focus. One metric should measure one thing. Split resolution and professionalism into separate metrics rather than asking “Did the assistant resolve the issue and stay professional?”
  • Clear logic. Use explicit AND/OR and ANY/ALL operators. When using OR conditions, make it explicit that the result applies if any condition is met (e.g., “Return YES if ANY of the following apply”). Make sure your evaluation logic matches the question — a question phrased with “or” but evaluated with “and” produces wrong results.
Use Coval’s Optimize Metric / Improve Metric button to refine prompt clarity and confidence. Iterate against real test transcripts and aim for >90% consistency across similar evaluations.
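One way to check the >90% consistency target is to score the same transcript several times and measure how often the runs agree. The following is a minimal sketch, assuming you have already collected the outputs of repeated runs yourself; `consistency_rate` and the sample data are illustrative names and values, not part of Coval.

```python
from collections import Counter


def consistency_rate(results):
    """Share of repeated runs that agree with the most common result."""
    if not results:
        raise ValueError("need at least one result")
    _, top_count = Counter(results).most_common(1)[0]
    return top_count / len(results)


# Ten hypothetical runs of one Binary metric on the same transcript.
runs = ["YES"] * 9 + ["NO"]
rate = consistency_rate(runs)
meets_target = rate > 0.90  # the target is strictly above 90%

print(f"consistency: {rate:.0%}, meets target: {meets_target}")
# prints "consistency: 90%, meets target: False"
```

A transcript that falls at or below the target is a good candidate for tighter criteria or an added few-shot example.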
The strongest signal for improving an LLM judge prompt is human review. Have reviewers label real transcripts with ground truth, then compare those labels against the metric’s output and tighten the prompt wherever they disagree — the disagreements pinpoint exactly which cases the prompt gets wrong. See Human Reviews.
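The comparison between reviewer labels and metric output can be sketched as below. This assumes both sets of labels have been collected into dictionaries keyed by transcript id; the function name, the ids, and the labels are hypothetical and only illustrate the bookkeeping.

```python
def compare_to_human(human_labels, metric_outputs):
    """Return (agreement rate, ids where judge and reviewer disagree)."""
    shared = sorted(human_labels.keys() & metric_outputs.keys())
    if not shared:
        raise ValueError("no transcripts in common")
    disagreements = [t for t in shared if human_labels[t] != metric_outputs[t]]
    return 1 - len(disagreements) / len(shared), disagreements


# Hypothetical ground-truth labels and metric results for four transcripts.
human = {"t1": "YES", "t2": "NO", "t3": "YES", "t4": "NO"}
judge = {"t1": "YES", "t2": "YES", "t3": "YES", "t4": "NO"}

agreement, to_review = compare_to_human(human, judge)
print(agreement, to_review)  # prints "0.75 ['t2']"
```

The returned ids are the transcripts to reread when tightening the prompt.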

Advanced techniques

For complex evaluations, structure the prompt to guide the model’s reasoning:
Chain of thought — Ask the model to reason through intermediate steps before deciding:
Before making your final determination, consider:
1. What was the user's primary goal?
2. What actions did the assistant take?
3. What was the final outcome?
4. Did the outcome match the user's goal?

Based on this analysis, did the assistant successfully resolve the user's issue?
Few-shot examples — Anchor edge cases with concrete examples:
Examples of what constitutes resolution:
• User: "That fixed it, thanks!" → YES
• User: "I'll try that and call back if needed" → YES
• User: "This is too complicated, forget it" → NO
• User hangs up without confirmation → NO
Hierarchical decision making — Gate the evaluation through ordered steps so ambiguous cases fall to a conservative default.
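The gating idea can be illustrated in code before it is written out as prompt steps. This is a sketch only: the three boolean signals and their order are assumptions chosen for the example, and in a real metric the judge would derive them from the transcript by following the ordered steps in the prompt.

```python
def resolve_verdict(user_rejected, user_confirmed, solution_accepted):
    """Walk ordered gates; the first gate that fires decides the result."""
    # Gate 1: an explicit statement that the issue is unresolved wins outright.
    if user_rejected:
        return "NO"
    # Gate 2: explicit confirmation of resolution.
    if user_confirmed:
        return "YES"
    # Gate 3: weaker, implicit evidence (solution accepted without objection).
    if solution_accepted:
        return "YES"
    # No gate fired: the case is ambiguous, so use the conservative default.
    return "NO"


print(resolve_verdict(False, False, False))  # prints "NO"
print(resolve_verdict(False, False, True))   # prints "YES"
print(resolve_verdict(True, True, True))     # prints "NO"
```

Putting the strongest evidence first means weaker signals are consulted only when nothing more definite is present.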
Example prompt — Issue resolution detection, using ANY-of logic with explicit YES/NO criteria:
Given the transcript, did the assistant successfully resolve the user's primary issue or concern?

Return YES if ANY of the following apply:
• The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you")
• OR the assistant provides a complete solution and the user accepts it without further objection
• OR the user indicates satisfaction with the outcome before ending the conversation
• OR the assistant completes a requested action and the user acknowledges success
• OR the user's question was fully answered and they don't ask follow-up questions about the same issue
• OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries)

Return NO if ANY of the following apply:
• The user states their issue remains unresolved
• OR the conversation ends without addressing the user's main concern
• OR the user expresses frustration or dissatisfaction with the proposed solution
• OR the assistant escalates or transfers the issue without providing any resolution attempt
• OR the user has to repeat their problem multiple times without progress
• OR the assistant admits they cannot help or solve the user's problem

Common issues

| Issue | Solution |
| --- | --- |
| Inconsistent scoring | Add more specific criteria and examples |
| Edge case failures | Include explicit handling for boundary conditions |
| LLM hallucination | Use more structured prompts with clear constraints |
| Low correlation | Ensure the metric measures what you intend to measure |
Keep prompts under ~2,000 characters where possible — shorter, focused prompts are faster and more consistent.

Template variables

You can embed dynamic values from agents, test cases, and uploaded conversations into your metric prompts using template variables. The system replaces each placeholder with the actual value for the simulation or conversation being evaluated, so one metric adapts to many configurations.

{{agent.*}} — Agent attributes

Reference custom properties defined on the agent configuration with {{agent.attribute_name}}. The placeholder is replaced with the attribute value from the agent under evaluation.
Given the transcript, did the assistant provide the correct opening hours?

The correct opening hours are: {{agent.opening_hours}}

Return YES if ANY of the following apply:
• The assistant stated the opening hours as {{agent.opening_hours}}
• OR the assistant provided equivalent hours in a different format (e.g., "9 AM to 5 PM" matches "9:00am-5:00pm")

Return NO if ANY of the following apply:
• The assistant provided different opening hours than {{agent.opening_hours}}
• OR the assistant claimed not to know the opening hours

{{expected_output.*}} — Test case expected outputs

Reference values from a test case’s expected output with {{expected_output.<key>}}. For a test case whose expected output carries source, destination, and ticket_class:
Given the transcript, did the assistant correctly process the flight booking request?

The booking details are:
- Source: {{expected_output.source}}
- Destination: {{expected_output.destination}}
- Ticket Class: {{expected_output.ticket_class}}

Return YES if:
• The assistant confirmed all three details correctly using the exact values:
  {{expected_output.source}}, {{expected_output.destination}}, and {{expected_output.ticket_class}}
You can combine agent and expected-output attributes in the same prompt for comprehensive, context-aware evaluations.
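To make the substitution step concrete, here is a minimal sketch of how placeholders from both namespaces could be filled in. It is not Coval's implementation: the `render` function, the regular expression, the sample values, and the choice to leave unknown placeholders untouched are all assumptions for illustration, and it handles only flat keys.

```python
import re

# Matches {{agent.key}} and {{expected_output.key}} with flat keys only.
PLACEHOLDER = re.compile(r"\{\{\s*(agent|expected_output)\.(\w+)\s*\}\}")


def render(prompt, agent, expected_output):
    """Replace known placeholders with values from the matching namespace."""
    sources = {"agent": agent, "expected_output": expected_output}

    def substitute(match):
        namespace, key = match.group(1), match.group(2)
        values = sources[namespace]
        if key not in values:
            return match.group(0)  # assumption: leave unknown keys visible
        return str(values[key])

    return PLACEHOLDER.sub(substitute, prompt)


prompt = (
    "Route {{expected_output.source}} to {{expected_output.destination}}; "
    "office hours {{agent.opening_hours}}."
)
rendered = render(
    prompt,
    agent={"opening_hours": "9:00am-5:00pm"},
    expected_output={"source": "SFO", "destination": "JFK"},
)
print(rendered)  # prints "Route SFO to JFK; office hours 9:00am-5:00pm."
```

Rendering a prompt this way against a sample agent and test case is a quick check for misspelled attribute names before running a full evaluation.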

{{customer_metadata.<key>}} — Upload-time values

Reference per-conversation values you provide at upload time with {{customer_metadata.<key>}}. This is ideal for evaluating monitored production calls against ground truth known only at upload time — prices, order totals, confirmation numbers, expected outcomes. Each conversation is judged against its own metadata value. See Customer Metadata (Upload-Time Values).
For comprehensive documentation — nested paths, array indexing, dynamic keys, and full examples — see Attributes.

Transcript scope

Transcript Scope focuses metric evaluation on specific portions of a conversation rather than the entire transcript, reducing noise and improving accuracy for targeted assessments. It is available for all LLM Judge metrics (Binary, Numerical, Categorical) and Multimodal/Audio LLM Judge metrics.

Configuration

Set the Transcript Scope toggle to:
  • Full (default) — Evaluate the entire transcript.
  • Custom — Apply filters to focus on specific messages.
Role filter. Limit evaluation to messages from specific speakers:
  • Agent — Only assistant/agent messages
  • User — Only user/customer messages
  • Both — Messages from both the agent and the user
Useful when you want to assess agent behavior without user input affecting the evaluation, or vice versa.
Range filter. Limit evaluation to a portion of the conversation:
  • Last N turns — Only the final N message exchanges
  • First N turns — Only the opening N message exchanges
Useful for evaluating specific phases such as greetings, closings, or resolution attempts.
| Use case | Filter configuration |
| --- | --- |
| Evaluate only agent responses | Role filter: agent |
| Check the closing of a conversation | Range filter: Last 3 turns |
| Assess user sentiment only | Role filter: user |
| Focus on recent context | Range filter: Last N turns |
Combine filters for precise control. A Role filter (agent) plus a Range filter (last 5 turns) evaluates just the agent’s closing performance.
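The effect of combining filters can be sketched as follows. This is an illustration under stated assumptions, not Coval's implementation: it treats each message as one turn, applies the range filter before the role filter, and uses a hypothetical `Message` type and `apply_scope` function.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "agent" or "user"
    text: str


def apply_scope(messages, roles=None, first_n=None, last_n=None):
    """Apply the range filter first, then keep only the selected roles."""
    scoped = list(messages)
    if first_n is not None:
        scoped = scoped[:first_n]
    if last_n is not None:
        # Guard against last_n == 0, since scoped[-0:] would keep everything.
        scoped = scoped[-last_n:] if last_n > 0 else []
    if roles is not None:
        scoped = [m for m in scoped if m.role in roles]
    return scoped


transcript = [
    Message("agent", "Hello, how can I help?"),
    Message("user", "My order is late."),
    Message("agent", "Let me check."),
    Message("user", "Thanks."),
    Message("agent", "It ships tomorrow."),
    Message("user", "Great, bye."),
]

closing = apply_scope(transcript, roles={"agent"}, last_n=3)
print([m.text for m in closing])  # prints "['It ships tomorrow.']"
```

Under these assumptions, a role filter combined with a short range can leave very few messages, so check that the scoped portion still contains the behavior the metric is meant to judge.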

Scope for audio metrics

When using Transcript Scope with Multimodal/Audio LLM Judge metrics, Coval automatically:
  1. Filters the transcript to the selected messages.
  2. Uses message timestamps to extract the corresponding audio segments.
  3. Merges adjacent segments (within 0.5 seconds) to avoid artifacts.
  4. Sends only the filtered audio to the LLM.
This enables focused audio evaluations — e.g., the agent’s speech quality in the last 3 turns — while reducing processing time and token cost.
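Step 3 is a standard interval merge. A minimal sketch, assuming segments are `(start, end)` pairs in seconds and that a gap of exactly 0.5 seconds still counts as adjacent; the function name and sample timestamps are illustrative, not Coval's implementation.

```python
def merge_segments(segments, max_gap=0.5):
    """Merge (start, end) segments whose gap is at most max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical agent speech segments taken from message timestamps.
segments = [(0.0, 2.1), (2.4, 5.0), (9.0, 11.5), (11.5, 12.0)]
print(merge_segments(segments))  # prints "[(0.0, 5.0), (9.0, 12.0)]"
```

Merging leaves two continuous clips instead of four short ones, which avoids audible cuts at nearly touching boundaries.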

Benefits

  • More accurate — Removes noise from irrelevant messages.
  • Lower cost — Processes less content per evaluation.
  • Faster — Smaller context means quicker LLM responses.

Trace context

Trace Context gives an LLM Judge or Composite Evaluation metric visibility into what your agent actually did — not just what it said — by including OpenTelemetry span data alongside the transcript. When Include Traces is enabled on a custom transcript scope, the judge receives a TRACE CONTEXT: block appended to its prompt. This block summarizes the conversation’s OTel spans: span names, timing windows, and key attributes like tool call names and function arguments. For Composite Evaluation metrics the same block is appended to every per-criterion prompt.

When to enable

Trace context is most valuable when the behavior you want to evaluate isn’t visible in the transcript alone:
| Use case | Why traces help |
| --- | --- |
| Verify the agent used the right tools in the right order | Tool call spans show what functions were invoked and with what arguments |
| Catch hallucinations (agent claimed to do something it didn’t) | Trace spans show whether the action actually occurred |
| Evaluate retrieval quality | Retrieval spans show what data was fetched before the agent responded |
| Assess error handling | Error spans reveal failures the agent may have silently recovered from |

How to enable

  1. Open or create a supported metric — LLM Judge (Binary, Numerical, Categorical, or Audio) or Composite Evaluation.
  2. Set Transcript Scope to Custom.
  3. In the custom scope panel, toggle Include Traces on.
The trace context is appended automatically — no prompt changes are required, though referencing it explicitly improves results.

Requirements

  • Your agent must emit OpenTelemetry traces to Coval. See the OpenTelemetry Traces guide for setup.
  • The simulation must have produced trace data. If none is available, the toggle has no effect and the prompt is sent without a trace block.
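The second requirement can be pictured with a small sketch of prompt assembly. Everything here is hypothetical: the span dictionary shape, the line format inside the block, and the `build_judge_prompt` function are invented for illustration, and only the `TRACE CONTEXT:` label and the kinds of fields summarized come from the description above.

```python
def build_judge_prompt(metric_prompt, transcript, spans):
    """Append a TRACE CONTEXT block only when span data exists."""
    parts = [metric_prompt, "TRANSCRIPT:", transcript]
    if spans:
        parts.append("TRACE CONTEXT:")
        for span in spans:
            # Hypothetical one-line summary: name, timing window, attributes.
            parts.append(
                f"- {span['name']} [{span['start']:.1f}s-{span['end']:.1f}s] "
                f"{span.get('attributes', {})}"
            )
    return "\n".join(parts)


spans = [
    {"name": "tool_call", "start": 3.2, "end": 3.9,
     "attributes": {"function": "lookup_account"}},
]
with_traces = build_judge_prompt("Did the assistant look up the account?",
                                 "user: What is my balance?", spans)
without_traces = build_judge_prompt("Did the assistant look up the account?",
                                    "user: What is my balance?", [])

print("TRACE CONTEXT:" in with_traces)     # prints "True"
print("TRACE CONTEXT:" in without_traces)  # prints "False"
```

Because the block is simply absent when no spans were captured, a prompt that depends on traces should say what to do without them.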

Writing prompts that leverage trace context

Reference the trace data explicitly and instruct the judge to reason about both sources. For example, to verify tool usage:
Given the transcript and trace context, did the assistant call the `lookup_account`
function before providing account balance information?

Return YES if:
• The TRACE CONTEXT shows a tool call to `lookup_account` (or equivalent) occurring
  before the agent stated the balance
• The transcript confirms the agent provided balance details

Return NO if:
• The agent mentioned account balance but no `lookup_account` tool call appears in the TRACE CONTEXT
• The tool call appears AFTER the agent stated the balance (out of order)
• The TRACE CONTEXT shows a failed or missing tool call for this operation

Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone.
Add “Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone” to your prompt. This makes the metric degrade gracefully on simulations where traces weren’t captured.