# Write judge prompts

> Prompt structure, examples, and techniques that make LLM-judge metrics score reliably.

These techniques apply to any LLM judge metric: [Binary, Categorical, Numerical](/concepts/metrics/types/llm-judge), [Composite Evaluation](/concepts/metrics/types/llm-judge#composite-evaluation), and [Audio/Multimodal](/concepts/metrics/types/llm-judge). The goal is a prompt that returns the same result when the same transcript is evaluated repeatedly.

For the values you can inject into a prompt and the scope it runs against, see [Configure metrics](/concepts/metrics/configuring-metrics).

## Core principles

**Specificity over generality** — Define exact evaluation criteria rather than subjective assessments. Use concrete, measurable behaviors instead of abstract concepts, and provide clear boundary conditions for edge cases.

**Role consistency** — Always refer to the AI agent as **"the assistant"** and to human participants as **"the user"** or **"the customer."** Keep terminology consistent throughout the prompt.

**Deterministic design** — Structure prompts to minimize LLM variance. Provide explicit decision trees where possible and define what constitutes partial vs. complete success.

## Best practices

* **Be objective.** "Did the assistant acknowledge the user's concern within their first two responses?" is verifiable; "Did the assistant provide good customer service?" is not.
* **Single focus.** One metric should measure one thing. Split resolution and professionalism into separate metrics rather than asking "Did the assistant resolve the issue *and* stay professional?"
* **Clear logic.** Use explicit AND/OR and ANY/ALL operators. When using OR conditions, make it explicit that the result applies if **any** condition is met (e.g., "Return YES if ANY of the following apply"). Make sure your evaluation logic matches the question — a question phrased with "or" but evaluated with "and" produces wrong results.
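
To see why the operator matters, here is a minimal Python sketch of the same observations evaluated with ANY versus ALL. The criterion names are hypothetical and exist only for illustration; they are not part of any Coval API.

```python
# Hypothetical criteria observed in one transcript (illustrative names only).
observed = {
    "user_confirmed_resolution": False,
    "user_accepted_solution": True,
}


def any_of(conditions: dict[str, bool]) -> bool:
    """OR logic: the result applies if at least one condition is met."""
    return any(conditions.values())


def all_of(conditions: dict[str, bool]) -> bool:
    """AND logic: the result applies only if every condition is met."""
    return all(conditions.values())


or_result = any_of(observed)    # True: one condition is enough
and_result = all_of(observed)   # False: every condition must hold
```

The same transcript scores differently depending on the operator, which is why the prompt should state ANY or ALL explicitly rather than leave it to the judge to infer.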

<Tip>
  Use Coval's **Optimize Metric** / **Improve Metric** button to refine prompt clarity and confidence. Iterate against real test transcripts and aim for >90% consistency across similar evaluations.
</Tip>
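
One way to put a number on consistency is to run the metric several times on the same transcript and measure how often the most common verdict appears. The sketch below assumes that definition, which is an illustration and not necessarily how Coval computes it; the `consistency` helper and the sample verdicts are hypothetical.

```python
from collections import Counter


def consistency(verdicts: list[str]) -> float:
    """Share of runs that returned the most common verdict (assumed definition)."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


# Ten hypothetical repeated evaluations of one transcript.
runs = ["YES"] * 9 + ["NO"]
score = consistency(runs)  # 9 of 10 runs agree
```

Under this definition, nine matching verdicts out of ten gives exactly 90%, which sits at the target rather than above it, so the prompt would still need tightening.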

The strongest signal for improving an LLM judge prompt is **human review**. Have reviewers label real transcripts with ground truth, then compare those labels against the metric's output and tighten the prompt wherever they disagree — the disagreements pinpoint exactly which cases the prompt gets wrong. See [Human Reviews](/concepts/metrics/human-review/human-review).
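
The comparison itself can be as simple as the following Python sketch. It assumes labels and metric results are available as dictionaries keyed by transcript ID; the data shapes, IDs, and the `compare_labels` helper are hypothetical and do not reflect a Coval export format.

```python
def compare_labels(
    human: dict[str, str], judge: dict[str, str]
) -> tuple[float, list[str]]:
    """Return the agreement rate and the transcript IDs where the two disagree."""
    # Only transcripts that have both a human label and a judge result count.
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no transcript has both a human label and a judge result")
    disagreements = [tid for tid in shared if human[tid] != judge[tid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Hypothetical ground-truth labels and metric outputs.
human_labels = {"t1": "YES", "t2": "NO", "t3": "YES", "t4": "NO"}
judge_results = {"t1": "YES", "t2": "YES", "t3": "YES", "t4": "NO"}

rate, to_review = compare_labels(human_labels, judge_results)
```

The list of disagreeing transcript IDs is the useful part: each one is a concrete case to read before rewording the prompt.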

## Advanced techniques

For complex evaluations, structure the prompt to guide the model's reasoning:

**Chain of thought** — Ask the model to reason through intermediate steps before deciding:

```
Before making your final determination, consider:
1. What was the user's primary goal?
2. What actions did the assistant take?
3. What was the final outcome?
4. Did the outcome match the user's goal?

Based on this analysis, did the assistant successfully resolve the user's issue?
```

**Few-shot examples** — Anchor edge cases with concrete examples:

```
Examples of what constitutes resolution:
• User: "That fixed it, thanks!" → YES
• User: "I'll try that and call back if needed" → YES
• User: "This is too complicated, forget it" → NO
• User hangs up without confirmation → NO
```
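
If the examples live alongside labeled transcripts, a small helper can render them into the block format shown above. This is a sketch only; `format_few_shot` is a hypothetical helper and not a Coval feature.

```python
def format_few_shot(heading: str, examples: list[tuple[str, str]]) -> str:
    """Render (situation, verdict) pairs as a bulleted few-shot block."""
    lines = [heading]
    for situation, verdict in examples:
        lines.append(f"• {situation} → {verdict}")
    return "\n".join(lines)


examples = [
    ('User: "That fixed it, thanks!"', "YES"),
    ("User hangs up without confirmation", "NO"),
]
block = format_few_shot("Examples of what constitutes resolution:", examples)
```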

**Hierarchical decision making** — Gate the evaluation through ordered steps so ambiguous cases fall to a conservative default.
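
The logic the prompt should describe can be sketched in Python as an ordered list of gates, where the first gate that fires decides the verdict and anything left over falls to the default. The gate names, their order, and the choice of NO as the conservative default are assumptions made for this illustration.

```python
from typing import Callable, Optional

# A gate inspects facts about the transcript and returns a verdict, or None to pass.
Gate = Callable[[dict], Optional[str]]


def unresolved_gate(facts: dict) -> Optional[str]:
    """Step 1 (hypothetical): an explicit 'still broken' from the user settles it."""
    return "NO" if facts.get("user_said_unresolved") else None


def confirmed_gate(facts: dict) -> Optional[str]:
    """Step 2 (hypothetical): an explicit confirmation from the user settles it."""
    return "YES" if facts.get("user_confirmed_resolution") else None


GATES: list[Gate] = [unresolved_gate, confirmed_gate]


def decide(facts: dict, default: str = "NO") -> str:
    """Walk the gates in order; ambiguous cases fall through to the default."""
    for gate in GATES:
        verdict = gate(facts)
        if verdict is not None:
            return verdict
    return default


clear_yes = decide({"user_confirmed_resolution": True})
ambiguous = decide({})  # no gate fires, so the default applies
conflict = decide({"user_said_unresolved": True, "user_confirmed_resolution": True})
```

Because the gates are ordered, a transcript that matches more than one step still gets a single, predictable verdict. Writing the prompt as numbered steps gives the judge the same property.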

<Info>
  **Example prompt** — Issue resolution detection, using ANY-of logic with explicit YES/NO criteria:

  ```
  Given the transcript, did the assistant successfully resolve the user's primary issue or concern?

  Return YES if ANY of the following apply:
  • The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you")
  • OR the assistant provides a complete solution and the user accepts it without further objection
  • OR the user indicates satisfaction with the outcome before ending the conversation
  • OR the assistant completes a requested action and the user acknowledges success
  • OR the user's question was fully answered and they don't ask follow-up questions about the same issue
  • OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries)

  Return NO if ANY of the following apply:
  • The user states their issue remains unresolved
  • OR the conversation ends without addressing the user's main concern
  • OR the user expresses frustration or dissatisfaction with the proposed solution
  • OR the assistant escalates or transfers the issue without providing any resolution attempt
  • OR the user has to repeat their problem multiple times without progress
  • OR the assistant admits they cannot help or solve the user's problem
  ```
</Info>

## Common issues

| Issue                             | Solution                                              |
| --------------------------------- | ----------------------------------------------------- |
| Inconsistent scoring              | Add more specific criteria and examples               |
| Edge case failures                | Include explicit handling for boundary conditions     |
| LLM hallucination                 | Use more structured prompts with clear constraints    |
| Low correlation with human labels | Ensure the metric measures what you intend to measure |

Keep prompts under \~2,000 characters where possible — shorter, focused prompts are faster and more consistent.
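
A quick length check before saving a prompt could look like the sketch below. The 2,000-character limit comes from the guideline above and is approximate; the `prompt_length_report` helper is hypothetical.

```python
MAX_PROMPT_CHARS = 2000  # approximate guideline, not a hard platform limit


def prompt_length_report(prompt: str, limit: int = MAX_PROMPT_CHARS) -> tuple[int, bool]:
    """Return the prompt's character count and whether it fits within the limit."""
    length = len(prompt)
    return length, length <= limit


prompt = "Given the transcript, did the assistant resolve the user's primary issue?"
length, within_limit = prompt_length_report(prompt)
```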

## Next

<CardGroup cols={2}>
  <Card title="Configure metrics" icon="sliders" href="/concepts/metrics/configuring-metrics">
    Inject template variables, scope the transcript, and add trace context.
  </Card>

  <Card title="Improve with human review" icon="user-check" href="/concepts/metrics/human-review/human-review">
    Use ground-truth labels to find exactly where a prompt is wrong.
  </Card>
</CardGroup>
