LLM Judge Metrics

LLM Judge metrics ask a language model to evaluate a conversation against a question you write. They are the most flexible metric type: you supply the prompt and the judge returns a yes/no answer, a category, or a numeric score. Audio variants evaluate the recording directly instead of the transcript. See Write judge prompts for prompt-writing guidance shared across these metrics — the Configuration section of each entry only notes what you supply. Where Scope notes Trace context, the judge can also receive a summary of the agent’s OTel spans — tool calls, execution order, internal steps — alongside the transcript.

Metric	Tells you	Unit	Scope
Audio Binary LLM Judge	A yes/no question judged on the recording	YES / NO / UNKNOWN	Audio
Audio Categorical LLM Judge	Which category fits an audio quality	category	Audio
Audio Numerical LLM Judge	A graded score judged on the recording	score	Audio
Binary LLM Judge	The answer to your yes/no question	YES / NO / UNKNOWN	Transcript + Trace context
Categorical LLM Judge	Which of your categories fits the conversation	category	Transcript + Trace context
Composite Evaluation	Fraction of your criteria the conversation meets	0–1	Transcript + Trace context
Numerical LLM Judge	A graded score for your question	score	Transcript + Trace context

Binary LLM Judge

Evaluates a conversation transcript against a user-defined yes/no question using an LLM judge. This is the workhorse custom metric, best when a behavior is clearly either present or absent.

What it measures: Whether the behavior described in your question is present in the transcript, expressed as a yes/no determination.

When to use: Objective pass/fail checks such as “Did the assistant verify identity before discussing the account?”, “Did the assistant resolve the user’s primary issue?”, or “Did the assistant acknowledge the user’s concern within its first two responses?” Keep each metric to a single observation: split “resolved the issue and stayed professional” into two metrics.

How it works: The judge reads the transcript and your question, then returns YES, NO, or UNKNOWN based on the criteria you define for each outcome. When criteria use OR logic, state explicitly that ANY matching condition triggers the result, so the judge doesn’t silently require all of them.

How to interpret: YES means the criteria were met, NO means they were not, and UNKNOWN means the transcript lacks enough evidence to decide. A high UNKNOWN rate usually signals a vague question; tighten the YES/NO conditions with concrete, observable behaviors.

Configuration: A prompt stating the yes/no question and the explicit criteria for returning YES versus NO (use clear AND/OR logic and handle edge cases), plus an optional judge model selection.

Example: “Avoid Unresponsiveness”

Given the transcript, did the assistant maintain responsiveness by acknowledging all user inputs and avoiding behaviors that make the user question whether the assistant is still present?

Return YES if:
• The assistant responds promptly and appropriately to all user inputs
• There are no long silences, skipped questions, or ignored user messages
• The user does not need to ask “Are you still there?” or similar prompts

Return NO if:
• The assistant fails to respond to a user input
• The user asks “Are you still there?” or expresses concern about being ignored

Requires Transcript · Unit YES / NO / UNKNOWN · Scope Transcript + Trace context
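The interpretation guidance above suggests watching the UNKNOWN rate across a run. Here is a minimal sketch of that bookkeeping, assuming you have already exported the verdicts as a plain list of strings. The function name, the list, and the choice to compute the YES rate only over decided (YES or NO) verdicts are illustrative assumptions, not part of the platform.

```python
from collections import Counter


def summarize_verdicts(verdicts):
    """Summarize Binary LLM Judge verdicts from one run (hypothetical helper).

    Returns the YES rate among decided (YES/NO) verdicts and the share of
    all verdicts that came back UNKNOWN.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    decided = counts["YES"] + counts["NO"]
    return {
        "yes_rate": counts["YES"] / decided if decided else None,
        "unknown_rate": counts["UNKNOWN"] / total if total else None,
    }


# Hypothetical verdicts, for illustration only.
verdicts = ["YES", "NO", "UNKNOWN", "YES", "UNKNOWN", "YES"]
summary = summarize_verdicts(verdicts)
print(summary)
```

If the UNKNOWN share is high, that is the cue to revisit the question wording rather than the agent.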

Categorical LLM Judge

Evaluates a conversation transcript against a user-defined question using an LLM judge and returns exactly one category from a configured list. Useful when the answer is a label rather than a pass/fail or a score.

What it measures: Which single category, from a set you define, best classifies the conversation.

When to use: Classification tasks like call-intent labeling (technical support, billing inquiry, account management, complaint escalation) or outcome tagging (resolved, partially resolved, escalated to human, abandoned), especially for exploratory analysis of what kinds of calls your agent handles.

How it works: The judge reads the transcript and your decision logic, a list of “if condition, classify as CATEGORY” rules, then returns the one category name that matches.

How to interpret: The returned value is the category the judge selected; categories are mutually exclusive, so each conversation maps to exactly one. If many conversations land in a catch-all bucket or the judge oscillates between two labels, your category definitions overlap or are missing a real class.

Configuration: A prompt with decision logic mapping conditions to categories, the category options and their definitions set in the platform’s category menu (not the prompt text), and an optional judge model selection.

Requires Transcript · Unit one configured category · Scope Transcript + Trace context
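One way to spot an overloaded catch-all bucket is to look at the share of conversations per category. The sketch below assumes the returned labels have been exported as a list of strings; the helper name and the "other" catch-all label are hypothetical.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of all labels, most common first."""
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.most_common()}


# Hypothetical labels, for illustration only; "other" stands in for a catch-all.
labels = [
    "billing inquiry",
    "other",
    "technical support",
    "other",
    "other",
    "billing inquiry",
]
shares = category_shares(labels)
catch_all_share = shares.get("other", 0.0)
print(shares, catch_all_share)
```

A catch-all share that dominates the distribution suggests the category list is missing a real class.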

Numerical LLM Judge

Evaluates a conversation transcript against a user-defined question using an LLM judge and returns a numeric score within a configured min–max range. Reach for it when “how well” matters more than “whether”.

What it measures: A graded score for the aspect described in your question, on the scale you configure.

When to use: Graded qualities such as rating the assistant’s empathy or professionalism on a 1–5 scale, or scoring the technical accuracy and completeness of its explanations. In short, anything with degrees of success rather than a clean yes/no.

How it works: The judge reads the transcript and your scoring guidelines, then returns a single number within the configured range. Anchoring both ends of the scale to concrete behavioral indicators (what low scores look like, what high scores look like) keeps scoring consistent across evaluations.

How to interpret: Higher or lower scores carry the meaning you assign in your scoring guidelines; compare against your defined low- and high-score behaviors. Scores are most useful as trends across runs: a drop after a prompt change is a stronger signal than any single absolute value.

Configuration: A prompt with evaluation criteria and scoring guidelines, the Min and Max score values set in the platform interface (not the prompt text), and an optional judge model selection.

Requires Transcript · Unit numeric score within the configured range · Scope Transcript + Trace context
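Because trends matter more than single values, a simple comparison of mean scores between two runs is often enough to flag a regression. The sketch below assumes scores on a 1–5 scale exported as plain lists; the helper name and the sample data are hypothetical.

```python
from statistics import mean


def mean_shift(baseline_scores, candidate_scores):
    """Return the change in mean score from the baseline run to the candidate run."""
    return mean(candidate_scores) - mean(baseline_scores)


# Hypothetical 1-5 scores before and after a prompt change, for illustration only.
baseline = [4, 5, 4, 4, 5]
candidate = [3, 4, 3, 4, 3]
shift = mean_shift(baseline, candidate)
print(f"mean score changed by {shift:+.2f}")
```

A negative shift after a prompt change is the kind of signal worth investigating, even when each individual score still looks acceptable.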

Audio Binary LLM Judge

Evaluates an audio recording against a user-defined yes/no question using an LLM judge. The audio counterpart to the Binary LLM Judge, for qualities that text analysis cannot capture.

What it measures: Whether an audio-only quality described in your question (tone, clarity, pacing, stuttering) is present in the recording.

When to use: Yes/no questions only the audio can answer, such as “Did the assistant stutter?”, “Did the assistant speak clearly and at an appropriate pace?”, or “Did the assistant maintain a professional vocal tone?” If the transcript alone can answer it (word choice, script compliance), use the text Binary LLM Judge instead; it is faster and cheaper.

How it works: The judge listens to the recording (not the transcript), applies your question, then returns YES, NO, or UNKNOWN. Always specify whose audio to evaluate (the assistant, the user, or both) so the target is unambiguous when multiple voices are present.

How to interpret: YES means the audio criteria were met, NO means they were not, and UNKNOWN means there is not enough evidence to decide. Vague criteria like “good tone” produce inconsistent results; frequent UNKNOWNs or flip-flopping verdicts mean you should restate the question in observable acoustic terms (e.g., “calm, even-paced tone without audible frustration”).

Configuration: A prompt stating the yes/no question and concrete acoustic criteria for YES versus NO, plus an optional audio scope to restrict evaluation to one speaker (defaults to the full recording).

Requires Audio · Unit YES / NO / UNKNOWN · Scope Audio

Audio Categorical LLM Judge

Evaluates an audio recording against a user-defined question using an LLM judge and returns exactly one category from a configured list. The audio counterpart to the Categorical LLM Judge.

What it measures: Which single category, from a set you define, best classifies an audio quality of the recording.

When to use: Classifying vocal qualities into one bucket, such as labeling the assistant’s overall tone (e.g., warm, neutral, dismissive), the dominant emotion audible in the user’s voice (enthusiasm, frustration, sarcasm), or the recording’s audio quality (clear, muffled, distorted).

How it works: The judge listens to the recording, applies your decision logic, then returns the one matching category name. Specify whose audio each rule applies to and define categories by observable acoustic traits, not just labels.

How to interpret: The returned value is the selected category; categories are mutually exclusive, so each recording maps to exactly one. Inconsistent labels across similar recordings usually mean the category definitions describe content rather than sound; anchor them to acoustic qualities like pace, volume, and inflection.

Configuration: A prompt with decision logic, the category options and definitions set in the platform’s category menu, and an optional audio scope to restrict evaluation to one speaker.

Requires Audio · Unit one configured category · Scope Audio

Audio Numerical LLM Judge

Evaluates an audio recording against a user-defined question using an LLM judge and returns a numeric score within a configured min–max range. The audio counterpart to the Numerical LLM Judge.

What it measures: A graded score for an audio-only quality of the recording, on the scale you configure.

When to use: Grading vocal qualities by degree, such as rating speech clarity or pronunciation 1–5, scoring vocal empathy (does the assistant’s tone audibly soften when the user expresses frustration?), or rating professional vocal demeanor across a call.

How it works: The judge listens to the recording, applies your scoring guidelines, then returns a single number within the configured range. Anchor low and high scores to concrete acoustic indicators (mumbled or rushed speech versus distinct pronunciation at a comfortable pace) and specify whose audio to grade.

How to interpret: Higher or lower scores carry the meaning you assign in your scoring guidelines; compare against your defined low- and high-score behaviors. As with other graded metrics, trends across runs and voice configurations are more informative than a single score.

Configuration: A prompt with audio analysis criteria and scoring guidelines, the Min and Max score values set in the platform interface, and an optional audio scope to restrict evaluation to one speaker.

Requires Audio · Unit numeric score within the configured range · Scope Audio

Composite Evaluation

Evaluates a conversation transcript against a list of criteria and reports the fraction met among the criteria the judge could evaluate. One metric covers an entire checklist instead of requiring a separate binary metric per item.

What it measures: How many of several requirements a conversation satisfies at once, for example whether the agent greeted the customer, verified identity, and offered a resolution.

When to use: Checklist-style evaluations: did the response cover all required talking points, did the call follow each step of a compliance checklist, or did each test case meet its own Expected Behaviors. Pull criteria from test cases when expectations differ per scenario; use a static list when every conversation must meet the same bar.

How it works: The judge evaluates each criterion independently, marking it MET, NOT_MET, or UNKNOWN, then scores the fraction met among the criteria it could evaluate (UNKNOWN criteria are excluded). The judge matches semantically, so equivalent phrasings count, but it cannot infer intent from ambiguous criteria.

How to interpret: The score is 0–1, where 1.0 means every evaluated criterion was satisfied. A per-criterion breakdown shows which passed or failed and why. Frequent UNKNOWN results usually mean a criterion is too vague; rewrite it as actor + specific action + specific outcome (e.g., “Agent confirms the appointment date, time, and provider name” rather than “Agent schedules the appointment”).

Configuration: A criteria list (static, or pulled from each test case’s Expected Behaviors), an optional custom evaluation prompt giving the judge domain context, and a reporting method: fraction of criteria met (default), count met, or all-criteria-met.

Requires Transcript · Unit score from 0–1 (fraction of evaluated criteria met) · Scope Transcript + Trace context
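The scoring rule described above (fraction met among evaluated criteria, with UNKNOWN excluded) can be sketched in a few lines. This is an illustration of the arithmetic only: the dictionary shape, the function names, and the choice to return None when nothing could be evaluated are assumptions, since the platform’s actual output format is not shown in this reference.

```python
def composite_score(results):
    """Fraction of criteria marked MET among those that are not UNKNOWN.

    `results` maps a criterion's text to "MET", "NOT_MET", or "UNKNOWN".
    Returns None when every criterion is UNKNOWN (an assumption of this sketch).
    """
    evaluated = [status for status in results.values() if status != "UNKNOWN"]
    if not evaluated:
        return None
    return sum(status == "MET" for status in evaluated) / len(evaluated)


def count_met(results):
    """Number of criteria marked MET (the count-met reporting method)."""
    return sum(status == "MET" for status in results.values())


# Hypothetical per-criterion breakdown, for illustration only.
results = {
    "Agent greets the customer": "MET",
    "Agent verifies identity": "NOT_MET",
    "Agent offers a resolution": "MET",
    "Agent confirms a callback time": "UNKNOWN",
}
score = composite_score(results)  # 2 MET out of 3 evaluated criteria
met = count_met(results)
print(score, met)
```

Note that the UNKNOWN criterion shrinks the denominator rather than counting as a failure, which is why a vague criterion can inflate the score without being noticed unless you also check the per-criterion breakdown.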
