Why human review
Human review is how you know your metrics can be trusted. Reviewers label real conversations with ground-truth values; Coval compares those labels against each metric’s output and shows you exactly where they agree and disagree.- Validate your metrics — every reviewed metric gets an agreement rate (human vs. machine), so you know which scores to trust before acting on them.
- Improve your prompts — disagreements pinpoint the exact conversations an LLM judge gets wrong. Ground-truth labels are the strongest signal for refining a judge prompt.
- Build ground truth at scale — projects auto-add new runs, notify assignees by email, and track completion, so labeling becomes a routine instead of a one-off effort.
Review projects
A review project scopes a labeling effort: pick the metrics to validate, the simulations to label, and the reviewers. Coval generates an annotation task for every combination and tracks progress per project and per assignee. Two project modes:- Individual (default) — each reviewer labels independently in a private queue. Use this to measure inter-annotator agreement or collect multiple perspectives.
- Collaborative — reviewers share one queue and produce a single canonical label per simulation × metric. Use this to divide labeling work across a team without duplication.
- Auto-add new runs — link incoming simulations to the project automatically based on your configured rules; assignees get an email when new work arrives.
- Progress tracking — per-project and per-assignee completion, visible from the Projects tab.
- Disagreement notes — optionally require reviewers to explain whenever they disagree with the metric’s output, so every disagreement comes with a reason you can act on.
- Duplicate a project to reuse its metrics, assignees, and settings for the next batch.
Any conversation can also be annotated ad hoc from its results page — no project required. Projects add assignment, progress tracking, and agreement analytics on top.
Agreement insights
Once labels come in, Coval turns them into metric diagnostics:- Per-metric agreement rate — how often the metric matched human ground truth, with a drill-down of the exact simulations that disagreed.
- Inter-annotator agreement — for Individual projects, how consistently your reviewers agree with each other. Low human agreement usually means the metric’s definition is ambiguous — fix the criteria before fixing the prompt.
- Agreement on the metric page — each reviewed metric surfaces its agreement stats where you edit it, so you can judge and improve it in one place.
The improvement loop
- Label — reviewers provide ground truth through a review project.
- Measure — check the metric’s agreement rate against the human labels.
- Diagnose — read the disagreeing conversations and reviewer notes to see what the metric misses.
- Revise — tighten the prompt for exactly those cases.
- Re-test — run the new version against your ground-truth labels and confirm agreement improved.
Repeat the loop until agreement plateaus. A metric validated this way gives you a trustworthy, automated stand-in for human judgment on every future run.
Supported Metric Types
Not all metrics support human review — only those with a defined annotation mechanism can be labeled in the review interface. Metrics fall into four categories based on how reviewers interact with them.Direct Value Metrics
Reviewers provide a single value for the entire conversation using buttons, a number input, or a dropdown.Binary (Pass/Fail)
Reviewers select Yes, No, or N/A using on-screen buttons or keyboard shortcuts.- Applies to: binary LLM judge metrics, audio binary judge, agent repeats itself
Numerical
Reviewers enter a number within a configured min/max range.- Applies to: numerical LLM judge, audio numerical judge
Categorical
Reviewers select from a configured list of categories using a dropdown.- Applies to: categorical LLM judge, audio categorical judge
Transcript Sentiment Analysis
Reviewers select a sentiment label (e.g. Rude, Polite, Encouraging, Professional) using category buttons.Composite Evaluation
Reviewers assess each criterion individually using MET / NOT_MET / UNKNOWN toggles.Audio Region Metrics
Reviewers mark or edit regions on an audio waveform timeline. These metrics require an audio recording to be present on the conversation. Includes: interruption rate, latency, abrupt pitch changes, volume/pitch misalignment, non-expressive pauses, vocal fry, music detection, time to first audio, volume variance, custom pause analysis, agent needs reprompting.Per-Segment Labeling
Reviewers assign a label to each speaking segment in the conversation.- Audio sentiment — label each segment as Neutral, Angry, Happy, or Sad
Per-Message Review
Reviewers provide a value for each individual message in the transcript.- Words per message — count of words per assistant message
Keyboard-first reviewing
The review interface is built for fast labeling — navigate rows withj / k, open with Enter, and move between neighboring conversations with h / l. See the Keyboard Navigation guide for the full model.
