A single run shows how your agent performed in one configuration. A Report combines multiple runs into one view so you can compare them — grouped and color-coded by a dimension you choose, with aggregated statistics and a shareable link.
Creating a report
Reports are built from the Runs page, in one of two ways:
- From selected runs — turn on select mode, check the runs you want to compare (at least two), and click Create Report From Selection in the action bar.
- From the current filters — with filters applied to the runs list, click Multi-run report to build a report from every run matching those filters.
A report includes up to 50 runs; if more match, only the first 50 are used.
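The cap can be pictured as a filter followed by a truncation. The sketch below is illustrative only: the `Run` class, the `runs_for_report` helper, and the use of list order as "first" are assumptions, not the product's actual data model or ordering.

```python
from dataclasses import dataclass

MAX_RUNS_PER_REPORT = 50  # the limit stated above


@dataclass
class Run:  # hypothetical stand-in for a run record
    run_id: str
    agent: str


def runs_for_report(runs, matches):
    """Keep the runs that match the filter, then truncate to the cap."""
    matching = [r for r in runs if matches(r)]
    # Assumption: "first" means list order here.
    return matching[:MAX_RUNS_PER_REPORT]


all_runs = [Run(f"run-{i}", "agent-a" if i % 2 == 0 else "agent-b") for i in range(200)]
selected = runs_for_report(all_runs, lambda r: r.agent == "agent-a")
print(len(selected))  # 100 runs match the filter, 50 are kept
```

In practice this means narrowing the filters until fewer than 50 runs match if every matching run needs to appear in the report.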
You can also create a report from a scheduled run: open the scheduled run and click Create Report, then choose a timeframe (past 24 hours, week, 30 days, or a custom range) to pull its runs into a report.
Designing the comparison
A report is only as useful as the runs in it. Change one variable at a time and keep the others constant.
To compare two agents, run both against the same personas and the same test sets. Any difference in scores can then be attributed to the agent rather than to the personas or test cases. A test matrix is a reliable way to set this up: choose the variable you’re testing (here, the agent), then run each of its values against every combination of the other dimensions.
| | Persona: Calm | Persona: Impatient |
|---|---|---|
| Agent A | run | run |
| Agent B | run | run |
From a matrix like this, comparing by Agent isolates the agent and comparing by Persona isolates the persona, using the same set of runs.
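A matrix like the one above is the Cartesian product of its dimensions. The following minimal sketch enumerates the runs to launch; the dictionary shape and the `runs_where` helper are hypothetical and only illustrate the planning step, not how runs are actually created.

```python
from itertools import product

agents = ["Agent A", "Agent B"]
personas = ["Calm", "Impatient"]

# Each (agent, persona) pair is one run to launch.
matrix = [{"agent": a, "persona": p} for a, p in product(agents, personas)]

for cell in matrix:
    print(cell)


def runs_where(dimension, value):
    """Return the planned runs that share one value of a dimension."""
    return [c for c in matrix if c[dimension] == value]


# Both agents face the same set of personas, so score differences
# between agents are not explained by the personas.
personas_a = {c["persona"] for c in runs_where("agent", "Agent A")}
personas_b = {c["persona"] for c in runs_where("agent", "Agent B")}
print(personas_a == personas_b)
```

Adding a third dimension, such as a list of test sets, only requires passing it to `product` as well.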
Compare By
The Compare by dropdown groups the simulations by a dimension and lines each metric up across the groups.
| Compare by | Groups simulations by… |
|---|---|
| None | Nothing — all rows shown together |
| Run | The run each simulation belongs to |
| Agent | The agent that ran the simulation |
| Mutation | The agent mutation applied |
| Persona | The persona (simulated user) |
| Test case | The specific test case input |
| Metadata | A custom metadata key you choose |
| + Create dimension | A custom dimension you define — group runs however you like |
The mechanic is the same for each option; only the dimension changes. Compare by Agent to see where one build differs from another, by Persona to see how different users affect results, by Metadata to group on a key you set at launch (such as environment, version, or region), or create a custom dimension to group runs yourself — for example, grouping several builds as “v1” and “v2”.
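Conceptually, a custom dimension is a mapping from each run or build to a group label, followed by the same group-and-summarize step as any other option. The sketch below assumes a simple record shape; the build names, scores, and the `group_by` helper are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical simulation results: one score per simulation.
simulations = [
    {"agent": "build-101", "score": 70},
    {"agent": "build-102", "score": 80},
    {"agent": "build-201", "score": 90},
    {"agent": "build-202", "score": 100},
]

# Custom dimension: several builds grouped under "v1" and "v2".
custom_dimension = {
    "build-101": "v1",
    "build-102": "v1",
    "build-201": "v2",
    "build-202": "v2",
}


def group_by(sims, key_fn):
    """Bucket simulations by whatever label key_fn returns."""
    groups = defaultdict(list)
    for s in sims:
        groups[key_fn(s)].append(s)
    return dict(groups)


by_version = group_by(simulations, lambda s: custom_dimension[s["agent"]])
averages = {g: mean(s["score"] for s in sims) for g, sims in by_version.items()}
print(averages)
```

Swapping the `key_fn` for `lambda s: s["agent"]` would give the equivalent of comparing by Agent, which is why only the dimension changes between options.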
You can also add a secondary Compare By to nest one dimension inside another — for example, group by Agent, then by Persona within each agent.
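Nesting can be thought of as grouping twice: first by the primary dimension, then by the secondary one inside each primary group. This is a rough sketch under assumed field names (`agent`, `persona`, `score`); the `nest` helper is hypothetical.

```python
from collections import defaultdict

# Hypothetical simulation results.
simulations = [
    {"agent": "Agent A", "persona": "Calm", "score": 90},
    {"agent": "Agent A", "persona": "Impatient", "score": 70},
    {"agent": "Agent B", "persona": "Calm", "score": 85},
    {"agent": "Agent B", "persona": "Impatient", "score": 80},
    {"agent": "Agent A", "persona": "Calm", "score": 94},
]


def nest(sims, primary, secondary):
    """Group scores by the primary key, then by the secondary key within it."""
    tree = defaultdict(lambda: defaultdict(list))
    for s in sims:
        tree[s[primary]][s[secondary]].append(s["score"])
    return {p: dict(inner) for p, inner in tree.items()}


nested = nest(simulations, "agent", "persona")
for agent, personas in nested.items():
    print(agent)
    for persona, scores in personas.items():
        print(f"  {persona}: {scores}")
```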
Reading the comparison
Row vs. grouped view — Row view (default) shows each simulation as its own row, color-coded by group. Grouped view collapses each group into one aggregated row; click a group to expand it.
Aggregation — In grouped view, choose how each group is summarized: Average, Median, P95, Min, or Max. P95 and Min are useful for understanding worst-case results.
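To make the five summaries concrete, the sketch below applies each one to a single group of values. The metric (a latency in milliseconds), the sample values, and the nearest-rank definition of P95 are assumptions chosen for illustration; the report may compute percentiles differently.

```python
import math
from statistics import mean, median


def p95(values):
    """95th percentile by the nearest-rank method (one of several definitions)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


AGGREGATIONS = {
    "Average": mean,
    "Median": median,
    "P95": p95,
    "Min": min,
    "Max": max,
}

# Hypothetical group: nine typical simulations and one slow outlier.
latencies_ms = [120, 130, 125, 135, 128, 122, 131, 127, 129, 900]

summary = {name: fn(latencies_ms) for name, fn in AGGREGATIONS.items()}
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the outlier pulls the Average well above the Median, and P95 and Max surface it directly. Which statistic represents the worst case depends on the metric's direction: for a lower-is-better metric like this one it is P95 or Max, while for a higher-is-better score it is Min.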
Focus on one metric — Click a metric card to filter the table to that metric; click All Metrics to return.
Saving, sharing, and deleting
Click Save Report to store the report’s runs and its view configuration; to rename the report, click the pencil icon. To share, open the report and click Share → Publish shareable link to generate a public URL that can be viewed without an account. Published reports show a Public badge, and Unpublish all revokes access. To remove a report, use the three-dot menu → Delete; this does not delete the underlying runs.
Links copied from the Reports list won’t open until the report is published.