The problem it solves
Every code change is tested before it ships; LLM outputs usually go to production with no systematic quality check. A prompt change, a model upgrade, or a configuration tweak can silently degrade output quality in ways that only surface in user complaints. Evals close that gap: you define what “good” looks like for your specific use case, then run a scored test suite against any model or prompt configuration before it ships.

Creating an eval set
In the dashboard, go to Evals and create an eval set:
- Name the set — e.g., “Customer Support Quality”, “Contract Extraction”
- Add items — select logged requests to include, or filter by feature/date range
- Define criteria — describe what a good response looks like for this use case
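As a mental model, an eval set is just a named collection of logged request/response pairs plus the grading criteria. A minimal sketch — the field names here are illustrative, not the gateway's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    request: str   # the original logged request
    response: str  # the original logged response

@dataclass
class EvalSet:
    name: str
    criteria: str  # what a "good" response looks like for this use case
    items: list[EvalItem] = field(default_factory=list)

eval_set = EvalSet(
    name="Customer Support Quality",
    criteria="Answers are accurate, polite, and cite the relevant policy.",
)
eval_set.items.append(EvalItem(
    request="How do I reset my password?",
    response="Go to Settings > Security and click 'Reset password'.",
))
```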
Running an eval
Select an eval set, choose the target model or prompt configuration, and run. The gateway:
- Sends each item to the target model
- Passes the original request, the original response, and the new response to an LLM judge
- Has the judge score each pair on your criteria (0–1) and return its reasoning
- Aggregates the results into a pass rate and score distribution
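The run loop above can be sketched as follows. `call_model` and `judge` are stand-ins for the gateway's internals (the real judge is an LLM call), not actual API names:

```python
def call_model(model: str, request: str) -> str:
    # Placeholder for sending the item to the target model.
    return f"[{model}] reply to: {request}"

def judge(criteria: str, request: str, old_response: str, new_response: str):
    # An LLM judge would return a 0-1 score plus reasoning;
    # a fixed score is used here purely for illustration.
    return 0.9, "New response meets the criteria."

def run_eval(items, target_model, criteria):
    results = []
    for item in items:
        new_response = call_model(target_model, item["request"])
        score, reasoning = judge(criteria, item["request"],
                                 item["response"], new_response)
        results.append({"score": score, "reasoning": reasoning})
    return results

results = run_eval(
    [{"request": "Reset my password?", "response": "Use Settings > Security."}],
    target_model="example-model",
    criteria="Accurate and polite.",
)
```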
Reading results
| Metric | Description |
|---|---|
| Pass rate | % of items above your threshold (default 0.8) |
| Average score | Mean quality score across all items |
| Score distribution | Histogram — tells you if failures are clustered or spread |
| Low-scoring items | The specific requests where the new model underperformed |
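A sketch of how the table's metrics could be computed from per-item scores; the 0.8 cutoff matches the documented default threshold, and the 10-bin histogram is an assumption for illustration:

```python
def summarize(scores, threshold=0.8):
    passed = [s for s in scores if s >= threshold]
    buckets = [0] * 10  # 10-bin score histogram over [0, 1]
    for s in scores:
        buckets[min(int(s * 10), 9)] += 1
    return {
        "pass_rate": len(passed) / len(scores),
        "average": sum(scores) / len(scores),
        "distribution": buckets,
        # indices of items to review by hand
        "low_scoring": [i for i, s in enumerate(scores) if s < threshold],
    }

summary = summarize([0.95, 0.85, 0.40, 0.90])
# pass_rate 0.75, average ~0.775, item index 2 flagged for review
```

The distribution matters because a 0.75 pass rate with failures clustered in one bucket usually points at a single failure mode, while spread-out failures suggest a broader regression.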
Using evals in your workflow
Treat eval runs the same way you treat test runs. Before merging a prompt change or model upgrade:
- Run the relevant eval set against the change
- Check the pass rate against your threshold
- Review low-scoring items
- Ship if it passes; iterate if it doesn’t
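The steps above amount to a merge gate. A minimal sketch, assuming results in the shape produced by an eval run; the 0.9 minimum pass rate is a hypothetical policy, not a gateway default:

```python
def gate(results, threshold=0.8, min_pass_rate=0.9):
    scores = [r["score"] for r in results]
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    # Low-scoring items are worth skimming even on a passing run.
    low = [r for r in results if r["score"] < threshold]
    return pass_rate >= min_pass_rate, low

ok, low_items = gate([{"score": 0.92}, {"score": 0.88}, {"score": 0.95}])
# ok is True and low_items is empty, so this change can ship
```

Wired into CI, a `False` result blocks the merge and `low_items` tells you exactly which requests to iterate on.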