Evals let you build a test suite from your real logged requests, define quality criteria, and run scored evaluations against any model — before deploying a change.

The problem it solves

Every code change is tested before shipping. LLM outputs go to production with no systematic quality check. A prompt change, a model upgrade, or a configuration tweak can silently degrade output quality in ways that only show up in user complaints. Evals close that gap. You define what “good” looks like for your specific use case, and run a scored test suite against any model or prompt configuration before it ships.

Creating an eval set

In the dashboard, go to Evals and create an eval set:
  1. Name the set — e.g., “Customer Support Quality”, “Contract Extraction”
  2. Add items — select logged requests to include, or filter by feature/date range
  3. Define criteria — describe what a good response looks like for this use case
Eval sets are built from your real production requests — not synthetic examples. This means your tests reflect the actual distribution of inputs your users send.
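The selection step above can be thought of as a filter over your request log. A minimal sketch in Python, assuming hypothetical shapes (`LoggedRequest`, `EvalSet`, and `build_eval_set` are illustrative names, not this gateway's schema or API):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LoggedRequest:
    # Hypothetical fields -- illustrative, not the gateway's actual log schema.
    feature: str
    logged_on: date
    prompt: str
    response: str

@dataclass
class EvalSet:
    name: str
    criteria: str  # plain-language description of what "good" looks like
    items: list = field(default_factory=list)

def build_eval_set(logs, name, criteria, feature=None, since=None):
    """Select logged requests by feature and/or date range into an eval set."""
    items = [
        r for r in logs
        if (feature is None or r.feature == feature)
        and (since is None or r.logged_on >= since)
    ]
    return EvalSet(name=name, criteria=criteria, items=items)

logs = [
    LoggedRequest("support", date(2024, 5, 1), "Where is my order?", "It ships Tuesday."),
    LoggedRequest("billing", date(2024, 5, 2), "Refund me", "Done."),
]
eval_set = build_eval_set(
    logs,
    name="Customer Support Quality",
    criteria="Answers are accurate, polite, and resolve the user's question.",
    feature="support",
)
print(len(eval_set.items))  # only the matching "support" request is selected
```

Because the items come straight from logs, the set stays representative of real traffic rather than hand-written test cases.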

Running an eval

Select an eval set, choose the target model or prompt configuration, and run. The gateway:
  1. Sends each item to the target model
  2. Passes the original request, the original response, and the new response to an LLM judge
  3. Collects the judge's score (0–1) and reasoning for each pair, measured against your criteria
  4. Aggregates the results into a pass rate and score distribution
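The run loop above can be sketched as follows. This is an illustration only: `run_eval`, `call_model`, and `judge` are hypothetical names you would supply, not functions exposed by the gateway.

```python
def run_eval(eval_set, call_model, judge, threshold=0.8):
    """Re-run each item against the target model and score it with an LLM judge.

    call_model(prompt) -> new response string
    judge(prompt, original, candidate) -> (score between 0 and 1, reasoning)
    """
    results = []
    for item in eval_set:
        new_response = call_model(item["prompt"])      # 1. send item to target model
        score, reasoning = judge(                      # 2-3. judge scores the pair
            prompt=item["prompt"],
            original=item["response"],
            candidate=new_response,
        )
        results.append({"item": item, "score": score, "reasoning": reasoning})
    passed = sum(1 for r in results if r["score"] >= threshold)
    return {"pass_rate": passed / len(results), "results": results}  # 4. aggregate
```

In practice the judge is itself an LLM call with your criteria embedded in its prompt; the threshold here mirrors the default 0.8 described below.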

Reading results

| Metric | Description |
| --- | --- |
| Pass rate | % of items above your threshold (default 0.8) |
| Average score | Mean quality score across all items |
| Score distribution | Histogram — tells you if failures are clustered or spread |
| Low-scoring items | The specific requests where the new model underperformed |
Low-scoring items are the most valuable output. Read through them to understand why the model failed — is it a specific input type, a prompt edge case, or a systematic quality difference?
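The metrics in the table reduce to simple arithmetic over the judge's scores. A minimal sketch (the `summarize` helper and its bucket count are illustrative choices, not the dashboard's implementation):

```python
def summarize(scores, threshold=0.8, buckets=5):
    """Compute pass rate, average score, histogram, and low-scoring item indices."""
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    average = sum(scores) / len(scores)
    hist = [0] * buckets  # equal-width buckets over [0, 1]
    for s in scores:
        hist[min(int(s * buckets), buckets - 1)] += 1
    low_items = [i for i, s in enumerate(scores) if s < threshold]
    return {
        "pass_rate": pass_rate,
        "average": average,
        "histogram": hist,
        "low_items": low_items,
    }

summary = summarize([0.95, 0.9, 0.6, 0.85])
```

A clustered histogram (most failures in one bucket) usually points at one failing input type; a flat spread suggests a systematic quality difference.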

Using evals in your workflow

Treat eval runs the same way you treat test runs. Before merging a prompt change or model upgrade:
  1. Run the relevant eval set against the change
  2. Check the pass rate against your threshold
  3. Review low-scoring items
  4. Ship if it passes; iterate if it doesn’t
The eval set grows over time as you add items from new edge cases or failures you find in production. The more representative the set, the more confident you can be in the results.
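Wired into CI, the four steps above become a gate that blocks a merge on a failing run. A hypothetical sketch (the `gate` function and exit-code convention are illustrative; the gateway's own CI integration, if any, may differ):

```python
import sys

def gate(pass_rate, threshold=0.8):
    """Return a process exit code: 0 to let the merge proceed, 1 to block it."""
    if pass_rate < threshold:
        print(f"Eval failed: pass rate {pass_rate:.0%} below threshold {threshold:.0%}")
        print("Review low-scoring items before shipping.")
        return 1
    print(f"Eval passed: pass rate {pass_rate:.0%}")
    return 0

if __name__ == "__main__":
    # e.g. `python eval_gate.py 0.92` after an eval run reports its pass rate
    sys.exit(gate(float(sys.argv[1])))
```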