The problem it solves
Every code change is tested before it ships; LLM outputs usually go to production with no systematic quality check. A prompt change, a model upgrade, or a configuration tweak can silently degrade output quality in ways that only surface in user complaints. Evals close that gap: you define what “good” looks like for your specific use case, then run a scored test suite against any model or prompt configuration before it ships.

Creating an eval set
In the dashboard, go to Evals and create an eval set:
- Name the set — e.g., “Customer Support Quality”, “Contract Extraction”
- Add items — select logged requests to include, or filter by feature/date range
- Define criteria — describe what a good response looks like for this use case
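As a mental model, an eval set is just a named collection of logged request/response pairs plus the grading criteria. A minimal sketch — the field names here are illustrative, not the gateway's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    request: str   # the original logged request
    response: str  # the original logged response

@dataclass
class EvalSet:
    name: str
    criteria: str  # what a "good" response looks like for this use case
    items: list[EvalItem] = field(default_factory=list)

eval_set = EvalSet(
    name="Customer Support Quality",
    criteria="Answers are accurate, polite, and cite the relevant policy.",
)
eval_set.items.append(EvalItem(
    request="How do I reset my password?",
    response="Go to Settings > Security and click 'Reset password'.",
))
```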
Running an eval
Select an eval set, choose the target model or prompt configuration, and run. The gateway:
- Sends each item to the target model
- Passes the original request, the original response, and the new response to an LLM judge
- Has the judge score each pair on your criteria (0–1) and return its reasoning
- Aggregates the results into a pass rate and score distribution
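The run loop above can be sketched as follows. `call_model` and `judge` are stand-ins for the gateway's internals (the real judge is an LLM call), not actual API names:

```python
def call_model(model: str, request: str) -> str:
    # Placeholder for sending the item to the target model.
    return f"[{model}] reply to: {request}"

def judge(criteria: str, request: str, old_response: str, new_response: str):
    # An LLM judge would return a 0-1 score plus reasoning;
    # a fixed score is used here purely for illustration.
    return 0.9, "New response meets the criteria."

def run_eval(items, target_model, criteria):
    results = []
    for item in items:
        new_response = call_model(target_model, item["request"])
        score, reasoning = judge(criteria, item["request"],
                                 item["response"], new_response)
        results.append({"score": score, "reasoning": reasoning})
    return results

results = run_eval(
    [{"request": "Reset my password?", "response": "Use Settings > Security."}],
    target_model="example-model",
    criteria="Accurate and polite.",
)
```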
Reading results
| Metric | Description |
|---|---|
| Pass rate | % of items above your threshold (default 0.8) |
| Average score | Mean quality score across all items |
| Score distribution | Histogram — tells you if failures are clustered or spread |
| Low-scoring items | The specific requests where the new model underperformed |
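A sketch of how the table's metrics could be computed from per-item scores; the 0.8 cutoff matches the documented default threshold, and the 10-bin histogram is an assumption for illustration:

```python
def summarize(scores, threshold=0.8):
    passed = [s for s in scores if s >= threshold]
    buckets = [0] * 10  # 10-bin score histogram over [0, 1]
    for s in scores:
        buckets[min(int(s * 10), 9)] += 1
    return {
        "pass_rate": len(passed) / len(scores),
        "average": sum(scores) / len(scores),
        "distribution": buckets,
        # indices of items to review by hand
        "low_scoring": [i for i, s in enumerate(scores) if s < threshold],
    }

summary = summarize([0.95, 0.85, 0.40, 0.90])
# pass_rate 0.75, average ~0.775, item index 2 flagged for review
```

The distribution matters because a 0.75 pass rate with failures clustered in one bucket usually points at a single failure mode, while spread-out failures suggest a broader regression.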
Using evals in your workflow
Treat eval runs the same way you treat test runs. Before merging a prompt change or model upgrade:
- Run the relevant eval set against the change
- Check the pass rate against your threshold
- Review low-scoring items
- Ship if it passes; iterate if it doesn’t
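The steps above amount to a merge gate. A minimal sketch, assuming results in the shape produced by an eval run; the 0.9 minimum pass rate is a hypothetical policy, not a gateway default:

```python
def gate(results, threshold=0.8, min_pass_rate=0.9):
    scores = [r["score"] for r in results]
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    # Low-scoring items are worth skimming even on a passing run.
    low = [r for r in results if r["score"] < threshold]
    return pass_rate >= min_pass_rate, low

ok, low_items = gate([{"score": 0.92}, {"score": 0.88}, {"score": 0.95}])
# ok is True and low_items is empty, so this change can ship
```

Wired into CI, a `False` result blocks the merge and `low_items` tells you exactly which requests to iterate on.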