Replay lets you run a set of real production requests against a different model and compare the results side by side — cost, latency, and output quality — before committing to a switch.

The problem it solves

A new model looks good in the playground. Benchmarks look promising. But you have no idea how it performs on your production traffic — the actual prompts, system prompts, conversation histories, and edge cases your users send. Replay runs your real traffic against the candidate model. You get actual numbers on your workload, not synthetic benchmarks.

Creating a replay run

In the dashboard, go to Replay and create a new run:
  1. Select a source — choose an API key, date range, and optionally filter by feature or metadata
  2. Choose the target model — the model you want to test
  3. Configure the judge — optional LLM judge for automated quality scoring
  4. Run — the gateway re-executes each selected request against the target model
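The run configuration above lives in the dashboard, but as a rough sketch it can be thought of as a small structure. Everything below (type name, field names) is illustrative, not a documented API:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ReplayRunConfig:
    """Illustrative shape of a replay run; all field names are hypothetical."""
    source_api_key: str                    # whose production traffic to sample
    date_from: date                        # start of the source date range
    date_to: date                          # end of the source date range
    target_model: str                      # candidate model to test
    feature_filter: Optional[str] = None   # optional feature/metadata filter
    judge_model: Optional[str] = None      # set to enable automated scoring

config = ReplayRunConfig(
    source_api_key="key_prod",
    date_from=date(2024, 5, 1),
    date_to=date(2024, 5, 7),
    target_model="gpt-4o-mini",
    feature_filter="summarize",
    judge_model="gpt-4o",
)
```

The filter matters: replaying only one feature's traffic keeps the sample representative of the workload you actually intend to switch.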

Reading results

For each replayed request you see:
                  Original      Replay
  Model           gpt-4o        gpt-4o-mini
  Input tokens    1,240         1,240
  Output tokens   384           312
  Cost            $0.0142       $0.0008
  Latency         2,100 ms      890 ms
  Quality score   n/a           0.92
The quality score is produced by an LLM judge that compares the original and replay responses and returns a 0–1 equivalence score plus reasoning.
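The judge's exact prompt is internal to the gateway, but the mechanics are simple: the judge receives the request plus both responses and returns a score with reasoning. A minimal sketch of the reply-handling side, assuming the judge is asked for JSON (the prompt and reply format here are assumptions, not the product's actual contract):

```python
import json

def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Parse a judge reply like {"score": 0.92, "reasoning": "..."}.

    Clamps the score into [0, 1] so a misbehaving judge can't skew aggregates.
    """
    data = json.loads(reply)
    score = min(1.0, max(0.0, float(data["score"])))
    return score, data["reasoning"]

# What a judge reply might look like after the LLM call:
score, why = parse_judge_reply(
    '{"score": 0.92, "reasoning": "Same facts, terser wording."}'
)
```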

Interpreting quality scores

A score of 0.9+ generally means the cheaper model produces outputs that are functionally equivalent for your use case. A score of 0.7–0.9 means similar outputs with some degradation — review the low-scoring requests manually to understand where the gaps are. Low scores on specific request types often reveal where the cheaper model struggles (complex reasoning, long context, specific formatting). That tells you whether to switch fully, switch for a subset of traffic, or not switch at all.
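The thresholds above translate directly into a per-request triage. A minimal sketch (the bucket names are ours, the cutoffs are the ones stated above):

```python
def triage(score: float) -> str:
    """Map a judge score to a review action using the 0.9 / 0.7 cutoffs."""
    if score >= 0.9:
        return "equivalent"   # cheaper model is functionally equivalent here
    if score >= 0.7:
        return "review"       # some degradation; inspect this pair manually
    return "failing"          # candidate model struggles on this request type

buckets = [triage(s) for s in [0.95, 0.82, 0.41]]
# → ["equivalent", "review", "failing"]
```

Grouping the "failing" bucket by feature or request type is what surfaces the pattern (complex reasoning, long context, formatting) rather than a single bad sample.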

The decision framework

  1. Run replay on a representative sample (500–1,000 requests is usually enough)
  2. Check aggregate cost and latency savings
  3. Review the quality score distribution
  4. Read through 10–20 low-scoring pairs manually
  5. If the failure modes are acceptable for your use case, switch
The goal isn’t a perfect score — it’s understanding where the model differs and deciding if those differences matter for your product.
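Steps 2–4 of the framework amount to simple aggregation over the replayed pairs. A sketch, assuming each pair carries cost, latency, and score fields (the key names are illustrative):

```python
from statistics import mean

def summarize_run(pairs: list[dict]) -> tuple[float, float, list[dict]]:
    """Aggregate a replay run.

    Returns (fractional cost saving, fractional latency saving, worst pairs).
    Each pair dict uses illustrative keys: orig_cost, replay_cost,
    orig_ms, replay_ms, score.
    """
    cost_saving = 1 - (sum(p["replay_cost"] for p in pairs)
                       / sum(p["orig_cost"] for p in pairs))
    latency_saving = 1 - (mean(p["replay_ms"] for p in pairs)
                          / mean(p["orig_ms"] for p in pairs))
    # The 10-20 low-scoring pairs to read through by hand:
    worst = sorted(pairs, key=lambda p: p["score"])[:20]
    return cost_saving, latency_saving, worst

pairs = [
    {"orig_cost": 0.0142, "replay_cost": 0.0008,
     "orig_ms": 2100, "replay_ms": 890, "score": 0.92},
    {"orig_cost": 0.0100, "replay_cost": 0.0010,
     "orig_ms": 1800, "replay_ms": 950, "score": 0.64},
]
cost_saving, latency_saving, worst = summarize_run(pairs)
```

With a representative 500–1,000-request sample, these three numbers plus a manual read of `worst` are enough to make the switch/partial-switch/no-switch call.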