Overview

During LLM application development, you’ll frequently iterate on your system to achieve optimal performance. Whether you’re tweaking model parameters, testing different model versions, or refining prompts, each change affects your system’s output quality. However, with multiple iterations, it becomes challenging to accurately quantify and compare the effectiveness of your changes. Scorecard’s A/B Comparison feature solves this by providing side-by-side run comparisons, giving you clear, data-driven insights into your improvements.
Requirement: Only runs that use the same testset can be compared with each other.

Why Use A/B Comparisons?

Data-Driven Decisions

Move beyond gut feelings with quantitative metrics that show exactly how changes impact performance.

Experiment Tracking

Easily compare different system configurations to identify the best-performing setup.

Continuous Improvement

Validate that iterative changes actually improve your system’s capabilities.

Production Confidence

Ensure changes to production systems, models, and prompts deliver better results.

How to Compare Runs

There are two ways to set up A/B comparisons in Scorecard. The most direct is to start a comparison from a specific run’s results page:

1. Navigate to Run Results

Go to the results page of the run you want to use as the baseline for your comparison.
[Screenshot: run results page showing performance metrics]

2. Add Comparison

Click the “Add Comparison” button to open the comparison selector modal.
[Screenshot: modal for selecting a run to compare against]

3. Select Comparison Run

Choose the run you want to compare against from the available options. Only runs using the same testset will be available for selection.
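
The comparison view assumes you already have at least two runs recorded against the same testset. If you execute the model yourself before logging runs, a minimal sketch of producing such a pair might look like the following; the testset structure, IDs, model names, and output format here are illustrative assumptions for the sketch, not Scorecard’s API:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative testset: in Scorecard this would be the shared testset that
# both runs reference. The structure and IDs here are assumptions.
TESTSET_ID = "ts_support_questions_v1"
TESTSET = [
    {"id": 1, "user_query": "How do I reset my password?"},
    {"id": 2, "user_query": "Which plans include SSO?"},
]

def execute_run(run_name: str, model: str, system_prompt: str) -> None:
    """Execute every test case in the shared testset and save the outputs."""
    results = []
    for case in TESTSET:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["user_query"]},
            ],
        )
        results.append({
            "testcase_id": case["id"],
            "output": response.choices[0].message.content,
        })
    # Both runs record the same testset ID, which is what keeps them comparable.
    with open(f"{run_name}.json", "w") as f:
        json.dump(
            {"testset_id": TESTSET_ID, "run": run_name, "results": results},
            f,
            indent=2,
        )

# Baseline and candidate runs executed over the identical testset.
execute_run("baseline", model="gpt-4o", system_prompt="You are a concise support agent.")
execute_run("candidate", model="gpt-4o-mini", system_prompt="You are a concise support agent.")
```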

Analyzing Comparison Results

Once you’ve set up your A/B comparison, Scorecard displays the results in an intuitive side-by-side format:
[Screenshot: side-by-side comparison showing metric performance differences]

What You’ll See

  • Aggregated Metrics: View performance scores for both runs across all your configured metrics
  • Side-by-Side Charts: Visual representations make it easy to spot performance differences
  • Statistical Significance: Understand whether observed differences are meaningful (one way to sanity-check this yourself is sketched below)
  • Detailed Breakdowns: Drill down into specific test cases to understand where improvements occurred
Look for consistent patterns across multiple metrics. A truly better system should show improvements across most or all of your evaluation criteria.
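
Scorecard surfaces significance in the comparison view itself. If you export per-test-case scores for a metric, you can also sanity-check a difference with a paired test. A minimal sketch, assuming two score lists aligned by test case (the numbers below are made up for illustration):

```python
from scipy import stats

# Per-test-case scores for one metric, aligned by test case ID.
# These values are illustrative, not real Scorecard output.
baseline_scores = [0.72, 0.81, 0.64, 0.90, 0.58, 0.77, 0.83, 0.69]
candidate_scores = [0.78, 0.85, 0.71, 0.88, 0.66, 0.80, 0.86, 0.74]

# Each test case is scored by both runs, so the samples are paired:
# a paired t-test is more appropriate than an independent one.
t_stat, p_value = stats.ttest_rel(candidate_scores, baseline_scores)

mean_delta = sum(c - b for c, b in zip(candidate_scores, baseline_scores)) / len(baseline_scores)
print(f"mean improvement: {mean_delta:+.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")

# A small p-value (e.g. < 0.05) suggests the difference is unlikely to be
# noise, but with only a handful of test cases treat it as a rough signal.
```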

Best Practices

Common Use Cases

Model Version Testing

Compare performance between different model versions (e.g., GPT-4 vs GPT-4 Turbo) to understand trade-offs between cost, speed, and quality.

Prompt Engineering

Test different prompt formulations to find the most effective way to communicate instructions to your model.

Parameter Tuning

Evaluate the impact of temperature, top-p, and other model parameters on output quality and consistency.
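
For example, you can hold the prompt and model fixed and record one run per sampling configuration, then compare the runs in Scorecard. A sketch using the OpenAI Python client; the model name and parameter values are placeholders, and in practice you would loop over your full testset rather than a single prompt:

```python
from openai import OpenAI

client = OpenAI()

# Candidate sampling configurations to compare as separate runs.
# The specific values are illustrative starting points, not recommendations.
configs = [
    {"name": "conservative", "temperature": 0.2, "top_p": 1.0},
    {"name": "creative", "temperature": 0.9, "top_p": 0.95},
]

prompt = "Summarize the customer's issue in one sentence: 'My invoice shows two charges for March.'"

for cfg in configs:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
    )
    # In practice, repeat this over the whole testset and record each
    # configuration as its own run for A/B comparison.
    print(cfg["name"], "->", response.choices[0].message.content)
```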

System Architecture Changes

Compare different approaches to your LLM pipeline, such as RAG implementations, context window usage, or post-processing steps.
Remember that A/B comparisons are only as good as your metrics and testset. Ensure your evaluation criteria accurately reflect real-world performance requirements.