Overview
During LLM application development, you’ll frequently iterate on your system to achieve optimal performance. Whether you’re tweaking model parameters, testing different model versions, or refining prompts, each change affects your system’s output quality. However, with multiple iterations, it becomes challenging to accurately quantify and compare the effectiveness of your changes. Scorecard’s A/B Comparison feature solves this by providing side-by-side run comparisons, giving you clear, data-driven insights into your improvements.

Requirements: Only runs that use the same testset can be compared with each other.
Why Use A/B Comparisons?
Data-Driven Decisions
Move beyond gut feelings with quantitative metrics that show exactly how changes impact performance.
Experiment Tracking
Easily compare different system configurations to identify the best-performing setup.
Continuous Improvement
Validate that iterative changes actually improve your system’s capabilities.
Production Confidence
Ensure changes to production systems, models, and prompts deliver better results.
How to Compare Runs
There are two ways to set up A/B comparisons in Scorecard. The steps below walk through the most direct approach: starting a comparison from a specific run’s results page.
1. Navigate to Run Results
Go to the results page of the run you want to use as the baseline for your comparison.
(Screenshot: Run results page showing performance metrics)
2. Add Comparison
Click the “Add Comparison” button to open the comparison selector modal.
(Screenshot: Modal for selecting a run to compare against)
3. Select Comparison Run
Choose the run you want to compare against from the available options. Only runs using the same testset will be available for selection.
Analyzing Comparison Results
Once you’ve set up your A/B comparison, Scorecard displays the results in an intuitive side-by-side format:

(Screenshot: Side-by-side comparison showing metric performance differences)
What You’ll See
- Aggregated Metrics: View performance scores for both runs across all your configured metrics
- Side-by-Side Charts: Visual representations make it easy to spot performance differences
- Statistical Significance: Understand whether observed differences are meaningful
- Detailed Breakdowns: Drill down into specific test cases to understand where improvements occurred
Look for consistent patterns across multiple metrics. A truly better system should show improvements across most or all of your evaluation criteria.
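If you export per-test-case scores for a metric from both runs, you can also sanity-check significance yourself. The sketch below is illustrative only: it uses SciPy’s paired t-test, and the score lists are hypothetical placeholders standing in for exported results, not a Scorecard API.

```python
# Illustrative only: checks whether the per-test-case score difference between
# two runs on the same testset is statistically significant.
# The score lists are hypothetical placeholders for exported per-case scores.
from scipy import stats

baseline_scores = [0.72, 0.65, 0.80, 0.58, 0.91, 0.77, 0.69, 0.84]   # run A
candidate_scores = [0.78, 0.70, 0.79, 0.66, 0.93, 0.81, 0.75, 0.88]  # run B

# A paired t-test is appropriate because both runs score the exact same test cases.
t_stat, p_value = stats.ttest_rel(candidate_scores, baseline_scores)

mean_delta = sum(c - b for c, b in zip(candidate_scores, baseline_scores)) / len(baseline_scores)
print(f"mean improvement: {mean_delta:+.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is unlikely to be due to chance at the 5% level.")
else:
    print("Difference could plausibly be noise; consider a larger testset.")
```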
Best Practices
Use Comprehensive Metrics
Include multiple metrics that cover different aspects of your system (accuracy, relevance, safety, etc.) to get a complete picture of performance changes.
Test with Sufficient Data
Ensure your testset has enough examples to make statistically significant comparisons. Small testsets may lead to misleading conclusions.
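A rough power calculation can make “enough examples” concrete. The sketch below is a back-of-the-envelope estimate using statsmodels; the effect size is an assumption you supply based on how large an improvement you expect to detect.

```python
# Illustrative only: rough estimate of how many test cases a paired comparison
# needs before it can reliably detect a given improvement.
# effect_size is an assumption (Cohen's d of the per-case score deltas).
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
n_cases = analysis.solve_power(
    effect_size=0.3,        # assumed small-to-medium improvement
    alpha=0.05,             # significance level
    power=0.8,              # desired probability of detecting a real improvement
    alternative="two-sided",
)
print(f"Roughly {n_cases:.0f} test cases needed to detect this effect size.")
```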
Document Your Changes
Keep track of what specific changes you made between runs so you can understand which modifications led to improvements.
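One lightweight way to do this is to log the configuration behind each run as you create it. The sketch below simply appends a JSON record to a local file; every field name and value is a hypothetical placeholder, not part of Scorecard’s API.

```python
# Illustrative only: keep a local record of what changed between runs so that
# comparison results can be traced back to specific modifications.
# Run IDs, field names, and values are hypothetical placeholders.
import json
from datetime import datetime, timezone

run_record = {
    "run_id": "run_2024_baseline",           # whatever ID your run was given
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "gpt-4-turbo",
    "temperature": 0.2,
    "prompt_version": "v3-shorter-system-prompt",
    "notes": "Trimmed system prompt; added explicit citation instruction.",
}

with open("run_records.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")
```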
Common Use Cases
Model Version Testing
Compare performance between different model versions (e.g., GPT-4 vs. GPT-4 Turbo) to understand trade-offs between cost, speed, and quality.
Prompt Engineering
Test different prompt formulations to find the most effective way to communicate instructions to your model.
Parameter Tuning
Evaluate the impact of temperature, top-p, and other model parameters on output quality and consistency (a minimal sketch of setting up such a comparison appears after this list).
System Architecture Changes
Compare different approaches to your LLM pipeline, such as RAG implementations, context window usage, or post-processing steps.

Remember that A/B comparisons are only as good as your metrics and testset. Ensure your evaluation criteria accurately reflect real-world performance requirements.
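As a concrete illustration of the parameter-tuning case above, the sketch below generates outputs for the same test cases under two temperature settings so the two resulting runs can be scored and compared. It uses the OpenAI Python client purely as an example provider; the in-memory testset and helper function are hypothetical, not Scorecard objects.

```python
# Illustrative only: produce outputs for the same test cases under two
# parameter settings, so the two resulting runs can be A/B compared.
# The testset and generate_run helper are hypothetical, not Scorecard APIs.
from openai import OpenAI

client = OpenAI()
testset = [
    {"id": 1, "prompt": "Summarize the refund policy in two sentences."},
    {"id": 2, "prompt": "Explain how to reset a forgotten password."},
]

def generate_run(temperature: float) -> dict:
    """Generate one output per test case at the given temperature."""
    outputs = {}
    for case in testset:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=temperature,
        )
        outputs[case["id"]] = response.choices[0].message.content
    return outputs

run_a = generate_run(temperature=0.2)  # conservative setting
run_b = generate_run(temperature=0.9)  # more creative setting
# Score both runs with your metrics, then compare them side by side.
```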