# A/B Comparison

> Compare different AI agent runs side-by-side to make data-driven decisions about model improvements, prompt optimizations, and configuration changes.

## Overview

During AI agent development, you'll frequently iterate on your agent to achieve optimal performance. Whether you're tweaking model parameters, testing different model versions, or refining prompts, each change affects your agent's output quality.

However, with multiple iterations, it becomes challenging to accurately quantify and compare the effectiveness of your changes. **Scorecard's A/B Comparison feature** solves this by providing side-by-side run comparisons, giving you clear, data-driven insights into your improvements.

**Requirements:** Only runs using the same Testset can be compared with each other.

## Why Use A/B Comparisons?

* Move beyond gut feelings with quantitative metrics that show exactly how changes impact performance.
* Easily compare different agent configurations to identify the best-performing setup.
* Validate that iterative changes actually improve your agent's capabilities.
* Ensure changes to production agents, models, and prompts deliver better results.

## How to Compare Runs

There are two ways to set up A/B comparisons in Scorecard.

Start a comparison directly from a specific run's results page:

1. Go to the results page of the run you want to use as your baseline comparison.
2. Click the **"Add Comparison"** button to open the comparison selector modal.
3. Choose the run you want to compare against from the available options. Only runs using the same testset will be available for selection.

Compare multiple runs directly from the runs overview page:

1. Go to your project's runs list page where you can see all available runs.
2. Use the checkboxes to select two runs that you want to compare side-by-side.
3. Click the comparison button to view the selected runs side-by-side.
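The same-testset requirement above can be illustrated with a small offline sketch. The `Run` record, its field names, and the `comparable_runs` helper are hypothetical and exist only for this example; they are not part of the Scorecard SDK or API. The sketch assumes you keep a local list of run IDs and the testset each run used, and it mirrors the filtering the comparison selector is described as applying.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical local record of a run (not a Scorecard SDK type)."""
    run_id: str
    testset_id: str


def comparable_runs(baseline: Run, candidates: list[Run]) -> list[Run]:
    """Return the candidates that used the same testset as the baseline.

    The baseline itself is excluded, since comparing a run to itself
    carries no information.
    """
    return [
        run
        for run in candidates
        if run.testset_id == baseline.testset_id and run.run_id != baseline.run_id
    ]


# Illustrative data only: two runs on testset "ts-a", one on "ts-b".
runs = [Run("run-1", "ts-a"), Run("run-2", "ts-a"), Run("run-3", "ts-b")]
print([run.run_id for run in comparable_runs(runs[0], runs)])  # ['run-2']
```

Filtering up front like this makes it obvious why a given run does not show up as a comparison option: it was executed against a different testset.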
## Analyzing Comparison Results

Once you've set up your A/B comparison, Scorecard displays the results in an intuitive side-by-side format:

### What You'll See

* **Aggregated Metrics**: View performance scores for both runs across all your configured metrics
* **Side-by-Side Charts**: Visual representations make it easy to spot performance differences
* **Statistical Significance**: Understand whether observed differences are meaningful
* **Detailed Breakdowns**: Drill down into specific test cases to understand where improvements occurred

Look for consistent patterns across multiple metrics. A truly better system should show improvements across most or all of your evaluation criteria.

## Best Practices

* Include multiple metrics that cover different aspects of your system (accuracy, relevance, safety, etc.) to get a complete picture of performance changes.
* Ensure your testset has enough examples to make statistically significant comparisons. Small testsets may lead to misleading conclusions.
* Keep track of what specific changes you made between runs so you can understand which modifications led to improvements.

## Common Use Cases

### Model Version Testing

Compare performance between different model versions (e.g., GPT-4 vs GPT-4 Turbo) to understand trade-offs between cost, speed, and quality.

### Prompt Engineering

Test different prompt formulations to find the most effective way to communicate instructions to your model.

### Parameter Tuning

Evaluate the impact of temperature, top-p, and other model parameters on output quality and consistency.

### System Architecture Changes

Compare different approaches to your LLM pipeline, such as RAG implementations, context window usage, or post-processing steps.

Remember that A/B comparisons are only as good as your metrics and testset. Ensure your evaluation criteria accurately reflect real-world performance requirements.
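To make the advice about statistical significance and testset size more concrete, here is a minimal sketch of one way to check offline whether a score difference between two runs could be noise. It is an assumption-laden illustration, not a description of how Scorecard computes significance, which this page does not specify. It assumes you have exported one numeric score per test case for each run, in the same test case order; the function name `paired_permutation_p_value` and the sample scores are hypothetical. The technique is a paired sign-flip permutation test using only the Python standard library.

```python
import random
from statistics import mean


def paired_permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on per-test-case scores.

    Under the null hypothesis that both runs are equally good, the sign of
    each per-case difference is arbitrary, so we randomly flip signs and
    count how often the total difference is at least as extreme as observed.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must cover the same, non-empty set of test cases")

    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-test-case scores for a baseline run and a candidate run.
baseline = [0.62, 0.70, 0.55, 0.81, 0.66, 0.59, 0.73, 0.68]
candidate = [0.71, 0.74, 0.60, 0.80, 0.75, 0.66, 0.79, 0.70]

print("mean delta:", round(mean(candidate) - mean(baseline), 3))
print("p-value:", paired_permutation_p_value(baseline, candidate))
```

A paired test fits here because both runs are evaluated on the same testset, so each test case acts as its own control. With only a handful of test cases the p-value can never get very small, which is one practical reason small testsets lead to inconclusive comparisons.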