# A/B Comparison

> Compare different AI agent runs side-by-side to make data-driven decisions about model improvements, prompt optimizations, and configuration changes.

export const DarkLightImage = ({lightSrc, caption, alt, darkSrc = null, width = "1000"}) => {
  const getAbsoluteUrl = src => {
    if (src.startsWith('http://') || src.startsWith('https://')) {
      return src;
    }
    const currentUrl = typeof window !== 'undefined' ? window.location.origin : '';
    if (currentUrl.includes('.mintlify.app')) {
      const subdomain = currentUrl.split('.')[0].replace('https://', '');
      return `https://mintlify.s3.us-west-1.amazonaws.com/${subdomain}${src.startsWith('/') ? '' : '/'}${src}`;
    } else if (currentUrl === 'https://docs.scorecard.io') {
      return `https://mintlify.s3.us-west-1.amazonaws.com/scorecard-d65b5e8a${src.startsWith('/') ? '' : '/'}${src}`;
    } else {
      return `${currentUrl}${src.startsWith('/') ? '' : '/'}${src}`;
    }
  };
  const content = <>
      <img className="block dark:hidden" width={width} src={getAbsoluteUrl(lightSrc)} alt={alt} />
      <img className="hidden dark:block" width={width} src={getAbsoluteUrl(darkSrc || lightSrc.replace('light', 'dark'))} alt={alt} />
    </>;
  if (caption) {
    return <Frame caption={caption}>{content}</Frame>;
  } else {
    return content;
  }
};

## Overview

During AI agent development, you'll frequently iterate on your agent to achieve optimal performance. Whether you're tweaking model parameters, testing different model versions, or refining prompts, each change affects your agent's output quality.

As iterations accumulate, it becomes hard to quantify which changes actually helped and by how much. **Scorecard's A/B Comparison feature** addresses this with side-by-side run comparisons, so you can see how each change affected your metrics.

<Note>
  **Requirement:** You can only compare runs that used the same testset.
</Note>
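
The same-testset requirement can be checked ahead of time if you keep your own records of runs. The following is a minimal sketch in plain Python, not the Scorecard SDK: the `Run` class and its `run_id` and `testset_id` fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical local record of a run; not a Scorecard SDK type."""
    run_id: str
    testset_id: str


def can_compare(baseline: Run, candidate: Run) -> bool:
    """Two runs are comparable only if they were executed on the same testset."""
    return baseline.testset_id == candidate.testset_id


def comparable_runs(baseline: Run, runs: list[Run]) -> list[Run]:
    """Return the runs that could serve as comparison targets for `baseline`."""
    return [
        r for r in runs
        if r.run_id != baseline.run_id and can_compare(baseline, r)
    ]


# Illustrative data: only run-102 shares a testset with the baseline.
baseline = Run("run-101", "testset-7")
others = [Run("run-102", "testset-7"), Run("run-103", "testset-9")]
eligible = comparable_runs(baseline, others)
print([r.run_id for r in eligible])
```

Filtering by testset up front mirrors what the comparison selector does when it only offers runs from the same testset.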

## Why Use A/B Comparisons?

<CardGroup cols={2}>
  <Card title="Data-Driven Decisions" icon="chart-line">
    Move beyond gut feelings with quantitative metrics that show exactly how changes impact performance.
  </Card>

  <Card title="Experiment Tracking" icon="flask-conical">
    Easily compare different agent configurations to identify the best-performing setup.
  </Card>

  <Card title="Continuous Improvement" icon="arrow-up-right">
    Validate that iterative changes actually improve your agent's capabilities.
  </Card>

  <Card title="Production Confidence" icon="shield-check">
    Ensure changes to production agents, models, and prompts deliver better results.
  </Card>
</CardGroup>

## How to Compare Runs

There are two ways to set up A/B comparisons in Scorecard:

<Tabs>
  <Tab title="From Run Details">
    Start a comparison directly from a specific run's results page.

    <Steps>
      <Step title="Navigate to Run Results">
        Go to the results page of the run you want to use as your baseline.

        <DarkLightImage lightSrc="/images/run-details-light.png" darkSrc="/images/run-details-dark.png" caption="Run results page showing performance metrics" alt="Screenshot showing the detailed results view of a run with metrics and performance data" />
      </Step>

      <Step title="Add Comparison">
        Click the **"Add Comparison"** button to open the comparison selector modal.

        <DarkLightImage lightSrc="/images/a-b/ab-compare-modal-light.png" darkSrc="/images/a-b/ab-compare-modal-dark.png" caption="Modal for selecting a run to compare against" alt="Screenshot of the comparison modal showing available runs to select for A/B testing" />
      </Step>

      <Step title="Select Comparison Run">
        Choose the run you want to compare against from the available options. Only runs using the same testset will be available for selection.
      </Step>
    </Steps>
  </Tab>

  <Tab title="From Runs List">
    Compare two runs directly from the runs list page.

    <Steps>
      <Step title="Navigate to Runs List">
        Go to your project's runs list page where you can see all available runs.
      </Step>

      <Step title="Select Runs to Compare">
        Use the checkboxes to select two runs that you want to compare side-by-side.

        <DarkLightImage lightSrc="/images/a-b/ab-list-light.png" darkSrc="/images/a-b/ab-list-dark.png" caption="Runs list with selection checkboxes for A/B comparison" alt="Screenshot of the runs list page showing multiple runs with selection checkboxes for comparison" />
      </Step>

      <Step title="Start Comparison">
        Click the comparison button to view the selected runs side-by-side.
      </Step>
    </Steps>
  </Tab>
</Tabs>

## Analyzing Comparison Results

Once you've set up your A/B comparison, Scorecard displays the results in an intuitive side-by-side format:

<DarkLightImage lightSrc="/images/a-b/ab-light.png" darkSrc="/images/a-b/ab-dark.png" caption="Side-by-side comparison showing metric performance differences" alt="Screenshot of A/B comparison results showing two runs with their respective metrics displayed side-by-side" />

### What You'll See

* **Aggregated Metrics**: View performance scores for both runs across all your configured metrics
* **Side-by-Side Charts**: Visual representations make it easy to spot performance differences
* **Statistical Significance**: Understand whether observed differences are meaningful
* **Detailed Breakdowns**: Drill down into specific test cases to understand where improvements occurred

<Tip>
  Look for consistent patterns across multiple metrics. A truly better system should show improvements across most or all of your evaluation criteria.
</Tip>
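
To make the idea of aggregated metrics and per-test-case breakdowns concrete, here is a small sketch that assumes you have exported scores for two runs into plain dictionaries. The data layout, metric names, and function names are assumptions for illustration and do not reflect how Scorecard stores or computes results.

```python
from statistics import mean

# Hypothetical scores keyed by test case ID, then by metric name.
run_a = {
    "tc-1": {"accuracy": 0.80, "relevance": 0.70},
    "tc-2": {"accuracy": 0.60, "relevance": 0.90},
    "tc-3": {"accuracy": 0.70, "relevance": 0.50},
}
run_b = {
    "tc-1": {"accuracy": 0.90, "relevance": 0.70},
    "tc-2": {"accuracy": 0.70, "relevance": 0.80},
    "tc-3": {"accuracy": 0.80, "relevance": 0.90},
}


def aggregate(run):
    """Mean score per metric across all test cases in a run."""
    metrics = sorted({m for scores in run.values() for m in scores})
    return {
        m: mean(scores[m] for scores in run.values() if m in scores)
        for m in metrics
    }


def metric_deltas(a, b):
    """Aggregate change per metric (run B minus run A)."""
    agg_a, agg_b = aggregate(a), aggregate(b)
    return {m: agg_b[m] - agg_a[m] for m in agg_a if m in agg_b}


def per_case_deltas(a, b, metric):
    """Change in one metric for each test case present in both runs."""
    shared = sorted(set(a) & set(b))
    return {tc: b[tc][metric] - a[tc][metric] for tc in shared}


deltas = metric_deltas(run_a, run_b)
improved = sorted(m for m, d in deltas.items() if d > 0)
relevance_by_case = per_case_deltas(run_a, run_b, "relevance")
regressed_cases = [tc for tc, d in relevance_by_case.items() if d < 0]

print("Improved metrics:", improved)
print("Test cases where relevance regressed:", regressed_cases)
```

In this made-up data both metrics improve on aggregate, yet relevance drops on one test case, which is the kind of detail a per-test-case breakdown surfaces.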

## Best Practices

<AccordionGroup>
  <Accordion title="Use Comprehensive Metrics">
    Include multiple metrics that cover different aspects of your system (accuracy, relevance, safety, etc.) to get a complete picture of performance changes.
  </Accordion>

  <Accordion title="Test with Sufficient Data">
    Ensure your testset has enough examples for differences between runs to be statistically meaningful. On a small testset, a handful of test cases can swing the aggregate score and lead to misleading conclusions.
  </Accordion>

  <Accordion title="Document Your Changes">
    Keep track of what specific changes you made between runs so you can understand which modifications led to improvements.
  </Accordion>
</AccordionGroup>
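
One way to judge whether a testset is large enough is to estimate the uncertainty around the score difference yourself. The sketch below uses a paired bootstrap, which is a generic statistical technique and not necessarily the method Scorecard uses for its significance indicator. It assumes you have two equally long lists of scores for one metric, aligned so that position `i` in both lists refers to the same test case; the function name and defaults are illustrative.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-test-case difference (B - A).

    Returns (observed_mean_diff, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long, aligned score lists")

    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Resample the per-test-case differences with replacement.
    resampled_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )

    # Cut off alpha/2 of the resampled means on each side.
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - lower_idx - 1
    return sum(diffs) / n, resampled_means[lower_idx], resampled_means[upper_idx]


# Illustrative scores for one metric on the same eight test cases.
scores_a = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
scores_b = [0.7, 0.8, 0.6, 0.8, 0.7, 0.9, 0.6, 0.7]
observed, lower, upper = paired_bootstrap_ci(scores_a, scores_b)
print(f"mean difference {observed:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the interval includes zero, the observed difference could plausibly be noise, which is a signal to add more test cases before drawing conclusions.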

## Common Use Cases

### Model Version Testing

Compare different model versions (e.g., GPT-4 vs. GPT-4 Turbo) to understand the trade-offs between cost, speed, and quality.

### Prompt Engineering

Test different prompt formulations to find the most effective way to communicate instructions to your model.

### Parameter Tuning

Evaluate the impact of temperature, top-p, and other model parameters on output quality and consistency.
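
When tuning parameters, it helps to record exactly which settings differ between the two runs you compare. The sketch below diffs two configuration dictionaries; the keys, values, and the `diff_configs` helper are hypothetical and only illustrate the bookkeeping, independent of how you store configurations in practice.

```python
_MISSING = "<unset>"  # placeholder shown when a key exists in only one config


def diff_configs(baseline: dict, candidate: dict) -> dict:
    """Return {key: (baseline_value, candidate_value)} for every setting that differs."""
    changes = {}
    for key in sorted(set(baseline) | set(candidate)):
        old = baseline.get(key, _MISSING)
        new = candidate.get(key, _MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


# Illustrative configs: only the temperature changes between the two runs.
baseline_config = {"model": "gpt-4", "temperature": 0.7, "top_p": 1.0}
candidate_config = {"model": "gpt-4", "temperature": 0.2, "top_p": 1.0}

changes = diff_configs(baseline_config, candidate_config)
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")

if len(changes) > 1:
    print("Several settings changed; differences in metrics may be hard to attribute.")
```

Changing one setting per run keeps the comparison easy to interpret, since any difference in metrics can be traced to that single change.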

### System Architecture Changes

Compare different approaches to your LLM pipeline, such as RAG implementations, context window usage, or post-processing steps.

<Warning>
  Remember that A/B comparisons are only as good as your metrics and testset. Ensure your evaluation criteria accurately reflect real-world performance requirements.
</Warning>
