# Playground

> Test agents against testcases and score results with metrics — all in one visual workspace.

export const DarkLightImage = ({lightSrc, caption, alt, darkSrc = null, width = "1000"}) => {
  const getAbsoluteUrl = src => {
    if (src.startsWith('http://') || src.startsWith('https://')) {
      return src;
    }
    const currentUrl = typeof window !== 'undefined' ? window.location.origin : '';
    if (currentUrl.includes('.mintlify.app')) {
      const subdomain = currentUrl.split('.')[0].replace('https://', '');
      return `https://mintlify.s3.us-west-1.amazonaws.com/${subdomain}${src.startsWith('/') ? '' : '/'}${src}`;
    } else if (currentUrl === 'https://docs.scorecard.io') {
      return `https://mintlify.s3.us-west-1.amazonaws.com/scorecard-d65b5e8a${src.startsWith('/') ? '' : '/'}${src}`;
    } else {
      return `${currentUrl}${src.startsWith('/') ? '' : '/'}${src}`;
    }
  };
  const content = <>
      <img className="block dark:hidden" width={width} src={getAbsoluteUrl(lightSrc)} alt={alt} />
      <img className="hidden dark:block" width={width} src={getAbsoluteUrl(darkSrc || lightSrc.replace('light', 'dark'))} alt={alt} />
    </>;
  if (caption) {
    return <Frame caption={caption}>{content}</Frame>;
  } else {
    return content;
  }
};

The **Playground** lets you wire up testcases, an agent, and metrics in a single workspace, then run everything end-to-end. Results and scores appear inline so you can iterate without leaving the page.

<DarkLightImage lightSrc="/images/playground-light.png" darkSrc="/images/playground-dark.png" caption="Playground overview." alt="Screenshot of the Playground showing testcases, agent, results, and scores." />

## How It Works

The Playground is laid out as a left-to-right flow:

1. **Testcases** (left) — the inputs and expected outputs your agent will be tested against
2. **Agent** (center) — the prompt and settings (temperature, maximum length, etc.) that define your agent's behavior
3. **Results** (center-right) — the agent's actual responses for each testcase
4. **Evaluator → Scores** (right) — metrics score each result and show pass/fail with reasoning

Click **RUN** to execute the full flow.
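
As a rough mental model, the flow above can be sketched in plain JavaScript. This is an illustration only, not Scorecard's implementation or SDK: `runAgent`, `scoreResult`, `runPlayground`, and every field name are hypothetical, and the "agent" simply concatenates strings instead of calling a model.

```javascript
// Hypothetical sketch of the Playground flow; none of these names are Scorecard APIs.

// Stand-in for the Agent node: a real agent would call a model here.
function runAgent(agent, testcase) {
  return `${agent.prompt} ${testcase.inputs.query}`.trim();
}

// Stand-in for the Evaluator node: applies every metric to one result.
function scoreResult(metrics, testcase, result) {
  return metrics.map((metric) => {
    const score = metric.grade(testcase, result);
    return { metric: metric.name, score, passed: score >= metric.passingScore };
  });
}

// RUN: testcases -> agent -> results -> evaluator -> scores.
function runPlayground(testcases, agent, metrics) {
  return testcases.map((testcase) => {
    const result = runAgent(agent, testcase);
    return { testcase, result, scores: scoreResult(metrics, testcase, result) };
  });
}

const testcases = [{ inputs: { query: "2 + 2" }, expected: "4" }];
const agent = { prompt: "Answer briefly:", temperature: 0 };
const metrics = [
  {
    name: "mentions-query",
    passingScore: 1,
    // Toy grader: 1 if the result echoes the query, else 0.
    grade: (testcase, result) => (result.includes(testcase.inputs.query) ? 1 : 0),
  },
];

const rows = runPlayground(testcases, agent, metrics);
console.log(JSON.stringify(rows[0].scores));
```

Each row keeps the testcase, its result, and its scores together, mirroring how the Playground lines up the panels from left to right.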

## Testcases

Select a testset from the dropdown at the top of the left panel. The testcases in that testset appear as cards below, each summarizing its input fields. Click **+ Add testcases** to create new ones directly in the Playground.
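
To make the idea of a testset and its cards concrete, here is a small sketch of what the data might look like and how a card summary could be derived from the input fields. The object shape, the field names, and `summarizeInputs` are assumptions made for illustration, not Scorecard's actual schema.

```javascript
// Hypothetical shapes for illustration; field names are assumptions.
const testset = {
  name: "support-questions",
  testcases: [
    { inputs: { query: "How do I reset my password?", locale: "en" }, expected: "Use the reset link." },
    { inputs: { query: "Where is my invoice?", locale: "en" }, expected: "See Billing." },
  ],
};

// Build a short one-line summary of a testcase's input fields, like a card label.
function summarizeInputs(testcase, maxLength = 40) {
  const text = Object.entries(testcase.inputs)
    .map(([field, value]) => `${field}: ${value}`)
    .join(", ");
  // Truncate long summaries so every card stays within maxLength characters.
  return text.length > maxLength ? `${text.slice(0, maxLength - 3)}...` : text;
}

const cards = testset.testcases.map((tc) => summarizeInputs(tc));
console.log(cards);
```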

## Agent

The Agent node is where you configure what gets sent to the model.

* **Prompt tab** — write your prompt using Jinja syntax. Reference testcase fields with `{{all.inputs}}` or specific fields like `{{inputs.query}}`.
* **Settings tab** — choose the model, temperature, and other parameters.
* **Messages** — click **+ ADD MESSAGE** to add messages and set roles (System, User, Assistant).

The version indicator (e.g. "V1 Prod") shows which agent version you're editing.
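
To show how a placeholder such as `{{inputs.query}}` is filled in per testcase, the sketch below uses a deliberately simplified stand-in for Jinja variable substitution. It handles only dotted variable lookups, not filters, loops, or conditionals, and it is not the engine the Playground uses. `renderTemplate` and the message objects are hypothetical.

```javascript
// Simplified stand-in for Jinja variable substitution (not a full Jinja engine).
// Replaces {{ dotted.path }} placeholders with values looked up in `context`.
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
    const value = path
      .split(".")
      .reduce((current, key) => (current == null ? undefined : current[key]), context);
    // Leave unknown placeholders untouched so missing fields are easy to spot.
    return value === undefined ? match : String(value);
  });
}

// Hypothetical message list mirroring the System/User roles in the Agent node.
const messages = [
  { role: "system", content: "You are a concise support assistant." },
  { role: "user", content: "Question: {{inputs.query}}" },
];

const testcase = { inputs: { query: "How do I reset my password?" } };

// Render every message against the testcase's input fields.
const rendered = messages.map((message) => ({
  role: message.role,
  content: renderTemplate(message.content, { inputs: testcase.inputs }),
}));

console.log(rendered[1].content);
```

Leaving unresolved placeholders in place is a design choice for this sketch: a typo in a field name then shows up verbatim in the rendered prompt instead of silently becoming an empty string.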

## Results

After a run, each testcase gets a result card showing the agent's response. Flow lines connect each testcase to its corresponding result.

## Evaluator and Scores

The **Evaluator** node in the top-right holds your metrics. Click it to choose which metrics to apply; its label shows how many are attached (e.g. "1 METRICS").

After scoring completes, each result gets a score card on the far right showing:

* **Pass/Fail** status per metric
* **Score** value (e.g. 3/5)
* **Reasoning** explaining why the metric assigned that score

<Tip>
  Update a metric's guidelines and re-run to see how scoring changes — no need to re-execute the agent.
</Tip>
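
The reason re-scoring can skip the agent is that results only depend on the testcases and the agent, while scores depend on the results and the metric. A minimal sketch of that separation follows, assuming a toy length-based metric. All names here (`runAgent`, `scoreWith`, `guidelines.maxChars`) are hypothetical and do not reflect how Scorecard stores metrics.

```javascript
// Hypothetical sketch: cache agent results so metric changes only trigger re-scoring.
let agentCalls = 0;

// Stand-in for an expensive model call.
function runAgent(testcase) {
  agentCalls += 1;
  return `Answer to: ${testcase.inputs.query}`;
}

// Toy metric: 5/5 when the response fits the character limit in its guidelines, else 2/5.
function scoreWith(metric, result) {
  const score = result.length <= metric.guidelines.maxChars ? 5 : 2;
  return {
    passed: score >= metric.passingScore,
    score: `${score}/5`,
    reasoning: `Response has ${result.length} characters; limit is ${metric.guidelines.maxChars}.`,
  };
}

const testcases = [{ inputs: { query: "reset password" } }];
const results = testcases.map((testcase) => runAgent(testcase)); // agent runs once

const metric = { name: "brevity", passingScore: 3, guidelines: { maxChars: 20 } };
const firstScores = results.map((result) => scoreWith(metric, result));

// Update the metric's guidelines and re-score the cached results only.
metric.guidelines.maxChars = 50;
const secondScores = results.map((result) => scoreWith(metric, result));

console.log(firstScores[0].score, secondScores[0].score, agentCalls);
```

The agent call counter stays at one across both scoring passes, which is the behavior the tip describes.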

## Workflows

**Iterate on a prompt:**

Configure agent → RUN → review scores → adjust prompt → re-run

Start here when your outputs are close but inconsistent. Use score reasoning to pinpoint which instruction or example to refine before your next run.

**Tune metrics:**

RUN → read score reasoning → update metric guidelines → re-run

Use this workflow when agent behavior looks right but grading feels off. Tightening guidelines helps metrics align with your real quality bar.

**Expand test coverage:**

Review scores → add edge-case testcases → RUN → verify

Use failures and near-misses to identify gaps in your dataset. Adding targeted edge cases improves confidence that your agent generalizes beyond happy paths.
