
# Metrics Quickstart

> Create metrics, group them, run evaluations, and read scores.

export const DarkLightImage = ({lightSrc, caption, alt, darkSrc = null, width = "1000"}) => {
  const getAbsoluteUrl = src => {
    if (src.startsWith('http://') || src.startsWith('https://')) {
      return src;
    }
    const currentUrl = typeof window !== 'undefined' ? window.location.origin : '';
    if (currentUrl.includes('.mintlify.app')) {
      const subdomain = currentUrl.split('.')[0].replace('https://', '');
      return `https://mintlify.s3.us-west-1.amazonaws.com/${subdomain}${src.startsWith('/') ? '' : '/'}${src}`;
    } else if (currentUrl === 'https://docs.scorecard.io') {
      return `https://mintlify.s3.us-west-1.amazonaws.com/scorecard-d65b5e8a${src.startsWith('/') ? '' : '/'}${src}`;
    } else {
      return `${currentUrl}${src.startsWith('/') ? '' : '/'}${src}`;
    }
  };
  const content = <>
      <img className="block dark:hidden" width={width} src={getAbsoluteUrl(lightSrc)} alt={alt} />
      <img className="hidden dark:block" width={width} src={getAbsoluteUrl(darkSrc || lightSrc.replace('light', 'dark'))} alt={alt} />
    </>;
  if (caption) {
    return <Frame caption={caption}>{content}</Frame>;
  } else {
    return content;
  }
};

<DarkLightImage lightSrc="/images/metrics/metrics-overview-light.png" darkSrc="/images/metrics/metrics-overview-dark.png" caption="Metrics page with Metrics, Groups, and Templates tabs." alt="Metrics page overview" />

Metrics define what “good” looks like for your LLM. You describe the criteria (e.g., helpfulness, groundedness, safety), and Scorecard turns them into repeatable scores you can track across runs and over time.

<Steps>
  <Step title="Open Metrics and explore templates">
    Go to your project’s **Metrics** page. Start fast by copying a proven template, then tailor the guidelines to your domain.

    <DarkLightImage lightSrc="/images/metrics/metrics-templates-light.png" darkSrc="/images/metrics/metrics-templates-dark.png" caption="Templates list with Create from Template." alt="Metric templates list" />
  </Step>

  <Step title="Create a metric">
    You can also create a metric from scratch. Provide a name, a description, and clear <strong>guidelines</strong>, then choose an <strong>Evaluation Type</strong> and an <strong>Output Type</strong>.

    <Note>
      <strong>Guidelines matter.</strong> Describe what to reward and what to penalize, and include 1–2 concise examples if helpful. These instructions become the core of the evaluator prompt.
    </Note>

    <Tabs>
      <Tab title="AI‑scored">
        Uses a model to apply your guidelines consistently and at scale. Pick the evaluator model and keep temperature low for repeatability.

        <DarkLightImage lightSrc="/images/metrics/metrics-ai-detail-light.png" darkSrc="/images/metrics/metrics-ai-detail-dark.png" caption="AI metric – evaluator model, output type, and evaluation guidelines." alt="AI metric detail with model and output type settings" />
      </Tab>

      <Tab title="Human‑scored">
        Best for nuanced judgments or gold‑standard baselines. Select <strong>Human</strong> as the evaluation type and write clear instructions for reviewers.

        <DarkLightImage lightSrc="/images/metrics/metrics-human-detail-light.png" darkSrc="/images/metrics/metrics-human-detail-dark.png" caption="Human evaluation – provide guidelines for reviewers." alt="Human metric detail with evaluation guidelines" />
      </Tab>

      <Tab title="Heuristic (SDK)">
        Deterministic, code‑based checks (e.g., latency, regex, policy flags). Select <strong>Heuristic (SDK)</strong> as the evaluation type and provide a scorer function in Python or TypeScript.

        <DarkLightImage lightSrc="/images/metrics/metrics-heuristic-detail-light.png" darkSrc="/images/metrics/metrics-heuristic-detail-dark.png" caption="Heuristic metric – Python or TypeScript scorer function." alt="Heuristic metric detail with scorer function" />
      </Tab>

      <Tab title="Critic Agent (coming soon)">
        An agentic evaluator that can use tools, browse context, and reason over multiple steps before producing a score. Stay tuned for updates.
      </Tab>
    </Tabs>
  </Step>

  <Step title="Go to the Records page and select records">
    Navigate to your project's [Records page](/features/records). Select the records you want to score, then click the **Score Records** button.

    <DarkLightImage lightSrc="/images/metrics/metrics-records-page-light.png" darkSrc="/images/metrics/metrics-records-page-dark.png" caption="Records page with selected records and Score Records button." alt="Records page showing selected records and Score Records button" />
  </Step>

  <Step title="Choose metrics and score">
    In the **Score Records** modal, select one or more metrics to evaluate against, then click **Score**.

    <DarkLightImage lightSrc="/images/metrics/metrics-score-records-modal-light.png" darkSrc="/images/metrics/metrics-score-records-modal-dark.png" caption="Score Records modal – select metrics to evaluate." alt="Score Records modal with metrics selected" />
  </Step>

  <Step title="View scores in the record panel">
    Once scoring completes, click any record to open the side panel. View scores, inputs, outputs, and evaluation details.

    <DarkLightImage lightSrc="/images/metrics/metrics-record-scores-panel-light.png" darkSrc="/images/metrics/metrics-record-scores-panel-dark.png" caption="Record detail panel with metric scores." alt="Record side panel showing metric scores, inputs, and outputs" />
  </Step>
</Steps>
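
The fields from the "Create a metric" step can be sketched as a small data structure with a few sanity checks. This is an illustrative model only: the class `MetricDraft`, its field names, the allowed values, and the temperature threshold are all assumptions made for this example, not Scorecard's API schema. See the API Reference for the real request format.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; not the Scorecard API schema.
EVAL_TYPES = {"ai", "human", "heuristic"}
OUTPUT_TYPES = {"boolean", "integer"}


@dataclass
class MetricDraft:
    name: str
    description: str
    guidelines: str
    eval_type: str
    output_type: str
    temperature: float = 0.0  # only meaningful for AI-scored metrics

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the draft looks complete."""
        issues = []
        if not self.name.strip():
            issues.append("name is empty")
        if not self.guidelines.strip():
            issues.append("guidelines are empty")
        if self.eval_type not in EVAL_TYPES:
            issues.append(f"unknown evaluation type: {self.eval_type}")
        if self.output_type not in OUTPUT_TYPES:
            issues.append(f"unknown output type: {self.output_type}")
        # 0.3 is an arbitrary cutoff chosen for this sketch.
        if self.eval_type == "ai" and self.temperature > 0.3:
            issues.append("temperature is high; scores may vary between runs")
        return issues


draft = MetricDraft(
    name="Groundedness",
    description="Checks that the answer is supported by the retrieved context.",
    guidelines=(
        "Reward answers whose claims all appear in the context. "
        "Penalize any claim that the context does not support."
    ),
    eval_type="ai",
    output_type="boolean",
)
print(draft.problems())
```

Writing the guidelines as an explicit "reward / penalize" pair, as in the draft above, mirrors the advice in the guidelines note and tends to make the evaluator's job less ambiguous.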

## Metric types

* **AI‑scored**: Uses a model to apply your guidelines consistently and at scale.
* **Human‑scored**: Great for nuanced judgments or gold‑standard baselines.
* **Heuristic (SDK)**: Deterministic, code‑based checks via the SDK (e.g., latency, regex, policy flags).
* **Critic Agent** *(coming soon)*: An agentic evaluator that reasons over multiple steps with tool use.

Each metric also has an output type: <strong>Boolean</strong> (pass/fail) or <strong>Integer (1–5)</strong>.
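
To make the Heuristic type and the two output types concrete, here is a minimal sketch of two deterministic scorers: a regex check that returns a Boolean and a latency check that returns an integer from 1 to 5. The function names, the record shape (a dict with `output` and `latency_ms`), and the latency thresholds are assumptions for this example; this page does not show the exact scorer signature the SDK expects.

```python
import re

# Hypothetical record shape: a dict with "output" text and "latency_ms".
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def no_email_leak(record: dict) -> bool:
    """Boolean metric: pass only if the output contains no email address."""
    return EMAIL_PATTERN.search(record["output"]) is None


def latency_score(record: dict) -> int:
    """Integer (1-5) metric: faster responses score higher. Thresholds are arbitrary."""
    thresholds_ms = [500, 1000, 2000, 4000]  # upper bounds for scores 5, 4, 3, 2
    for score, limit in zip((5, 4, 3, 2), thresholds_ms):
        if record["latency_ms"] <= limit:
            return score
    return 1  # slower than every threshold


record = {"output": "Contact support through the in-app chat.", "latency_ms": 850}
print(no_email_leak(record), latency_score(record))
```

The Boolean scorer suits a hard requirement (an email address either leaked or it did not), while the 1-5 scorer captures degrees of quality.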

## Second‑party metrics (optional)

If you already use established evaluation libraries, you can mirror those metrics in Scorecard:

* **MLflow genai**: Relevance, Answer Relevance, Faithfulness, Answer Correctness, Answer Similarity
* **RAGAS**: Faithfulness, Answer Relevancy, Context Recall, Context Precision, Context Relevancy, Answer Semantic Similarity

Copy a matching template, then tailor the guidelines to your product domain.

## Best practices for strong metrics

* **Be specific.** Minimize ambiguity in guidelines; include “what not to do.”
* **Pick the right output type.** Use Boolean for hard requirements and Integer (1–5) for nuance.
* **Keep temperature low.** Use a temperature near 0 for repeatable AI scoring.
* **Pilot and tighten.** Run on 10–20 cases, then refine wording to reduce false positives.
* **Bundle into groups.** Combine complementary checks (e.g., Relevance + Faithfulness + Safety) to keep evaluations consistent.
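
One way to approach the "pilot and tighten" step is to compare a Boolean metric's verdicts with human labels on the pilot cases and look at where they disagree. The sketch below assumes you have exported both verdicts yourself; the data, the function name `pilot_summary`, and the tuple layout are made up for illustration.

```python
# Hypothetical pilot data: (metric_passed, human_passed) for each reviewed case.
pilot = [
    (True, True), (True, False), (False, False), (True, True), (False, True),
    (True, True), (False, False), (True, False), (True, True), (False, False),
]


def pilot_summary(pairs):
    """Compare metric verdicts with human labels on a small pilot set."""
    # False positive: the metric passed a case the human reviewer failed.
    false_pos = sum(1 for metric, human in pairs if metric and not human)
    # False negative: the metric failed a case the human reviewer passed.
    false_neg = sum(1 for metric, human in pairs if not metric and human)
    human_fail = sum(1 for _, human in pairs if not human)
    human_pass = len(pairs) - human_fail
    return {
        "agreement": (len(pairs) - false_pos - false_neg) / len(pairs),
        "false_positive_rate": false_pos / human_fail if human_fail else 0.0,
        "false_negative_rate": false_neg / human_pass if human_pass else 0.0,
    }


print(pilot_summary(pilot))
```

If the false positive rate is high, the guidelines are probably too lenient or too vague about what to penalize; rereading the disagreeing cases usually suggests which wording to tighten.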

<Note>
  Looking for vetted, ready‑to‑use metrics? Explore <a href="/features/metrics" className="underline">Best‑in‑Class Metrics</a> and copy templates (including MLflow and RAGAS). You can also create deterministic checks via the SDK using <strong>Heuristic</strong> metrics.
</Note>

## Related resources

<Card title="Runs" icon="play" href="/features/runs">
  Create and analyze evaluations
</Card>

<Card title="A/B Comparison" icon="git-merge" href="/features/a-b-comparison">
  Compare two runs side‑by‑side
</Card>

<Card title="Best‑in‑Class Metrics" icon="star" href="/features/metrics">
  Explore curated, proven metrics
</Card>

<Card title="API Reference" icon="code" href="/api-reference/create-metric">
  Create metrics via API
</Card>
