
# Metrics

export const DarkLightImage = ({lightSrc, caption, alt, darkSrc = null, width = "1000"}) => {
  const getAbsoluteUrl = src => {
    if (src.startsWith('http://') || src.startsWith('https://')) {
      return src;
    }
    const currentUrl = typeof window !== 'undefined' ? window.location.origin : '';
    if (currentUrl.includes('.mintlify.app')) {
      const subdomain = currentUrl.split('.')[0].replace('https://', '');
      return `https://mintlify.s3.us-west-1.amazonaws.com/${subdomain}${src.startsWith('/') ? '' : '/'}${src}`;
    } else if (currentUrl === 'https://docs.scorecard.io') {
      return `https://mintlify.s3.us-west-1.amazonaws.com/scorecard-d65b5e8a${src.startsWith('/') ? '' : '/'}${src}`;
    } else {
      return `${currentUrl}${src.startsWith('/') ? '' : '/'}${src}`;
    }
  };
  const content = <>
      <img className="block dark:hidden" width={width} src={getAbsoluteUrl(lightSrc)} alt={alt} />
      <img className="hidden dark:block" width={width} src={getAbsoluteUrl(darkSrc || lightSrc.replace('light', 'dark'))} alt={alt} />
    </>;
  if (caption) {
    return <Frame caption={caption}>{content}</Frame>;
  } else {
    return content;
  }
};

<Note>
  Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.
</Note>

When evaluating LLM applications, you typically start with a "vibe check": manually testing prompts to see whether the responses make sense. As your application evolves, you need a systematic way to quantify and consistently measure response quality. Metrics provide this standard of measurement for LLM evaluation.

## Assess Your LLM Quality With Scorecard's Metrics

Scorecard's metric system is organized into three main components:

* **Metrics**: Individual evaluation criteria that assess specific aspects of your LLM's performance
* **Metric Groups**: Collections of related metrics for comprehensive evaluation
* **Templates**: Pre-built metric configurations you can copy and customize

The sections below cover each in turn: starting from templates, defining custom metrics, and organizing metrics into groups.

### Metric Templates

Our Scorecard Core Metrics represent industry-standard benchmarks for LLM performance, validated by our team of evaluation experts. Explore these templates on the Metrics page under "Templates".

<DarkLightImage caption="Overview of Scorecard Core Metrics in Templates" alt="Overview of Scorecard Core Metrics in Templates" lightSrc="/images/metrics/metrics-templates-light.png" darkSrc="/images/metrics/metrics-templates-dark.png" />

Copy any template and customize it for your specific use case.

<DarkLightImage caption="Metric Template to Copy" alt="Metric Template to Copy" lightSrc="/images/metrics/metrics-template-light.png" darkSrc="/images/metrics/metrics-template-dark.png" />

#### MLflow-Inspired Metrics

Scorecard supports MLflow-style metrics and adds capabilities such as aggregation, A/B comparison, and iteration. Available metrics include:

* **Relevance**: Evaluates how well the response aligns with the input and context
* **Answer Relevance**: Assesses relevance and applicability to the specific query
* **Faithfulness**: Measures factual consistency with provided context
* **Answer Correctness**: Evaluates accuracy against ground truth
* **Answer Similarity**: Assesses semantic similarity to expected responses

View full prompts on the Metrics page or in the [MLflow GitHub repository](https://github.com/mlflow/mlflow/blob/master/mlflow/metrics/genai/prompts/v1.py).

#### RAGAS-Inspired Metrics for RAG Pipelines

Scorecard provides RAGAS-style metrics for evaluating Retrieval Augmented Generation (RAG) systems:

**Component-Wise:**

* **Faithfulness**: Factual consistency with context
* **Answer Relevancy**: Pertinence to the prompt
* **Context Recall**: Retrieved context alignment with expected response
* **Context Precision**: Ranking of ground-truth relevant items
* **Context Relevancy**: Relevance of retrieved context to query

**End-to-End:**

* **Answer Semantic Similarity**: Semantic resemblance to ground truth
* **Answer Correctness**: Accuracy compared to ground truth

View full prompts on the Metrics page or in the [RAGAS GitHub repository](https://github.com/explodinggradients/ragas/tree/main/src/ragas/metrics).

<Note>
  Templates for common MLflow and RAGAS metrics are available under Templates. Copy a template (e.g., Relevance, Faithfulness) and tailor the guidelines to your domain.
</Note>

### Define Custom Metrics for Your LLM Use Case

<DarkLightImage caption="Adding a New Custom Metric" alt="Adding a New Custom Metric" lightSrc="/images/metrics/metrics-overview-light.png" darkSrc="/images/metrics/metrics-overview-dark.png" />

On the Metrics page, click "+ New Metric" to create a custom metric. Configure:

* **Metric Name**: Human-readable name
* **Metric Guidelines**: Natural language instructions for evaluation
* **Evaluation Type**:
  * **AI**: Uses Metric Guidelines as prompt for an AI model
  * **Human**: Manual evaluation by subject-matter experts
  * **Heuristic (Code)**: Custom Python or TypeScript logic (see [Heuristic Metrics](#heuristic-metrics))
* **Output Type**: Boolean, Integer (1–5), or Float (0.0–1.0)

<DarkLightImage caption="New Custom Metric" alt="New Custom Metric" lightSrc="/images/metrics/metrics-create-a-light.png" darkSrc="/images/metrics/metrics-create-a-dark.png" />

### Output Types

#### Boolean Output Type

* **Range**: true/false
* **Pass/Fail**: Direct pass/fail representation
* **Use cases**: Format checks, refusals, policy/guardrails, presence/absence
* **Aggregation**: Pass ratios and counts

Example model output for a boolean metric:

```json theme={null}
{
  "reasoning": "The model correctly refused unsafe content.",
  "binaryScore": true
}
```
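
As a rough illustration of the pass-ratio aggregation, the sketch below parses several boolean outputs of the shape shown above and counts how many passed. The `pass_ratio` helper and the sample data are hypothetical; Scorecard computes these aggregates for you, so this only shows the arithmetic.

```python
import json


def pass_ratio(raw_outputs: list[str]) -> tuple[int, int, float]:
    """Return (passed, total, ratio) for a list of boolean metric outputs."""
    results = [json.loads(raw) for raw in raw_outputs]
    # Count only explicit `true` values as passes.
    passed = sum(1 for result in results if result["binaryScore"] is True)
    total = len(results)
    ratio = passed / total if total else 0.0
    return passed, total, ratio


# Hypothetical outputs from four scored records.
sample = [
    '{"reasoning": "Refused unsafe content.", "binaryScore": true}',
    '{"reasoning": "Answered an unsafe request.", "binaryScore": false}',
    '{"reasoning": "Refused unsafe content.", "binaryScore": true}',
    '{"reasoning": "Refused unsafe content.", "binaryScore": true}',
]

passed, total, ratio = pass_ratio(sample)
print(f"{passed}/{total} passed ({ratio:.0%})")  # 3/4 passed (75%)
```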

#### Integer Output Type

* **Range**: 1–5 (higher is better)
* **Pass/Fail**: Set a `passingThreshold` (e.g., 4). Passes if `intScore ≥ threshold`
* **Use cases**: Rubric-based quality judgments (helpfulness, factuality, completeness)
* **Aggregation**: Means and distributions

Example model output for an integer metric:

```json theme={null}
{
  "reasoning": "Mostly correct but missing minor details.",
  "intScore": 4
}
```
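
The threshold rule and the mean/distribution aggregation for integer scores can be sketched as follows. The `summarize_int_scores` helper is hypothetical and assumes a non-empty list of scores already extracted from `intScore`; it is not part of any Scorecard SDK.

```python
from collections import Counter
from statistics import mean


def summarize_int_scores(scores: list[int], passing_threshold: int = 4) -> dict:
    """Summarize 1-5 integer scores: mean, distribution, and pass count."""
    for value in scores:
        if not 1 <= value <= 5:
            raise ValueError(f"Integer scores must be in 1-5, got {value}")
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        # Include every level so empty buckets show up as 0.
        "distribution": {level: counts.get(level, 0) for level in range(1, 6)},
        # A record passes when its score is at or above the threshold.
        "passed": sum(1 for value in scores if value >= passing_threshold),
    }


summary = summarize_int_scores([5, 4, 4, 3, 2])
print(summary)
```

With a threshold of 4, three of the five sample scores pass, and the distribution makes it easy to see whether failures cluster just below the threshold or far from it.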

#### Float Output Type

Normalized score between 0.0 and 1.0 for graded, continuous measurement.

* **Range**: 0.0–1.0 (higher is better)
* **Pass/Fail**: Set a `passingThreshold` (e.g., 0.90). Passes if `floatScore ≥ threshold`
* **Use cases**: Semantic similarity, confidence/uncertainty, coverage, quality scores
* **Aggregation**: Means for trend tracking

Example model output for a float metric:

```json theme={null}
{
  "reasoning": "The response is mostly correct with minor omissions.",
  "floatScore": 0.87
}
```
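
For trend tracking, one simple approach is to compare the mean float score of two runs. The sketch below assumes you have already collected the `floatScore` values for each run; `summarize_run` and `mean_delta` are hypothetical helpers, and the scores are made up for illustration.

```python
from statistics import mean


def summarize_run(float_scores: list[float], passing_threshold: float = 0.90) -> dict:
    """Return the mean score and the share of records at or above the threshold."""
    passed = sum(1 for value in float_scores if value >= passing_threshold)
    return {"mean": mean(float_scores), "pass_rate": passed / len(float_scores)}


def mean_delta(baseline: list[float], candidate: list[float]) -> float:
    """Positive values mean the candidate run scored higher on average."""
    return mean(candidate) - mean(baseline)


# Hypothetical scores from a baseline run and a candidate run.
baseline = [0.80, 0.90, 0.70]
candidate = [0.90, 0.95, 0.85]

print(summarize_run(candidate))
print(f"Mean change: {mean_delta(baseline, candidate):+.2f}")
```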

## Automated Scoring

As testsets grow, manual evaluation becomes time-consuming. Subject-matter experts (SMEs) can become bottlenecks as the number of testcases increases. Scorecard's AI-powered scoring automates evaluation, allowing SMEs to focus on complex edge cases while the system handles well-defined metrics.

### AI-Based Metrics

When creating a metric, select "AI" as the evaluation type for automated scoring.

<DarkLightImage caption="New Metric With AI-Scoring" alt="New Metric With AI-Scoring" lightSrc="/images/metrics/metrics-create-b-light.png" darkSrc="/images/metrics/metrics-create-b-dark.png" />

#### Configuration Modes

* **Basic Mode**: Define metric guidelines using natural language instructions
* **Advanced Mode**: Modify the full prompt template for sophisticated evaluation logic

<Warning>
  Be very specific with metric guidelines—they form the core instruction for AI scoring.
</Warning>
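
To see why specificity matters, here is a rough sketch of how guidelines could end up inside a judge prompt. The template text and the `build_judge_prompt` helper are hypothetical and are not Scorecard's actual prompt; only the `reasoning` and `intScore` field names come from the output examples on this page.

```python
# Hypothetical template: an illustration, not Scorecard's real prompt.
PROMPT_TEMPLATE = """You are evaluating an LLM response.

Guidelines:
{guidelines}

Input:
{inputs}

Response:
{outputs}

Reply with JSON containing "reasoning" (string) and "intScore" (integer from 1 to 5)."""


def build_judge_prompt(guidelines: str, inputs: str, outputs: str) -> str:
    """Fill the template with the metric guidelines and one record."""
    return PROMPT_TEMPLATE.format(guidelines=guidelines, inputs=inputs, outputs=outputs)


vague = "Check if the answer is good."
specific = (
    "Score 5 if every claim is supported by the provided context. "
    "Score 3 if one minor claim is unsupported. "
    "Score 1 if the main claim contradicts the context."
)

prompt = build_judge_prompt(specific, "What is the refund window?", "30 days.")
print(prompt)
```

The `specific` guidelines tell the model what each score level means, while the `vague` version leaves that to the model's interpretation and tends to produce less consistent scores.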

#### Evaluation Settings

Configure AI evaluation parameters:

* **Model Selection**: Choose models like GPT-4o
* **Temperature**: Control randomness (typically 0 for consistency)
* **Additional Parameters**: Model-specific settings

### Automatic Scoring

When you score records (from the testset page, the Records page, or through GitHub Actions), AI and Heuristic metrics are evaluated automatically. You can view scoring progress and results on the Records page.

<Tip>
  New to Scorecard? See the [UI Quickstart](/intro/ui-quickstart) to learn how to score your first records.
</Tip>

<DarkLightImage caption="Run Scoring in the Scorecard UI" alt="Run Scoring in the Scorecard UI" lightSrc="/images/metrics/metrics-run-light.png" darkSrc="/images/metrics/metrics-run-dark.png" />

If needed, you can score again by selecting records and clicking "Score Records".

#### Score Explanations

Scorecard provides explanations for each score, helping you understand and validate the AI's evaluation reasoning.

<DarkLightImage caption="Score Explained by the Scorecard AI Model" alt="Score Explained by the Scorecard AI Model" lightSrc="/images/metrics/metrics-scores-detail-light.png" darkSrc="/images/metrics/metrics-scores-detail-dark.png" />

### Heuristic Metrics (Code-Based Evaluation)

Write custom evaluation logic in Python or TypeScript for deterministic, rule-based scoring. Typical uses include:

* Checking for specific keywords or patterns
* Validating JSON structure or format
* Computing text similarity scores
* Measuring response length or complexity
* Custom business logic checks

#### Creating a Heuristic Metric

Select **Heuristic** as the evaluation type to access the code editor.

<DarkLightImage caption="Creating a heuristic metric with Python code" alt="Screenshot of the create metric modal with Python heuristic code" lightSrc="/images/changelog-heuristic-metric-light.png" darkSrc="/images/changelog-heuristic-metric-dark.png" />

#### Writing Heuristic Code

Your code receives the record data and must return a score. Here's the structure:

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    def score(inputs: dict, outputs: dict, expected: dict) -> dict:
        """
        Evaluate the output and return a score.
        
        Args:
            inputs: The input fields from the testcase
            outputs: The generated output from your system
            expected: The expected output fields from the testcase
        
        Returns:
            A dict with 'score' (bool, int 1-5, or float 0.0-1.0) 
            and optional 'reasoning'
        """
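        # Example: check whether the output contains the expected keyword.
        # `or ""` guards against fields that are present but set to None.
        output_text = outputs.get("response") or ""
        expected_keyword = expected.get("keyword") or ""

        # An empty keyword would match any text, so treat it as a failure.
        if not expected_keyword:
            return {"score": False, "reasoning": "No expected keyword provided"}

        if expected_keyword.lower() in output_text.lower():
            return {"score": True, "reasoning": "Keyword found in response"}
        return {"score": False, "reasoning": "Keyword not found"}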
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={null}
    function score(
      inputs: Record<string, any>,
      outputs: Record<string, any>,
      expected: Record<string, any>
    ): { score: boolean | number; reasoning?: string } {
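      // Example: check that the response meets a minimum length.
      // `??` keeps an explicit minLength of 0 instead of falling back to 100.
      const response: string = outputs.response ?? "";
      const minLength: number = expected.minLength ?? 100;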
      
      if (response.length >= minLength) {
        return { score: true, reasoning: `Response meets minimum length of ${minLength}` };
      } else {
        return { score: false, reasoning: `Response too short: ${response.length} < ${minLength}` };
      }
    }
    ```
  </Tab>
</Tabs>
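
The same function shape covers format checks. The sketch below validates that the response is a JSON object containing a set of required keys. It assumes the output lives in a `response` field, as in the example above, and that the testcase supplies a hypothetical `required_keys` list; adjust both names to match your own fields.

```python
import json


def score(inputs: dict, outputs: dict, expected: dict) -> dict:
    """Pass only if the response is a JSON object with all required keys."""
    raw = outputs.get("response", "")
    required = expected.get("required_keys", [])  # hypothetical testcase field

    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return {"score": False, "reasoning": f"Response is not valid JSON: {exc}"}

    if not isinstance(parsed, dict):
        return {"score": False, "reasoning": "Response JSON is not an object"}

    missing = [key for key in required if key not in parsed]
    if missing:
        return {"score": False, "reasoning": f"Missing keys: {missing}"}
    return {"score": True, "reasoning": "Valid JSON with all required keys"}


# Quick local check with made-up data before pasting into the editor.
result = score(
    {},
    {"response": '{"name": "Ada", "age": 36}'},
    {"required_keys": ["name", "age"]},
)
print(result)
```

Calling the function locally with a few sample dicts, as in the last lines, is a cheap way to catch mistakes before running it against real records.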

#### Secure Sandbox Execution

Code runs in a secure, isolated sandbox ensuring:

* **Safety**: No external resource access
* **Consistency**: Deterministic results
* **Performance**: Optimized execution

<Note>
  The sandbox has limited access to external libraries. Common utilities like string manipulation and JSON parsing are available. Contact support if you need specific libraries for your use case.
</Note>
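
Because library access is limited, it can help to build heuristics from built-in string operations alone. As one possible approach, the sketch below computes a word-overlap (Jaccard) similarity and returns it as a float score. The `reference` field on `expected` is a hypothetical name, and this is a simple stand-in for a similarity measure, not the method Scorecard uses.

```python
def score(inputs: dict, outputs: dict, expected: dict) -> dict:
    """Return the Jaccard overlap of lowercase word sets as a 0.0-1.0 score."""
    response_tokens = set(str(outputs.get("response", "")).lower().split())
    reference_tokens = set(str(expected.get("reference", "")).lower().split())

    union = response_tokens | reference_tokens
    if not union:
        return {"score": 0.0, "reasoning": "Both texts are empty"}

    shared = response_tokens & reference_tokens
    overlap = len(shared) / len(union)
    return {
        "score": round(overlap, 2),
        "reasoning": f"{len(shared)} of {len(union)} distinct words are shared",
    }


example = score({}, {"response": "The cat sat"}, {"reference": "the cat ran"})
print(example)
```

Word overlap ignores word order and synonyms, so treat it as a coarse signal and prefer an AI metric when semantic similarity matters.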

### Metric Groups

Group related metrics for consistent, repeatable evaluation. With a Metric Group, you apply several metrics in one step when scoring instead of selecting each one manually. Create groups for specific use cases, such as RAG applications or translation tasks.

<DarkLightImage caption="Overview of Metric Groups" alt="Overview of Metric Groups" lightSrc="/images/metrics/metric-groups-light.png" darkSrc="/images/metrics/metrics-groups-dark.png" />

View and create Metric Groups on the Metrics page under "Metric Groups". Provide:

* **Name**: Human-readable identifier
* **Description**: Short summary of the group's purpose

<DarkLightImage caption="Defining a New Metric Group" alt="Defining a New Metric Group" lightSrc="/images/metrics/metrics-create-group-light.png" darkSrc="/images/metrics/metrics-create-group-dark.png" />

Click "Select Metrics" to choose which metrics to include in the group.
