Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.
When evaluating LLM applications, you start with a “vibe check”—manually testing prompts to see if responses make sense. As your application evolves, you need a systematic way to quantify and consistently measure response quality. Metrics provide this standard of measurement for LLM evaluation.

Assess Your LLM Quality With Scorecard’s Metrics

Scorecard’s metric system is organized into three main components:
  • Metrics: Individual evaluation criteria that assess specific aspects of your LLM’s performance
  • Metric Groups: Collections of related metrics for comprehensive evaluation
  • Templates: Pre-built metric configurations you can copy and customize
Scorecard helps you define and organize metrics for LLM evaluation in several ways.

Metric Templates

Our Scorecard Core Metrics represent industry-standard benchmarks for LLM performance, validated by our team of evaluation experts. Explore these templates on the Metrics page under “Templates”.
Overview of Scorecard Core Metrics in Templates
Copy any template and customize it for your specific use case.
Metric Template to Copy

MLflow-Inspired Metrics

Scorecard provides support for MLflow-style metrics with additional capabilities like aggregation, A/B comparison, and iteration. Available metrics include:
  • Relevance: Evaluates how well the response aligns with the input and context
  • Answer Relevance: Assesses relevance and applicability to the specific query
  • Faithfulness: Measures factual consistency with provided context
  • Answer Correctness: Evaluates accuracy against ground truth
  • Answer Similarity: Assesses semantic similarity to expected responses
View full prompts on the Metrics page or in the MLflow GitHub repository.

RAGAS-Inspired Metrics for RAG Pipelines

Scorecard provides RAGAS-style metrics for evaluating Retrieval Augmented Generation (RAG) systems.
Component-Wise:
  • Faithfulness: Factual consistency with context
  • Answer Relevancy: Pertinence to the prompt
  • Context Recall: Retrieved context alignment with expected response
  • Context Precision: Ranking of ground-truth relevant items
  • Context Relevancy: Relevance of retrieved context to query
End-to-End:
  • Answer Semantic Similarity: Semantic resemblance to ground truth
  • Answer Correctness: Accuracy compared to ground truth
View full prompts on the Metrics page or in the RAGAS GitHub repository.
Templates for common MLflow and RAGAS metrics are available under Templates. Copy a template (e.g., Relevance, Faithfulness) and tailor the guidelines to your domain.

Define Custom Metrics for Your LLM Use Case

Adding a New Custom Metric
On the Metrics page, click “+ New Metric” to create a custom metric. Configure:
  • Metric Name: Human-readable name
  • Metric Guidelines: Natural language instructions for evaluation
  • Evaluation Type:
    • AI: Uses Metric Guidelines as prompt for an AI model
    • Human: Manual evaluation by subject-matter experts
    • Heuristic (Code): Custom Python or TypeScript logic (see Heuristic Metrics)
  • Output Type: Boolean, Integer (1-5), or Float (0.0-1.0)
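As a rough illustration, the four fields above might come together like this for a custom metric; the dictionary below simply mirrors the form fields and is not a documented API payload:
custom_metric = {
    "name": "Grounded in context",   # Metric Name: human-readable name
    "guidelines": "Score how strictly the response sticks to facts in the provided context.",  # Metric Guidelines
    "evaluation_type": "AI",         # Evaluation Type: AI, Human, or Heuristic (Code)
    "output_type": "Integer",        # Output Type: Boolean, Integer (1-5), or Float (0.0-1.0)
}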
New Custom Metric

Output Types

Boolean Output Type

  • Range: true/false
  • Pass/Fail: Direct pass/fail representation
  • Use cases: Format checks, refusals, policy/guardrails, presence/absence
  • Aggregation: Pass ratios and counts
Example model output for a boolean metric:
{
  "reasoning": "The model correctly refused unsafe content.",
  "binaryScore": true
}

Integer Output Type

  • Range: 1–5 (higher is better)
  • Pass/Fail: Set a passingThreshold (e.g., 4). Passes if intScore ≥ threshold
  • Use cases: Rubric-based quality judgments (helpfulness, factuality, completeness)
  • Aggregation: Means and distributions
Example model output for an integer metric:
{
  "reasoning": "Mostly correct but missing minor details.",
  "intScore": 4
}

Float Output Type

Normalized score between 0.0 and 1.0 for graded, continuous measurement.
  • Range: 0.0–1.0 (higher is better)
  • Pass/Fail: Set a passingThreshold (e.g., 0.90). Passes if floatScore ≥ threshold
  • Use cases: Semantic similarity, confidence/uncertainty, coverage, quality scores
  • Aggregation: Means for trend tracking
Example model output for a float metric:
{
  "reasoning": "The response is mostly correct with minor omissions.",
  "floatScore": 0.87
}
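Across all three output types, pass/fail follows the same rule: a boolean score passes when it is true, and integer or float scores pass when they meet the configured passingThreshold. A minimal sketch of that mapping (the key names binaryScore, intScore, and floatScore come from the examples above; the helper itself is illustrative):
def is_passing(result: dict, passing_threshold=None) -> bool:
    """Map a scored result to pass/fail based on its output type."""
    if "binaryScore" in result:    # Boolean: passes when true
        return result["binaryScore"]
    if "intScore" in result:       # Integer 1-5: passes at or above the threshold (e.g., 4)
        return result["intScore"] >= passing_threshold
    if "floatScore" in result:     # Float 0.0-1.0: passes at or above the threshold (e.g., 0.90)
        return result["floatScore"] >= passing_threshold
    raise ValueError("No recognized score field in result")

# Aggregation example: pass ratio across a set of boolean results
results = [{"binaryScore": True}, {"binaryScore": False}, {"binaryScore": True}]
pass_ratio = sum(is_passing(r) for r in results) / len(results)  # ≈ 0.67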

Automated Scoring

As testsets grow, manual evaluation becomes time-consuming. Subject-matter experts (SMEs) can become bottlenecks as the number of testcases increases. Scorecard’s AI-powered scoring automates evaluation, allowing SMEs to focus on complex edge cases while the system handles well-defined metrics.

AI-Based Metrics

When creating a metric, select “AI” as the evaluation type for automated scoring.
New Metric With AI-Scoring

Configuration Modes

  • Basic Mode: Define metric guidelines using natural language instructions
  • Advanced Mode: Modify the full prompt template for sophisticated evaluation logic
Be very specific with metric guidelines; they form the core instruction for AI scoring. For example, instead of “The response should be helpful,” spell out what to check: “Score 5 if the response fully answers the question using only the provided context; score 1 if it ignores the question or contradicts the context.”

Evaluation Settings

Configure AI evaluation parameters:
  • Model Selection: Choose models like GPT-4o
  • Temperature: Control randomness (typically 0 for consistency)
  • Additional Parameters: Model-specific settings
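For instance, a typical setup for repeatable automated scoring pairs a capable judge model with temperature 0; the dictionary below is only a conceptual summary of those settings, not a Scorecard API call:
evaluation_settings = {
    "model": "gpt-4o",   # judge model; GPT-4o is one of the available options
    "temperature": 0,    # 0 keeps scoring as deterministic as possible across re-runs
}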

Automatic Scoring

When you create a run via the Kickoff Run modal (from the testset page or the Kickoff Run page) or through GitHub Actions, AI and Heuristic metrics are automatically scored—no manual action needed. You can view scoring progress and results in the run details page.
New to Scorecard? See the UI Quickstart to learn how to kick off your first run.
Run Scoring in the Scorecard UI
If needed, you can re-run scoring by selecting records and clicking “Re-run Scoring”.

Score Explanations

Scorecard provides explanations for each score, helping you understand and validate the AI’s evaluation reasoning.
Score Explained by the Scorecard AI Model

Heuristic Metrics (Code-Based Evaluation)

Write custom evaluation logic in Python or TypeScript for deterministic, rule-based scoring. Heuristic metrics are well suited to:
  • Checking for specific keywords or patterns
  • Validating JSON structure or format
  • Computing text similarity scores
  • Measuring response length or complexity
  • Custom business logic checks

Creating a Heuristic Metric

Select Heuristic as the evaluation type to access the code editor.
Screenshot of the create metric modal with Python heuristic code

Writing Heuristic Code

Your code receives the record data and must return a score. Here’s the structure:
def score(inputs: dict, outputs: dict, expected: dict) -> dict:
    """
    Evaluate the output and return a score.
    
    Args:
        inputs: The input fields from the testcase
        outputs: The generated output from your system
        expected: The expected output fields from the testcase
    
    Returns:
        A dict with 'score' (bool, int 1-5, or float 0.0-1.0) 
        and optional 'reasoning'
    """
    # Example: Check if output contains expected keyword
    output_text = outputs.get("response", "")
    expected_keyword = expected.get("keyword", "")
    
    if expected_keyword.lower() in output_text.lower():
        return {"score": True, "reasoning": "Keyword found in response"}
    else:
        return {"score": False, "reasoning": "Keyword not found"}

Secure Sandbox Execution

Code runs in a secure, isolated sandbox ensuring:
  • Safety: No external resource access
  • Consistency: Deterministic results
  • Performance: Optimized execution
The sandbox has limited access to external libraries. Common utilities like string manipulation and JSON parsing are available. Contact support if you need specific libraries for your use case.

Metric Groups

Group related metrics for consistent, repeatable evaluation. Metric Groups let you apply multiple metrics to runs without manual selection each time. Create groups for specific use cases like RAG applications or translation tasks.
Overview of Metric Groups
View and create Metric Groups on the Metrics page under “Metric Groups”. Provide:
  • Name: Human-readable identifier
  • Description: Short summary of the group’s purpose
Defining a New Metric Group
Click “Select Metrics” to choose which metrics to include in the group.