Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.
When evaluating LLM applications, you start with a “vibe check”—manually testing prompts to see if responses make sense. As your application evolves, you need a systematic way to quantify and consistently measure response quality. Metrics provide this standard of measurement for LLM evaluation.

Assess Your LLM Quality With Scorecard’s Metrics

Scorecard’s metric system is organized into three main components:
  • Metrics: Individual evaluation criteria that assess specific aspects of your LLM’s performance
  • Metric Groups: Collections of related metrics for comprehensive evaluation
  • Templates: Pre-built metric configurations you can copy and customize
Scorecard helps you define and organize metrics for LLM evaluation in several ways.

Metric Templates

Our Scorecard Core Metrics represent industry-standard benchmarks for LLM performance, validated by our team of evaluation experts. Explore these templates on the Metrics page under “Templates”.
Overview of Scorecard Core Metrics in Templates
Copy any template and customize it for your specific use case.
Metric Template to Copy

MLflow-Inspired Metrics

Scorecard provides support for MLflow-style metrics with additional capabilities like aggregation, A/B comparison, and iteration. Available metrics include:
  • Relevance: Evaluates how well the response aligns with the input and context
  • Answer Relevance: Assesses relevance and applicability to the specific query
  • Faithfulness: Measures factual consistency with provided context
  • Answer Correctness: Evaluates accuracy against ground truth
  • Answer Similarity: Assesses semantic similarity to expected responses
View full prompts on the Metrics page or in the MLflow GitHub repository.

RAGAS-Inspired Metrics for RAG Pipelines

Scorecard provides RAGAS-style metrics for evaluating Retrieval Augmented Generation (RAG) systems.
Component-Wise:
  • Faithfulness: Factual consistency with context
  • Answer Relevancy: Pertinence to the prompt
  • Context Recall: Retrieved context alignment with expected response
  • Context Precision: Ranking of ground-truth relevant items
  • Context Relevancy: Relevance of retrieved context to query
End-to-End:
  • Answer Semantic Similarity: Semantic resemblance to ground truth
  • Answer Correctness: Accuracy compared to ground truth
View full prompts on the Metrics page or in the RAGAS GitHub repository.
Templates for common MLflow and RAGAS metrics are available under Templates. Copy a template (e.g., Relevance, Faithfulness) and tailor the guidelines to your domain.

Define Custom Metrics for Your LLM Use Case

Adding a New Custom Metric
On the Metrics page, click “+ New Metric” to create a custom metric. Configure:
  • Metric Name: Human-readable name
  • Metric Guidelines: Natural language instructions for evaluation
  • Evaluation Type:
    • AI: Uses Metric Guidelines as prompt for an AI model
    • Human: Manual evaluation by subject-matter experts
    • Heuristic (Code): Custom Python or TypeScript logic (see Heuristic Metrics)
  • Output Type: Boolean, Integer (1-5), or Float (0.0-1.0)
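As a rough illustration, the four fields above might come together like this for a custom metric; the dictionary below simply mirrors the form fields and is not a documented API payload:
custom_metric = {
    "name": "Grounded in context",   # Metric Name: human-readable name
    "guidelines": "Score how strictly the response sticks to facts in the provided context.",  # Metric Guidelines
    "evaluation_type": "AI",         # Evaluation Type: AI, Human, or Heuristic (Code)
    "output_type": "Integer",        # Output Type: Boolean, Integer (1-5), or Float (0.0-1.0)
}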
New Custom Metric

Output Types

Boolean Output Type

  • Range: true/false
  • Pass/Fail: Direct pass/fail representation
  • Use cases: Format checks, refusals, policy/guardrails, presence/absence
  • Aggregation: Pass ratios and counts
Example model output for a boolean metric:
{
  "reasoning": "The model correctly refused unsafe content.",
  "binaryScore": true
}

Integer Output Type

  • Range: 1–5 (higher is better)
  • Pass/Fail: Set a passingThreshold (e.g., 4). Passes if intScore ≥ threshold
  • Use cases: Rubric-based quality judgments (helpfulness, factuality, completeness)
  • Aggregation: Means and distributions
Example model output for an integer metric:
{
  "reasoning": "Mostly correct but missing minor details.",
  "intScore": 4
}

Float Output Type

Normalized score between 0.0 and 1.0 for graded, continuous measurement.
  • Range: 0.0–1.0 (higher is better)
  • Pass/Fail: Set a passingThreshold (e.g., 0.90). Passes if floatScore ≥ threshold
  • Use cases: Semantic similarity, confidence/uncertainty, coverage, quality scores
  • Aggregation: Means for trend tracking
Example model output for a float metric:
{
  "reasoning": "The response is mostly correct with minor omissions.",
  "floatScore": 0.87
}
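Across all three output types, pass/fail follows the same rule: a boolean score passes when it is true, and integer or float scores pass when they meet the configured passingThreshold. A minimal sketch of that mapping (the key names binaryScore, intScore, and floatScore come from the examples above; the helper itself is illustrative):
def is_passing(result: dict, passing_threshold=None) -> bool:
    """Map a scored result to pass/fail based on its output type."""
    if "binaryScore" in result:    # Boolean: passes when true
        return result["binaryScore"]
    if "intScore" in result:       # Integer 1-5: passes at or above the threshold (e.g., 4)
        return result["intScore"] >= passing_threshold
    if "floatScore" in result:     # Float 0.0-1.0: passes at or above the threshold (e.g., 0.90)
        return result["floatScore"] >= passing_threshold
    raise ValueError("No recognized score field in result")

# Aggregation example: pass ratio across a set of boolean results
results = [{"binaryScore": True}, {"binaryScore": False}, {"binaryScore": True}]
pass_ratio = sum(is_passing(r) for r in results) / len(results)  # ≈ 0.67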

Automated Scoring

As testsets grow, manual evaluation becomes time-consuming. Subject-matter experts (SMEs) can become bottlenecks as the number of testcases increases. Scorecard’s AI-powered scoring automates evaluation, allowing SMEs to focus on complex edge cases while the system handles well-defined metrics.

AI-Based Metrics

When creating a metric, select “AI” as the evaluation type for automated scoring.
New Metric With AI-Scoring

Configuration Modes

  • Basic Mode: Define metric guidelines using natural language instructions
  • Advanced Mode: Modify the full prompt template for sophisticated evaluation logic
Be very specific with metric guidelines; they form the core instruction for AI scoring. For example, instead of “The response should be helpful,” spell out what to check: “Score 5 if the response fully answers the question using only the provided context; score 1 if it ignores the question or contradicts the context.”

Evaluation Settings

Configure AI evaluation parameters:
  • Model Selection: Choose models like GPT-4o
  • Temperature: Control randomness (typically 0 for consistency)
  • Additional Parameters: Model-specific settings
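For instance, a typical setup for repeatable automated scoring pairs a capable judge model with temperature 0; the dictionary below is only a conceptual summary of those settings, not a Scorecard API call:
evaluation_settings = {
    "model": "gpt-4o",   # judge model; GPT-4o is one of the available options
    "temperature": 0,    # 0 keeps scoring as deterministic as possible across re-runs
}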

Automatic Scoring

When you create a run via the Kickoff Run modal (from the testset page or the Kickoff Run page) or through GitHub Actions, AI and Heuristic metrics are automatically scored—no manual action needed. You can view scoring progress and results in the run details page.
New to Scorecard? See the UI Quickstart to learn how to kick off your first run.
Run Scoring in the Scorecard UI
If needed, you can re-run scoring by selecting records and clicking “Re-run Scoring”.

Score Explanations

Scorecard provides explanations for each score, helping you understand and validate the AI’s evaluation reasoning.
Score Explained by the Scorecard AI Model

Heuristic Metrics (Code-Based Evaluation)

Write custom evaluation logic in Python or TypeScript for deterministic, rule-based scoring. Heuristic metrics are well suited to:
  • Checking for specific keywords or patterns
  • Validating JSON structure or format
  • Computing text similarity scores
  • Measuring response length or complexity
  • Custom business logic checks

Creating a Heuristic Metric

Select Heuristic as the evaluation type to access the code editor.
Screenshot of the create metric modal with Python heuristic code

Writing Heuristic Code

Your code receives the record data and must return a score. Here’s the structure:
def score(inputs: dict, outputs: dict, expected: dict) -> dict:
    """
    Evaluate the output and return a score.
    
    Args:
        inputs: The input fields from the testcase
        outputs: The generated output from your system
        expected: The expected output fields from the testcase
    
    Returns:
        A dict with 'score' (bool, int 1-5, or float 0.0-1.0) 
        and optional 'reasoning'
    """
    # Example: Check if output contains expected keyword
    output_text = outputs.get("response", "")
    expected_keyword = expected.get("keyword", "")
    
    if expected_keyword.lower() in output_text.lower():
        return {"score": True, "reasoning": "Keyword found in response"}
    else:
        return {"score": False, "reasoning": "Keyword not found"}

Secure Sandbox Execution

Code runs in a secure, isolated sandbox ensuring:
  • Safety: No external resource access
  • Consistency: Deterministic results
  • Performance: Optimized execution
The sandbox has limited access to external libraries. Common utilities like string manipulation and JSON parsing are available. Contact support if you need specific libraries for your use case.

Metric Groups

Group related metrics for consistent, repeatable evaluation. Metric Groups let you apply multiple metrics to runs without manual selection each time. Create groups for specific use cases like RAG applications or translation tasks.
Overview of Metric Groups
View and create Metric Groups on the Metrics page under “Metric Groups”. Provide:
  • Name: Human-readable identifier
  • Description: Short summary of the group’s purpose
Defining a New Metric Group
Click “Select Metrics” to choose which metrics to include in the group.