Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.
When starting out with an LLM application, you begin with a “vibe check”: running a few prompts manually to see if the responses make sense. Then you take it a step further by manually testing a set of your favorite prompts after each LLM iteration. At some point, however, you will wonder how to quantify and consistently test the quality of LLM responses. This is where you need metrics. A metric is a standard of measurement; in LLM evaluation, a metric serves as a benchmark for assessing the quality of LLM responses.

Assess Your LLM Quality With Scorecard’s Metrics

Scorecard’s metric system is organized into three main components:
  • Metrics: Individual evaluation criteria that assess specific aspects of your LLM’s performance
  • Metric Groups: Collections of related metrics that work together to provide comprehensive evaluation
  • Templates: Pre-built metric configurations that you can copy and customize for your use case
Scorecard assists you in defining and organizing these metrics for LLM evaluation in several ways. Let’s check out some of the advantages that Scorecard offers!

Metric Templates

The Scorecard team consists of experts in LLM evaluation, with extensive experience in assessing and deploying large-scale AI applications at some of the world’s leading companies. Our Scorecard Core Metrics, validated by our team, represent industry-standard benchmarks for LLM performance. You can explore these Core Metrics in the Scoring Lab, under “Templates,” to select the ones best suited for your LLM evaluation needs.
Overview of Scorecard Core Metrics in Templates

If you would like to use a specific template, simply make a copy of it and, if needed, edit its details before importing it into your metrics section.
Metric Template to Copy

Second-Party Metrics from MLflow

Scorecard supports running MLflow metrics directly within the Scorecard platform and adds complementary capabilities such as aggregation, A/B comparison, iteration, and more. Several metrics from MLflow’s genai package are available in the Scorecard metrics library. These metrics include:
  1. Relevance
  • Description: Evaluates the appropriateness, significance, and applicability of an LLM’s output in relation to the input and context.
  • Purpose: Assesses how well the model’s response aligns with the expected context or intent of the input.
  2. Answer Relevance
  • Description: Assesses the relevance of a response provided by an LLM, considering the accuracy and applicability of the response to the given query.
  • Purpose: Focuses specifically on the relevance of responses in the context of queries posed to the LLM.
  3. Faithfulness
  • Description: Measures the factual consistency of an LLM’s output with the provided context.
  • Purpose: Ensures that the model’s responses are not only relevant but also accurate and trustworthy, based on the information available in the context.
  4. Answer Correctness
  • Description: Evaluates the accuracy of the provided output based on a given “ground truth” or expected response.
  • Purpose: Crucial for understanding how well an LLM can generate correct and reliable responses to specific queries.
  5. Answer Similarity
  • Description: Assesses the semantic similarity between the model’s output and the provided targets or “ground truth”.
  • Purpose: Used to evaluate how closely the generated response matches the expected response in terms of meaning and content.
You can view the full prompts for these metrics in the Scorecard Scoring Lab or in the MLflow GitHub repository.
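
If you want to sanity-check these judge metrics outside Scorecard, you can also construct them directly with MLflow. The sketch below is a minimal example, assuming a recent MLflow 2.x release and an OpenAI API key for the judge model; the DataFrame column names and the judge model URI are illustrative choices, not requirements.

```python
# Sketch: constructing MLflow's genai judge metrics and scoring a small static
# dataset. Assumes a recent MLflow 2.x release and an OpenAI API key for the
# judge model; column names and the judge URI below are illustrative.
import mlflow
import pandas as pd
from mlflow.metrics.genai import (
    answer_correctness,
    answer_relevance,
    answer_similarity,
    faithfulness,
    relevance,
)

judge = "openai:/gpt-4o"  # any judge model URI supported by MLflow

eval_df = pd.DataFrame(
    {
        "questions": ["What is the capital of France?"],
        "context": ["France is a country in Western Europe. Its capital is Paris."],
        "predictions": ["The capital of France is Paris."],
        "ground_truth": ["Paris"],
    }
)

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    targets="ground_truth",
    extra_metrics=[
        relevance(model=judge),
        answer_relevance(model=judge),
        faithfulness(model=judge),
        answer_correctness(model=judge),
        answer_similarity(model=judge),
    ],
    # Map the judge metrics' expected arguments onto our DataFrame columns.
    evaluator_config={"col_mapping": {"inputs": "questions", "context": "context"}},
)
print(results.metrics)  # per-metric aggregates, e.g. faithfulness/v1/mean
```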

Second-Party Metrics from RAGAS

You can also use metrics from the RAGAS framework, which specializes in evaluating Retrieval-Augmented Generation (RAG) pipelines. As with the MLflow metrics, Scorecard adds complementary capabilities such as aggregation, A/B comparison, iteration, and more.
Component-Wise Evaluation
To assess the performance of individual components within a RAG pipeline, you can leverage metrics such as:
  • Faithfulness: Measures the factual consistency of the generated response against the provided context.
  • Answer Relevancy: Assesses how pertinent the generated response is to the given prompt.
  • Context Recall: Evaluates the extent to which the retrieved context aligns with the annotated response.
  • Context Precision: Determines whether the items in the retrieved contexts that are relevant to the ground truth are ranked near the top of the retrieval results.
  • Context Relevancy: Evaluates how relevant the retrieved context is in addressing the provided query.
End-to-End Evaluation
To assess the end-to-end performance of a RAG pipeline, you can leverage metrics such as:
  • Answer Semantic Similarity: Assesses the semantic resemblance between the generated response and the ground truth.
  • Answer Correctness: Gauges the accuracy of the generated response when compared to the ground truth.
You can view the full prompts for these metrics in the Scorecard Scoring Lab or in the RAGAS GitHub repository.
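
For reference, this is roughly how the same metrics are invoked in RAGAS itself. The sketch assumes the classic ragas 0.1.x API and an OpenAI API key (the metrics are judge-based); newer RAGAS releases rename some metrics and dataset columns, so treat the details as illustrative.

```python
# Sketch: scoring a small RAG sample with RAGAS directly.
# Assumes the classic ragas 0.1.x API and an OpenAI API key for the judge
# model; newer releases rename some metrics and dataset columns.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    context_precision,
    context_recall,
    faithfulness,
)

sample = Dataset.from_dict(
    {
        "question": ["What is the capital of France?"],
        "answer": ["The capital of France is Paris."],
        "contexts": [["France is a country in Western Europe. Its capital is Paris."]],
        "ground_truth": ["Paris"],
    }
)

# Component-wise metrics assess retrieval and generation separately;
# answer_correctness and answer_similarity cover end-to-end quality.
result = evaluate(
    sample,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
        answer_similarity,
    ],
)
print(result)
```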

Define Custom Metrics for Your LLM Use Case

Adding a New Custom Metric in Scoring Lab

Scorecard also offers the flexibility to define any customized metric unique to your LLM application. At the top of the metrics section in the Scoring Lab, click the “+ New Metric” button and fill out the details of your metric. The required information includes:
  • Metric Name: a human-readable name of your metric.
  • Metric Guidelines: natural language instructions to define how a metric should be computed.
  • Evaluation Type: how your metric will be evaluated.
    • AI: takes the Metric Guidelines as a prompt and computes the metric via an AI model.
    • Human: a human subject-matter expert manually evaluates the metric.
    • Heuristic (SDK): currently only supported via the SDK (check out this cookbook; see the sketch below) - UI support is coming soon! 🚀
  • Output Type: the type of value your metric produces.
    • Boolean
    • Integer (1-5)
New Custom Metric
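
To make the Heuristic evaluation type concrete, below is a minimal, framework-agnostic sketch of what such metrics can look like: deterministic functions that map a response to one of the supported output types. The function names and rules are purely illustrative; how you register heuristic metrics is covered by the Scorecard SDK cookbook linked above.

```python
# Framework-agnostic sketch of Heuristic metrics: deterministic functions that
# map a model response to one of the supported output types (Boolean, or an
# integer from 1 to 5). Names and rules here are illustrative only; see the
# Scorecard SDK cookbook for how to register heuristic metrics.

def contains_no_placeholder_text(response: str) -> bool:
    """Boolean heuristic: fail responses that leak template placeholders."""
    forbidden = ("{{", "}}", "TODO", "[INSERT")
    return not any(token in response for token in forbidden)


def conciseness_score(response: str) -> int:
    """Integer (1-5) heuristic: reward concise answers, penalize rambling ones."""
    words = len(response.split())
    if words <= 50:
        return 5
    if words <= 100:
        return 4
    if words <= 200:
        return 3
    if words <= 400:
        return 2
    return 1
```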

Automated Scoring

Increasing Complexity When Manually Evaluating Large Testsets

When introducing a framework to evaluate a Large Language Model (LLM), the process typically begins on a small scale by manually evaluating a few Testcases. These initial Testcases form the basis of the first Testset, and subject-matter experts (SMEs) evaluate the LLM outputs of these Testcases. As the evaluation process iterates, the number of Testcases grows, allowing testing across more dimensions using different metrics. However, as the number of Testcases increases, the time required to evaluate the metrics increases proportionally. Since SMEs have limited resources, the evaluation of large Testsets can quickly become a bottleneck, hindering the progress of the LLM evaluation process.

AI-Supported Scoring With Scorecard

Scorecard solves this problem by delegating evaluation to an AI model. With Scorecard’s AI-powered scoring, you save valuable time and SMEs can focus on complex edge cases instead of well-defined, easy-to-score metrics. This streamlines the LLM evaluation process, enabling teams to iterate and deliver faster.

Define Metrics to Score With AI

When defining a new metric to be used for your LLM evaluation, you can specify the evaluation type. Choose the evaluation type “AI” to have a metric automatically scored by an AI model.
New Metric With AI-Scoring

Basic and Advanced AI Metric Configuration
The AI metric creation flow now offers two modes:
  • Basic Mode: In this mode, you define the metric guidelines using natural language instructions. Make sure to accurately describe how the AI model should score the metric.
  • Advanced Mode: For more control, you can access the advanced mode to modify the full prompt template beyond just the guidelines, allowing for more sophisticated evaluation logic.
When specifying the metric guidelines for an AI-scored metric, you must be very specific, as this forms the core instruction the AI model receives for scoring the metric.
AI Evaluation Settings
In the AI Evaluation Settings section of the metric creation or editing modal, you can configure the model and parameters used for AI-based evaluation. This gives you fine-grained control over how your metrics are evaluated:
  • Model Selection: Choose from available models (such as GPT-4o) to perform the evaluation
  • Temperature: Control the randomness of the AI evaluation (typically set to 0 for consistent scoring)
  • Additional Parameters: Configure other model-specific settings as needed
This flexibility allows you to optimize the evaluation process for different types of metrics and use cases.
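
To make the two modes and the evaluation settings more concrete, here is a rough sketch contrasting Basic Mode guidelines with an Advanced Mode prompt template. The placeholder names and the settings layout are assumptions made for illustration, not Scorecard’s actual template variables or configuration schema.

```python
# Hypothetical illustration of Basic Mode guidelines, an Advanced Mode prompt
# template, and AI Evaluation Settings. Placeholder names ({user_query},
# {context}, {model_response}) and the dict layout are assumptions for the
# example, not Scorecard's actual template variables or schema.

BASIC_GUIDELINES = (
    "Score how factually consistent the response is with the retrieved context. "
    "Penalize any claim that is not supported by the context."
)

ADVANCED_PROMPT_TEMPLATE = """\
You are grading an LLM response for factual consistency.

User query:
{user_query}

Retrieved context:
{context}

Model response:
{model_response}

Give an integer score from 1 (claims contradict the context) to 5 (every claim
is supported by the context). Respond with a JSON object containing "score"
(the integer) and "explanation" (one sentence justifying the score).
"""

AI_EVALUATION_SETTINGS = {
    "model": "gpt-4o",   # model selection, mirroring the UI dropdown
    "temperature": 0,    # 0 keeps the judge's scoring consistent across runs
}
```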

Score Your Testset With a Single Click

After you have run a Testset against your LLM and the model has generated responses for each Testcase, the Scorecard UI displays the status “Awaiting Scoring”.
Run Scoring in the Scorecard UI

Simply click the “Run Scoring” button and Scorecard will use AI to score all metrics that have AI scoring enabled.
Scoring in Progress

Score Explanations

Scorecard not only automates the scoring process, it also explains why the AI model scored each metric the way it did. This helps you understand the reasoning behind the AI scores and judge whether the evaluations make sense.
Score Explained by the Scorecard AI Model

Metric Groups

Once you have defined all the metrics you need to properly assess the quality of your LLM application, you can group them together to form a Metric Group. A Scorecard Metric Group is a collection of metrics that are used together to score an LLM application. A Metric Group can be routinely run against your LLM application to yield a consistent measure of the performance and quality of its responses. Instead of manually selecting the same metrics every time you score an LLM application for a particular use case, defining them once as a Metric Group makes it easy to repeatedly score in the same way. Different Metric Groups can serve different purposes while ensuring consistency in testing, e.g. a Metric Group for RAG applications, a Metric Group for translation use cases, and so on.
Overview of Metric Groups in Scoring Lab

You can get an overview of your existing Metric Groups in the Scoring Lab under “Metric Groups”. In this tab, you can also define a new Metric Group by providing the following information:
  • Name: a human-readable name of your Metric Group.
  • Description: a short description of your Metric Group.
Defining a New Metric Group

After specifying a name and description, click “Select Metrics” and, on the next page, choose one or more metrics to include in the Metric Group.
Selecting Metrics for a New Metric Group
