Metrics page overview
Metrics define what “good” looks like for your LLM. You describe the criteria (e.g., helpfulness, groundedness, safety), and Scorecard turns that into repeatable scores you can track across runs and over time.
1. Open Metrics and explore templates

Go to your project’s Metrics page. Start fast by copying a proven template, then tailor the guidelines to your domain.
Metric templates list
2. Create a metric

You can also create a metric from scratch. Provide a name, a description, and clear guidelines, then choose an Evaluation Type and an Output Type.
Guidelines matter. Describe what to reward and what to penalize, and include 1–2 concise examples if helpful. These instructions become the core of the evaluator prompt.
The AI-scored evaluation type uses a model to apply your guidelines consistently and at scale. Pick the evaluator model and keep the temperature low for repeatability.
AI metric detail with model and output type settings
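Conceptually, an AI-scored metric embeds your guidelines as the core instruction of an LLM-judge prompt. The sketch below is illustrative only — `call_model` and the prompt wording are assumptions, not Scorecard's actual evaluator:

```python
# Hypothetical sketch: how metric guidelines can anchor an LLM-judge prompt.
# `call_model` is a stand-in for whatever chat-completion client you use;
# Scorecard's real evaluator prompt is not literally this code.

def build_evaluator_prompt(guidelines: str, model_input: str, model_output: str) -> str:
    """Embed the metric guidelines as the central instruction of the judge prompt."""
    return (
        "You are an evaluator. Score the response against these guidelines:\n"
        f"{guidelines}\n\n"
        f"Input: {model_input}\n"
        f"Response: {model_output}\n"
        "Answer with PASS or FAIL and a one-sentence justification."
    )

prompt = build_evaluator_prompt(
    guidelines="Reward grounded answers; penalize unsupported claims.",
    model_input="What is our refund window?",
    model_output="Refunds are accepted within 30 days of purchase.",
)
# A judge call would then look roughly like (client and signature illustrative):
# verdict = call_model(prompt, temperature=0)  # temperature ~0 for repeatability
```

Pinning temperature near 0 is what makes repeated runs of the same record comparable over time.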
3. Go to the Records page and select records

Navigate to your project’s Records page. Select the records you want to score, then click the Score Records button.
Records page showing selected records and the Score Records button
4. Choose metrics and score

In the Score Records modal, select one or more metrics to evaluate against, then click Score.
Score Records modal with metrics selected
5. View scores in the record panel

Once scoring completes, click any record to open the side panel. View scores, inputs, outputs, and evaluation details.
Record side panel showing metric scores, inputs, and outputs

Metric types

  • AI‑scored: Uses a model to apply your guidelines consistently and at scale.
  • Human‑scored: Great for nuanced judgments or gold‑standard baselines.
  • Heuristic (SDK): Deterministic, code‑based checks via the SDK (e.g., latency, regex, policy flags).
  • Critic Agent (coming soon): An agentic evaluator that reasons over multiple steps with tool use.
Whatever the type, every metric has an Output Type: Boolean (pass/fail) or Integer (1–5).
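A Heuristic metric is just deterministic code over a record. The sketch below shows the kind of Boolean check you might register via the SDK — the function name and thresholds are illustrative, not part of Scorecard's API:

```python
import re

# Illustrative heuristic metric: a deterministic, code-based check combining
# a policy flag (no leaked email addresses) with a latency budget.
# The name and thresholds are hypothetical examples, not SDK identifiers.

def heuristic_check(output: str, latency_ms: float) -> bool:
    """Boolean (pass/fail) heuristic: no emails in the output, latency under budget."""
    email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    no_pii = email_pattern.search(output) is None  # policy flag: no email leaked
    fast_enough = latency_ms <= 2000               # latency budget in milliseconds
    return no_pii and fast_enough

print(heuristic_check("Your order shipped today.", latency_ms=850))  # True
print(heuristic_check("Contact bob@example.com", latency_ms=850))    # False
```

Because the check is pure code, it is fully repeatable and costs nothing per record, which makes heuristics a good complement to AI-scored metrics.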

Second‑party metrics (optional)

If you already use established evaluation libraries, you can mirror those metrics in Scorecard:
  • MLflow genai: Relevance, Answer Relevance, Faithfulness, Answer Correctness, Answer Similarity
  • RAGAS: Faithfulness, Answer Relevancy, Context Recall, Context Precision, Context Relevancy, Answer Semantic Similarity
Copy a matching template, then tailor the guidelines to your product domain.

Best practices for strong metrics

  • Be specific. Minimize ambiguity in guidelines; include “what not to do.”
  • Pick the right output type. Use Boolean for hard requirements; 1–5 for nuance.
  • Keep temperature low. Use ≈0 for repeatable AI scoring.
  • Pilot and tighten. Run on 10–20 cases, then refine wording to reduce false positives.
  • Bundle into groups. Combine complementary checks (e.g., Relevance + Faithfulness + Safety) to keep evaluations consistent.
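To make the bundling idea concrete, here is a small sketch of summarizing a group of complementary Boolean checks for one record. The metric names and the all-must-pass gate are illustrative choices, not Scorecard's exact group semantics:

```python
# Sketch of bundling complementary boolean checks into one group summary.
# Metric names and the "all must pass" rule are illustrative assumptions.

def group_score(results: dict[str, bool]) -> dict:
    """Summarize a bundle of boolean metric results for a single record."""
    passed = sum(results.values())
    return {
        "pass_rate": passed / len(results),      # fraction of checks passed
        "all_passed": passed == len(results),    # hard gate: every check passes
    }

summary = group_score({"relevance": True, "faithfulness": True, "safety": False})
# pass_rate is 2/3; all_passed is False because the safety check failed
```

Tracking both the pass rate and the hard gate keeps nuanced trends visible while still surfacing any single failed requirement.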
Looking for vetted, ready‑to‑use metrics? Explore Best‑in‑Class Metrics and copy templates (including MLflow and RAGAS). You can also create deterministic checks via the SDK using Heuristic metrics.

Next steps

  • Runs: Create and analyze evaluations
  • A/B Comparison: Compare two runs side-by-side
  • Best-in-Class Metrics: Explore curated, proven metrics
  • API Reference: Create metrics via API