RAG pairs retrieval with generation so your LLM can answer using fresh, domain‑specific context. This quickstart shows how to evaluate a simple RAG loop using Scorecard’s SDK, then highlights how to extend to retrieval‑only and end‑to‑end tests.
[Diagram: Schema of a Production RAG System]

We’ll simplify to the core pieces you need to test:
[Diagram: Simplified Schema of a Production RAG System]

Already familiar with the SDK? You can reuse patterns from the SDK Quickstart and Tracing Quickstart. For production monitoring of RAG, see Monitoring Quickstart.

Steps

1. Set up accounts

Create a Scorecard account, then set your API key as an environment variable.
export SCORECARD_API_KEY="your_scorecard_api_key"
2. Install the SDK (and optionally OpenAI)

Install the Scorecard SDK. Optionally add OpenAI if you want a realistic generator.
pip install scorecard-ai openai
3. Create a minimal RAG system

We’ll evaluate a simple function that takes a query and retrievedContext and produces an answer. This mirrors a typical RAG loop where retrieval is done upstream and passed to the generator.
from openai import OpenAI

# Uses OPENAI_API_KEY from environment
client = OpenAI()

# Example input shape:
# {"query": "What is RAG?", "retrievedContext": "..."}
def run_system(inputs: dict) -> dict:
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{inputs['retrievedContext']}\n\nQuestion: {inputs['query']}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    return {"answer": resp.choices[0].message.content}
4. Set up the Scorecard client and Project

from scorecard_ai import Scorecard
scorecard = Scorecard()  # API key read from env
PROJECT_ID = "123"  # Replace with your Project ID
5. Create RAG testcases

Each testcase contains the user query, the retrievedContext you expect to be used, and the idealAnswer for judging correctness.
testcases = [
    {
        "inputs": {
            "query": "What does RAG stand for?",
            "retrievedContext": "RAG stands for Retrieval Augmented Generation.",
        },
        "expected": {
            "idealAnswer": "Retrieval Augmented Generation",
        },
    },
    {
        "inputs": {
            "query": "Why use retrieval?",
            "retrievedContext": "Retrieval injects fresh, domain‑specific context into the LLM.",
        },
        "expected": {
            "idealAnswer": "To ground the model on current and domain data.",
        },
    },
]
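
If your queries, contexts, and ideal answers already live in a log export or spreadsheet, a small helper (illustrative only, not part of the SDK) keeps the field names aligned with the judge prompts defined in the next step:

def to_testcase(query: str, context: str, ideal: str) -> dict:
    # Matches the shape above: judges read inputs.query, inputs.retrievedContext,
    # and expected.idealAnswer.
    return {
        "inputs": {"query": query, "retrievedContext": context},
        "expected": {"idealAnswer": ideal},
    }

testcases += [
    to_testcase(q, c, a)
    for q, c, a in [
        ("What is a hard negative?",
         "A hard negative is an irrelevant chunk that superficially resembles a relevant one.",
         "An irrelevant but similar-looking chunk used to stress the retriever."),
    ]
]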
6. Create AI judge metrics

Define two metrics: one for context use (a boolean judging whether the answer stays within the retrieved context) and one for answer correctness (scored 1–5 against the ideal answer). Both use Jinja placeholders to reference testcase inputs and system outputs.
context_use_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Context use",
    description="Does the answer rely only on retrieved context?",
    eval_type="ai",
    output_type="boolean",
    prompt_template="""
      Evaluate if the answer uses only the provided context and does not hallucinate.
      Context: {{ inputs.retrievedContext }}
      Answer: {{ outputs.answer }}

      {{ gradingInstructionsAndExamples }}
    """,
)

correctness_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Answer correctness",
    description="How correct is the answer vs the ideal (1–5)?",
    eval_type="ai",
    output_type="int",
    prompt_template="""
      Compare the answer to the ideal answer and score 1–5 (5 = exact).
      Question: {{ inputs.query }}
      Context: {{ inputs.retrievedContext }}
      Answer: {{ outputs.answer }}
      Ideal: {{ expected.idealAnswer }}

      {{ gradingInstructionsAndExamples }}
    """,
)
7. Run and evaluate

Use the helper to execute your RAG function across testcases and record scores in Scorecard.
from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testcases=testcases,
    metric_ids=[context_use_metric.id, correctness_metric.id],
    system=lambda inputs, _version: run_system(inputs),
)
print(f"Go to {run['url']} to view your scored results.")
8. Analyze results

Review the run’s per‑metric stats, per‑record scores, and trends, then use them to iterate on prompts and retrieval parameters and re‑run.
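
One way to structure that iteration, sketched here under the assumption that each variant only changes the system prompt (the variant names and wording are illustrative), is to create one Run per variant with the same testcases and metrics and compare them side by side in Scorecard:

from openai import OpenAI
from scorecard_ai.lib import run_and_evaluate

oai = OpenAI()

PROMPT_VARIANTS = {
    "strict": "Answer using only the provided context.",
    "cite": "Answer using only the provided context and quote the sentence you relied on.",
}

for name, system_prompt in PROMPT_VARIANTS.items():
    # The default argument pins the current prompt to this variant's function.
    def variant(inputs: dict, _version=None, _prompt: str = system_prompt) -> dict:
        resp = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": _prompt},
                {"role": "user", "content": f"Context:\n{inputs['retrievedContext']}\n\nQuestion: {inputs['query']}"},
            ],
            temperature=0,
        )
        return {"answer": resp.choices[0].message.content}

    variant_run = run_and_evaluate(
        client=scorecard,
        project_id=PROJECT_ID,
        testcases=testcases,
        metric_ids=[context_use_metric.id, correctness_metric.id],
        system=variant,
    )
    print(f"{name}: {variant_run['url']}")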

Retrieval‑only and end‑to‑end tests

Beyond the simple loop above, you may want to evaluate retrieval quality on its own (precision, recall, F1, MRR, NDCG) and combine it with generation for end‑to‑end scoring; a scoring sketch follows the list of metrics below.
[Diagram: Retrieval-Only LLM Testing in a RAG System]

[Diagram: Different Types of Testing in a RAG System]

Retrieval metrics in practice
  • Precision: Of the retrieved items, how many are relevant?
  • Recall: Of the relevant items, how many were retrieved?
  • F1: Harmonic mean of precision and recall.
  • MRR: Average reciprocal rank of the first relevant item.
  • NDCG: Gain discounted by rank, normalized to the ideal ordering.
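As a self-contained sketch (the function name, document IDs, and cutoff k are illustrative; Scorecard is not involved at this layer), these metrics can be computed per query from the ranked result IDs and the set of relevant IDs, then averaged across queries:

import math

def retrieval_metrics(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> dict:
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    # Precision over what was actually returned; recall over what should have been.
    precision = len(hits) / len(top_k) if top_k else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank of the first relevant item (0 if none retrieved).
    rr = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            rr = 1.0 / rank
            break

    # NDCG with binary relevance: DCG of this ranking divided by the ideal DCG.
    dcg = sum(1.0 / math.log2(rank + 1) for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant_ids)
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant_ids), k) + 1))
    ndcg = dcg / ideal_dcg if ideal_dcg else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": rr, "ndcg": ndcg}

# Example: two of the top five retrieved chunks are relevant.
print(retrieval_metrics(["d3", "d7", "d1", "d9", "d2"], {"d1", "d3", "d4"}))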
Ground‑truth dataset checklist
  • Define representative queries (cover intents, edge cases, and long‑tail).
  • For each query, collect relevant documents/chunks; annotate relevance (binary or graded).
  • Include plausible hard negatives to stress the retriever.
  • Write labeling guidelines; consider inter‑annotator agreement.
  • Split into dev/test; iterate on the retriever, then re‑score (a minimal dataset format is sketched below).
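A minimal sketch of what that labeled set might look like, assuming graded relevance on a 0–2 scale and a simple seeded random split (all names and data are illustrative):

import random

# Ground-truth format: query -> {chunk_id: graded relevance (0-2)}.
# Hard negatives get an explicit 0 so the retriever is scored against them too.
GROUND_TRUTH = {
    "What does RAG stand for?": {"chunk-012": 2, "chunk-044": 1, "chunk-090": 0},
    "Why use retrieval?": {"chunk-007": 2, "chunk-051": 0},
}

def dev_test_split(labels: dict, test_fraction: float = 0.2, seed: int = 7):
    queries = sorted(labels)
    random.Random(seed).shuffle(queries)
    cutoff = max(1, int(len(queries) * test_fraction))
    test, dev = queries[:cutoff], queries[cutoff:]
    return {q: labels[q] for q in dev}, {q: labels[q] for q in test}

dev_set, test_set = dev_test_split(GROUND_TRUTH)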
Trigger a run
To operationalize RAG quality on live traffic, instrument traces (Tracing Quickstart) and enable continuous evaluation (Monitoring Quickstart). Scorecard will sample spans, extract prompts/completions, and create Runs automatically.

What’s next?

  • Create richer datasets from production with Trace to Testcase.
  • Tune retrieval settings and prompts; compare in Runs & Results.
  • If you want expert‑labeled ground truth, contact support@getscorecard.ai.
Scorecard works alongside RAG frameworks like LlamaIndex and LangChain.
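
For example, a LlamaIndex query engine can stand in as the system under test. This is a rough sketch, assuming a recent llama-index release (where VectorStoreIndex and Document live in llama_index.core) and an OPENAI_API_KEY in the environment; it reuses the Project, testcases, and metrics created above.

from llama_index.core import Document, VectorStoreIndex
from scorecard_ai.lib import run_and_evaluate

# Build a tiny index over the same illustrative documents.
index = VectorStoreIndex.from_documents([
    Document(text="RAG stands for Retrieval Augmented Generation."),
    Document(text="Retrieval injects fresh, domain-specific context into the LLM."),
])
query_engine = index.as_query_engine()

def rag_framework_system(inputs: dict, _version=None) -> dict:
    # LlamaIndex handles retrieval and generation in a single call here.
    response = query_engine.query(inputs["query"])
    return {"answer": str(response)}

framework_run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testcases=testcases,
    metric_ids=[context_use_metric.id, correctness_metric.id],
    system=rag_framework_system,
)
print(f"Go to {framework_run['url']} to compare against the hand-rolled loop.")

Note that the context-use judge still grades against each testcase's retrievedContext rather than whatever the framework actually retrieved, so treat that score as approximate in this setup.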