
Schema of a Production RAG System

[Figure: Simplified schema of a production RAG system]
Already familiar with the SDK? You can reuse patterns from the SDK Quickstart and Tracing Quickstart. For production monitoring of RAG, see Monitoring Quickstart.
Steps
1
Set up accounts
Create a Scorecard account, then set your API key as an environment variable.
2
Install the SDK (and optionally OpenAI)
Install the Scorecard SDK. Optionally add OpenAI if you want a realistic generator.
3
Create a minimal RAG system
We’ll evaluate a simple function that takes a query and retrievedContext and produces an answer. This mirrors a typical RAG loop where retrieval is done upstream and passed to the generator.
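A minimal Python sketch of such a generator, using the OpenAI chat completions API (the model choice and prompt are illustrative, not part of the Scorecard SDK):

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_query(query: str, retrieved_context: str) -> str:
    """Generate an answer grounded in the retrieved context."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model you have access to works
        messages=[
            {"role": "system",
             "content": "Answer the user's question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{retrieved_context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```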
4
Set up the Scorecard client and a Project
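As a sketch of client and Project setup (the import path, the `Scorecard` class, and the `projects.create` call below are assumptions based on typical SDK shapes; check the SDK reference for the exact API):

```python
import os

# Prerequisite, run in your shell (package names assumed): pip install scorecard-ai openai
from scorecard_ai import Scorecard  # assumed import path

client = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])

# A Project groups your testsets, metrics, and runs.
project = client.projects.create(name="RAG Quickstart")  # assumed method
PROJECT_ID = project.id
```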
5
Create RAG testcases
Each testcase contains the user query, the retrievedContext you expect to be used, and the idealAnswer for judging correctness.
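A sketch of testset and testcase creation; the `testsets.create` and `testcases.create` calls and their payload shapes are assumptions, and the testcase contents are purely illustrative:

```python
# Method names and payload shapes are assumptions; see the SDK reference.
testset = client.testsets.create(  # assumed method
    project_id=PROJECT_ID,
    name="RAG quickstart testset",
)

testcases = [
    {
        "query": "What is the return window for unopened items?",
        "retrievedContext": "Unopened items can be returned within 30 days of delivery.",
        "idealAnswer": "Unopened items can be returned within 30 days of delivery.",
    },
    {
        "query": "Do you ship internationally?",
        "retrievedContext": "We currently ship to the US, Canada, and the EU.",
        "idealAnswer": "Yes, orders ship to the US, Canada, and the EU.",
    },
]

for case in testcases:
    client.testcases.create(testset_id=testset.id, json_data=case)  # assumed method
```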
6
Create AI judge metrics
Define two metrics: one for context‑use (boolean) and one for answer correctness (1–5). These use Jinja placeholders to reference testcase inputs and system outputs.
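A sketch of the two judge metrics; the `metrics.create` call, its fields, and the `{{ inputs.* }}` / `{{ outputs.* }}` placeholder names are assumptions used to show the shape of the Jinja templates:

```python
# Method name, fields, and placeholder names are assumptions; see the SDK reference.
context_use_metric = client.metrics.create(  # assumed method
    project_id=PROJECT_ID,
    name="Context use",
    output_type="boolean",
    prompt=(
        "Context: {{ inputs.retrievedContext }}\n"
        "Answer: {{ outputs.answer }}\n"
        "Did the answer rely only on the provided context? Reply true or false."
    ),
)

correctness_metric = client.metrics.create(  # assumed method
    project_id=PROJECT_ID,
    name="Answer correctness",
    output_type="integer",  # scored on a 1-5 scale
    prompt=(
        "Question: {{ inputs.query }}\n"
        "Ideal answer: {{ inputs.idealAnswer }}\n"
        "Model answer: {{ outputs.answer }}\n"
        "Rate correctness from 1 (wrong) to 5 (fully correct)."
    ),
)
```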
7
Run and evaluate
Use the helper to execute your RAG function across testcases and record scores in Scorecard.
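A sketch assuming a `run_and_evaluate`-style helper; the import path, signature, and `run.url` attribute are assumptions:

```python
# Helper name, import path, and signature are assumptions; see the SDK reference.
from scorecard_ai.lib import run_and_evaluate  # assumed import

run = run_and_evaluate(
    client=client,
    project_id=PROJECT_ID,
    testset_id=testset.id,
    metric_ids=[context_use_metric.id, correctness_metric.id],
    # Map each testcase's inputs to your RAG function and return the output shape
    # the metrics reference (here, an "answer" field).
    system=lambda inputs: {
        "answer": answer_query(inputs["query"], inputs["retrievedContext"])
    },
)
print(f"View results: {run.url}")  # assumed attribute
```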
8
Analyze results
Review the run’s per‑metric stats, per‑record scores, and trends. Use these to iterate on prompts and retrieval parameters, then re‑run.
Retrieval‑only and end‑to‑end tests
Beyond the simple loop above, you can evaluate retrieval quality on its own (precision/recall/F1, MRR, NDCG) and combine it with generation metrics for end‑to‑end scoring.
[Figure: Retrieval-only LLM testing in a RAG system]

[Figure: Different types of testing in a RAG system]
Retrieval metrics in practice
- Precision: Of the retrieved items, how many are relevant?
- Recall: Of the relevant items, how many were retrieved?
- F1: Harmonic mean of precision and recall.
- MRR: Reciprocal rank of the first relevant item, averaged across queries.
- NDCG: Gain discounted by rank, normalized to the ideal ordering.
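For reference, a minimal sketch of these metrics for a single query, given a ranked list of retrieved document IDs and a set of relevant IDs (binary relevance); average the per‑query values across your query set to report MRR and the rest:

```python
import math

def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, F1, reciprocal rank, and NDCG for one query (binary relevance)."""
    hits = [doc for doc in retrieved if doc in relevant]

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Reciprocal rank of the first relevant item (averaged over queries, this is MRR).
    rr = next((1.0 / rank for rank, doc in enumerate(retrieved, 1) if doc in relevant), 0.0)

    # DCG: gain of 1 per relevant item, discounted by log2(rank + 1).
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved, 1) if doc in relevant)
    # IDCG: the ideal ordering places all relevant items first.
    ideal_hits = min(len(relevant), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "rr": rr, "ndcg": ndcg}

# Example: first relevant item at rank 2 -> precision 1/3, recall 1/2, rr 0.5
print(retrieval_metrics(["d3", "d1", "d7"], {"d1", "d2"}))
```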
To build a retrieval evaluation set:
- Define representative queries (covering intents, edge cases, and the long tail).
- For each query, collect relevant documents/chunks; annotate relevance (binary or graded).
- Include plausible hard negatives to stress the retriever.
- Write labeling guidelines; consider inter‑annotator agreement.
- Split into dev/test; iterate on retriever, then re‑score.

To operationalize RAG quality on live traffic, instrument your application with tracing (see the Tracing Quickstart) and enable continuous evaluation (see the Monitoring Quickstart). Scorecard samples spans, extracts prompts and completions, and creates Runs automatically.
What’s next?
- Create richer datasets from production with Trace to Testcase.
- Tune retrieval settings and prompts; compare in Runs & Results.
- If you want expert‑labeled ground truth, contact support@getscorecard.ai.