# RAG Quickstart

> Evaluate a Retrieval Augmented Generation (RAG) agent with Scorecard in minutes.

RAG pairs retrieval with generation so your LLM can answer using fresh, domain‑specific context. This quickstart shows how to evaluate a simple RAG loop with Scorecard’s SDK, then outlines how to extend the evaluation to retrieval‑only and end‑to‑end tests.

<Frame caption="Schema of a Production RAG System">
  <img src="https://mintcdn.com/scorecard-d65b5e8a/W_qF4JuImCEvs-ha/images/rag/1.png?fit=max&auto=format&n=W_qF4JuImCEvs-ha&q=85&s=72fce373459a9e1b714061c2eafd66a5" alt="Schema of a Production RAG System" width="1710" height="682" data-path="images/rag/1.png" />
</Frame>

We’ll simplify to the core pieces you need to test:

<Frame caption="Simplified Schema of a Production RAG System">
  <img src="https://mintcdn.com/scorecard-d65b5e8a/W_qF4JuImCEvs-ha/images/rag/2.png?fit=max&auto=format&n=W_qF4JuImCEvs-ha&q=85&s=5e223da171eaf0099112fb8e7ab473d0" alt="Simplified Schema of a Production RAG System" width="1810" height="294" data-path="images/rag/2.png" />
</Frame>

<Info>
  Already familiar with the SDK? You can reuse patterns from the [SDK Quickstart](/intro/quickstart) and [Tracing Quickstart](/intro/tracing-quickstart).
</Info>

## Steps

<Steps>
  <Step title="Set up accounts">
    Create a [Scorecard account](https://app.scorecard.io/dashboard), then set your API key as an environment variable. If you use the OpenAI generator from the later steps, set `OPENAI_API_KEY` the same way.

    <CodeGroup>
      ```sh theme={null}
      export SCORECARD_API_KEY="your_scorecard_api_key"
      ```
    </CodeGroup>
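
Before running the later steps, it can help to confirm that the keys are visible to your process. The sketch below is a minimal check in plain Python. `missing_env_vars` is a hypothetical helper (not part of the Scorecard SDK), and `OPENAI_API_KEY` is only needed if you use the OpenAI generator shown later.

```python
import os

# Names this quickstart reads from the environment.
REQUIRED_VARS = ["SCORECARD_API_KEY", "OPENAI_API_KEY"]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```
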
  </Step>

  <Step title="Install the SDK (and optionally OpenAI)">
    Install the Scorecard SDK. The commands below also install OpenAI so you get a realistic generator; drop `openai` if you plan to use a different model provider.

    <CodeGroup>
      ```sh Python (pip) theme={null}
      pip install scorecard-ai openai
      ```

      ```sh JavaScript (npm) theme={null}
      npm install scorecard-ai openai
      ```
    </CodeGroup>
  </Step>

  <Step title="Create a minimal RAG system">
    We’ll evaluate a simple function that takes a `query` and a `retrievedContext` and produces an `answer`. This mirrors a typical RAG loop where retrieval happens upstream and its result is passed to the generator.

    <Tabs>
      <Tab title="Python">
        <CodeGroup>
          ```py Python wrap theme={null}
          from openai import OpenAI

          # Uses OPENAI_API_KEY from environment
          openai = OpenAI()

          # Example input shape:
          # {"query": "What is RAG?", "retrievedContext": "..."}
          def run_system(inputs: dict) -> dict:
              messages = [
                  {"role": "system", "content": "Answer using only the provided context."},
                  {"role": "user", "content": f"Context:\n{inputs['retrievedContext']}\n\nQuestion: {inputs['query']}"},
              ]
              resp = openai.chat.completions.create(
                  model="gpt-4o-mini",
                  messages=messages,
                  temperature=0,
              )
              return {"answer": resp.choices[0].message.content}
          ```
        </CodeGroup>
      </Tab>

      <Tab title="JavaScript">
        <CodeGroup>
          ```js JavaScript wrap theme={null}
          import OpenAI from 'openai';

          // Uses OPENAI_API_KEY from environment
          const openai = new OpenAI();

          // Example input shape:
          // { query: "What is RAG?", retrievedContext: "..." }
          export async function runSystem(inputs) {
            const messages = [
              { role: 'system', content: 'Answer using only the provided context.' },
              { role: 'user', content: `Context:\n${inputs.retrievedContext}\n\nQuestion: ${inputs.query}` },
            ];
            const resp = await openai.chat.completions.create({
              model: 'gpt-4o-mini',
              messages,
              temperature: 0,
            });
            return { answer: resp.choices[0].message.content };
          }
          ```
        </CodeGroup>
      </Tab>
    </Tabs>
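
The function above assumes retrieval has already happened. To try the whole loop without a vector store, here is a minimal sketch of an upstream retriever that fills in `retrievedContext`. Everything in it is an illustrative assumption: `DOCUMENTS`, `tokenize`, `retrieve`, and `build_inputs` are hypothetical names, and token overlap stands in for whatever embedding or search index you actually use.

```python
import re

# Hypothetical in-memory corpus; a real system would query a vector store or search index.
DOCUMENTS = [
    "RAG stands for Retrieval Augmented Generation.",
    "Retrieval injects fresh, domain-specific context into the LLM.",
    "Hard negatives are plausible but irrelevant documents.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Rank documents by the number of tokens shared with the query; return the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_inputs(query: str, documents: list, k: int = 1) -> dict:
    """Assemble the {"query", "retrievedContext"} dict that the RAG function expects."""
    context = "\n".join(retrieve(query, documents, k))
    return {"query": query, "retrievedContext": context}
```

With this in place, `run_system(build_inputs("What does RAG stand for?", DOCUMENTS))` exercises retrieval and generation together.
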
  </Step>

  <Step title="Set up the Scorecard client and Project">
    <CodeGroup>
      ```py Python theme={null}
      from scorecard_ai import Scorecard
      scorecard = Scorecard()  # API key read from env
      PROJECT_ID = "123"  # Replace with your Project ID
      ```

      ```js JavaScript theme={null}
      import Scorecard from 'scorecard-ai';
      const scorecard = new Scorecard(); // API key read from env
      const PROJECT_ID = '123'; // Replace with your Project ID
      ```
    </CodeGroup>
  </Step>

  <Step title="Create RAG testcases">
    Each testcase contains the user `query`, the `retrievedContext` you expect to be used, and the `idealAnswer` for judging correctness.

    <CodeGroup>
      ```py Python theme={null}
      testcases = [
          {
              "inputs": {
                  "query": "What does RAG stand for?",
                  "retrievedContext": "RAG stands for Retrieval Augmented Generation.",
              },
              "expected": {
                  "idealAnswer": "Retrieval Augmented Generation",
              },
          },
          {
              "inputs": {
                  "query": "Why use retrieval?",
                  "retrievedContext": "Retrieval injects fresh, domain‑specific context into the LLM.",
              },
              "expected": {
                  "idealAnswer": "To ground the model on current and domain data.",
              },
          },
      ]
      ```

      ```js JavaScript theme={null}
      const testcases = [
        {
          inputs: {
            query: 'What does RAG stand for?',
            retrievedContext: 'RAG stands for Retrieval Augmented Generation.',
          },
          expected: {
            idealAnswer: 'Retrieval Augmented Generation',
          },
        },
        {
          inputs: {
            query: 'Why use retrieval?',
            retrievedContext: 'Retrieval injects fresh, domain‑specific context into the LLM.',
          },
          expected: {
            idealAnswer: 'To ground the model on current and domain data.',
          },
        },
      ];
      ```
    </CodeGroup>
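
Because the metrics in the next step reference these fields by name, a missing or misspelled key is easy to overlook. The sketch below is a small local check for the shape used in this quickstart. `REQUIRED_FIELDS` and `find_testcase_problems` are hypothetical names, not part of the Scorecard SDK.

```python
# Fields this quickstart expects in every testcase, grouped by section.
REQUIRED_FIELDS = {
    "inputs": ("query", "retrievedContext"),
    "expected": ("idealAnswer",),
}

def find_testcase_problems(testcases):
    """Return one message per required field that is missing or blank."""
    problems = []
    for index, testcase in enumerate(testcases):
        for section, fields in REQUIRED_FIELDS.items():
            values = testcase.get(section, {})
            for field in fields:
                value = values.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"testcase {index}: missing {section}.{field}")
    return problems
```

An empty list means every testcase has the three fields; otherwise each message names the testcase index and the field to fix.
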
  </Step>

  <Step title="Create AI judge metrics">
    Define two metrics: one for context‑use (boolean) and one for answer correctness (1–5). Both prompt templates use Jinja placeholders to reference testcase inputs, expected values, and system outputs.

    <CodeGroup>
      ```py Python wrap theme={null}
      context_use_metric = scorecard.metrics.create(
          project_id=PROJECT_ID,
          name="Context use",
          description="Does the answer rely only on retrieved context?",
          eval_type="ai",
          output_type="boolean",
          prompt_template="""
            Evaluate if the answer uses only the provided context and does not hallucinate.
            Context: {{ inputs.retrievedContext }}
            Answer: {{ outputs.answer }}

            {{ gradingInstructionsAndExamples }}
          """,
      )

      correctness_metric = scorecard.metrics.create(
          project_id=PROJECT_ID,
          name="Answer correctness",
          description="How correct is the answer vs the ideal (1–5)?",
          eval_type="ai",
          output_type="int",
          prompt_template="""
            Compare the answer to the ideal answer and score 1–5 (5 = exact).
            Question: {{ inputs.query }}
            Context: {{ inputs.retrievedContext }}
            Answer: {{ outputs.answer }}
            Ideal: {{ expected.idealAnswer }}

            {{ gradingInstructionsAndExamples }}
          """,
      )
      ```

      ```js JavaScript wrap theme={null}
      const contextUseMetric = await scorecard.metrics.create({
        projectId: PROJECT_ID,
        name: 'Context use',
        description: 'Does the answer rely only on retrieved context?',
        evalType: 'ai',
        outputType: 'boolean',
        promptTemplate:
          'Evaluate if the answer uses only the provided context and does not hallucinate.\n' +
          'Context: {{ inputs.retrievedContext }}\n' +
          'Answer: {{ outputs.answer }}\n\n' +
          '{{ gradingInstructionsAndExamples }}',
      });

      const correctnessMetric = await scorecard.metrics.create({
        projectId: PROJECT_ID,
        name: 'Answer correctness',
        description: 'How correct is the answer vs the ideal (1–5)?',
        evalType: 'ai',
        outputType: 'int',
        promptTemplate:
          'Compare the answer to the ideal answer and score 1–5 (5 = exact).\n' +
          'Question: {{ inputs.query }}\n' +
          'Context: {{ inputs.retrievedContext }}\n' +
          'Answer: {{ outputs.answer }}\n' +
          'Ideal: {{ expected.idealAnswer }}\n\n' +
          '{{ gradingInstructionsAndExamples }}',
      });
      ```
    </CodeGroup>
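
To see how the placeholders line up with your data before creating a metric, you can preview a template locally. The sketch below is only an approximation: it substitutes plain `{{ dotted.path }}` placeholders with a regular expression and is not a Jinja engine, nor is it how Scorecard renders prompts. `render_preview` and `lookup` are hypothetical helpers. Placeholders that cannot be resolved from the sample record, such as `{{ gradingInstructionsAndExamples }}`, are left untouched.

```python
import re

# Matches {{ name }} or {{ dotted.path }} with optional surrounding spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def lookup(path, record):
    """Follow a dotted path such as 'inputs.query' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value

def render_preview(template, record):
    """Fill in placeholders that resolve in `record`; leave the rest as written."""
    def substitute(match):
        try:
            return str(lookup(match.group(1), record))
        except (KeyError, TypeError):
            return match.group(0)
    return PLACEHOLDER.sub(substitute, template)

sample_record = {
    "inputs": {"query": "What does RAG stand for?"},
    "outputs": {"answer": "Retrieval Augmented Generation"},
}
preview = render_preview(
    "Question: {{ inputs.query }}\nAnswer: {{ outputs.answer }}\nIdeal: {{ expected.idealAnswer }}",
    sample_record,
)
print(preview)
```

In the printed preview, the `Ideal:` line still shows `{{ expected.idealAnswer }}` because the sample record has no `expected` section, which is the kind of mismatch this check is meant to surface.
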
  </Step>

  <Step title="Run and evaluate">
    Use the `run_and_evaluate` helper (`runAndEvaluate` in JavaScript) to execute your RAG function across the testcases and record scores in Scorecard.

    <CodeGroup>
      ```py Python wrap theme={null}
      from scorecard_ai.lib import run_and_evaluate

      run = run_and_evaluate(
          client=scorecard,
          project_id=PROJECT_ID,
          testcases=testcases,
          metric_ids=[context_use_metric.id, correctness_metric.id],
          system=lambda inputs, _version: run_system(inputs),
      )
      print(f"Go to {run['url']} to view your scored results.")
      ```

      ```js JavaScript wrap theme={null}
      import { runAndEvaluate } from 'scorecard-ai';

      const run = await runAndEvaluate(scorecard, {
        projectId: PROJECT_ID,
        testcases,
        metricIds: [contextUseMetric.id, correctnessMetric.id],
        system: runSystem,
      });
      console.log(`Go to ${run.url} to view your scored results.`);
      ```
    </CodeGroup>
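
If you want to smoke‑test your system function before sending anything to Scorecard, a plain local loop is enough. The sketch below is not what `run_and_evaluate` does internally; it only calls the function on each testcase and checks that an `answer` field comes back, since the metric templates reference `outputs.answer`. `dry_run` and `echo_system` are hypothetical names, and `echo_system` is a stub that stands in for the OpenAI‑backed generator.

```python
def dry_run(system, testcases):
    """Call `system` on each testcase's inputs and collect the results locally."""
    records = []
    for testcase in testcases:
        outputs = system(testcase["inputs"])
        if "answer" not in outputs:
            raise KeyError("system output must include an 'answer' field")
        records.append(
            {
                "inputs": testcase["inputs"],
                "expected": testcase.get("expected", {}),
                "outputs": outputs,
            }
        )
    return records

def echo_system(inputs):
    """Stub generator: returns the retrieved context verbatim as the answer."""
    return {"answer": inputs["retrievedContext"]}

sample_testcases = [
    {
        "inputs": {
            "query": "What does RAG stand for?",
            "retrievedContext": "RAG stands for Retrieval Augmented Generation.",
        },
        "expected": {"idealAnswer": "Retrieval Augmented Generation"},
    },
]

for record in dry_run(echo_system, sample_testcases):
    print(record["inputs"]["query"], "->", record["outputs"]["answer"])
```

Swap `echo_system` for your real function once the loop runs cleanly.
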
  </Step>

  <Step title="Analyze results">
    Review the run’s per‑metric stats, per‑record scores, and trends. Use them to iterate on prompts and retrieval parameters, then re‑run.

    <DarkLightImage lightSrc="/images/quickstart-run-light.png" caption="Viewing results in the Scorecard UI." alt="Screenshot of viewing results in the Scorecard UI." />
  </Step>
</Steps>

## Retrieval‑only and end‑to‑end tests

Beyond the simple loop above, you can evaluate retrieval quality on its own (precision/recall/F1, MRR, NDCG) and combine those scores with generation metrics for end‑to‑end scoring.

<Frame caption="Retrieval-Only LLM Testing in a RAG System">
  <img src="https://mintcdn.com/scorecard-d65b5e8a/W_qF4JuImCEvs-ha/images/rag/3.png?fit=max&auto=format&n=W_qF4JuImCEvs-ha&q=85&s=cd4b2b99bfa8e8fd369febdfae060228" alt="Retrieval-Only LLM Testing in a RAG System" width="1812" height="536" data-path="images/rag/3.png" />
</Frame>

<Frame caption="Different Types of Testing in a RAG System">
  <img src="https://mintcdn.com/scorecard-d65b5e8a/W_qF4JuImCEvs-ha/images/rag/4.png?fit=max&auto=format&n=W_qF4JuImCEvs-ha&q=85&s=d0c1aaedfcb7a698c227d17ed70df537" alt="Different Types of Testing in a RAG System" width="1460" height="440" data-path="images/rag/4.png" />
</Frame>

<Tip>
  **Retrieval metrics in practice**

  * **Precision**: Of the retrieved items, how many are relevant?
  * **Recall**: Of the relevant items, how many were retrieved?
  * **F1**: Harmonic mean of precision and recall.
  * **MRR**: Mean, across queries, of the reciprocal rank of the first relevant item.
  * **NDCG**: Cumulative relevance gain discounted by rank, normalized by the gain of the ideal ordering.

  **Ground‑truth dataset checklist**

  * Define representative queries (cover intents, edge cases, and long‑tail).
  * For each query, collect relevant documents/chunks; annotate relevance (binary or graded).
  * Include plausible hard negatives to stress the retriever.
  * Write labeling guidelines; consider inter‑annotator agreement.
  * Split into dev/test; iterate on the retriever, then re‑score.
</Tip>
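
The retrieval metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Scorecard's implementation: documents are identified by string IDs, relevance labels come from your ground‑truth dataset, and NDCG uses linear gains with a log2 rank discount (some libraries use exponential gains instead). All function names are hypothetical.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg(retrieved, gains, k=None):
    """NDCG with linear gains; `gains` maps doc ID to a graded relevance label."""
    cutoff = len(retrieved) if k is None else k
    ranked = retrieved[:cutoff]
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked, start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:cutoff]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Example: three retrieved chunks, two of the labeled chunks are relevant.
p, r, f1 = precision_recall_f1(["d3", "d1", "d7"], {"d1", "d2"})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Per‑query values like these can be averaged over the dev or test split, which gives you a retrieval score to track separately from the generation metrics.
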

<img src="https://mintcdn.com/scorecard-d65b5e8a/ACSkl-xBQxg-5vWT/images/trigger-run-dark.png?fit=max&auto=format&n=ACSkl-xBQxg-5vWT&q=85&s=e639c5f97a63bf42e46cd3109cd10934" alt="Trigger a run" width="2672" height="1734" data-path="images/trigger-run-dark.png" />

<Note>
  To operationalize RAG quality on live traffic, instrument traces ([Tracing Quickstart](/intro/tracing-quickstart)). Scorecard will sample spans, extract prompts/completions, and create Runs automatically.
</Note>

Scorecard works alongside RAG frameworks like [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/).
