In this quickstart we will:

  • Get an API key and create a Testset
  • Create an example LLM app
  • Run the Testset against our app using the Scorecard SDK
  • Review results in the Scorecard UI

You can use this Observable notebook to run this tutorial client-side.

Steps

1

Setup

First, let’s create a Scorecard account and find your Scorecard API Key. Then we’ll get an OpenAI API Key, set both as environment variables, and install the Scorecard and OpenAI Node libraries (plus dotenv, which the scripts below use to load the keys):

Node install
export SCORECARD_API_KEY="SCORECARD_API_KEY"
export OPENAI_API_KEY="OPENAI_API_KEY"
npm install scorecard-ai
npm install openai
npm install dotenv
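Because the scripts in the following steps call require("dotenv").config(), you can alternatively keep the keys in a .env file in your project directory instead of exporting them in your shell (dotenv loads .env by default):

.env
SCORECARD_API_KEY="SCORECARD_API_KEY"
OPENAI_API_KEY="OPENAI_API_KEY"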
2

Create Testcases

Now let’s create and run a create_testset.js script to create a Testset in a specific project and add some Testcases using the SDK. Testcases are a way to collect examples that you can run evaluations against and improve over time. After we create a Testset, we’ll grab that Testset ID to use later:

const { ScorecardClient } = require("scorecard-ai");
require("dotenv").config();

const scorecard = new ScorecardClient({ apiKey: process.env.SCORECARD_API_KEY });

async function createTestset() {
  const testset = await scorecard.testset.create({
    name: "MMLU Demo",
    description: "Demo of a MMLU testset created via Scorecard Node SDK",
    projectId: 1, // Use your actual project ID here
  });

  // Add three Testcases
  await scorecard.testcase.create(testset.id, {
    userQuery:
      "The amount of access cabinet secretaries have to the president is most likely to be controlled by the",
  });
  await scorecard.testcase.create(testset.id, {
    userQuery: "The exclusionary rule was established to",
  });
  await scorecard.testcase.create(testset.id, {
    userQuery:
      "Ruled unconstitutional in 1983, the legislative veto had allowed",
  });

  console.log("Visit the Scorecard UI to view your Testset:");
  console.log(
    `https://app.getscorecard.ai/projects/${testset.projectId}/testsets/${testset.id}`
  );

  return testset;
}

// Create the Testset and Testcases when the script is executed
createTestset().catch(console.error);
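Run the script with Node. It prints a link to the new Testset; the Testset ID is the last path segment of that URL, and we’ll need it again in step 5:

node create_testset.js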
3

Create Test System

Next, let’s create a function that represents our system under test. Here, the system is a simple assistant that responds helpfully to the input user query.

const { OpenAI } = require("openai");
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function answerQuery(userPrompt) {
  const chatCompletion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: userPrompt },
    ],
  });

  return chatCompletion.choices[0].message.content;
}
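To sanity-check the system before wiring it into Scorecard, you can call the function directly with one of the Testcase queries from step 2:

answerQuery("The exclusionary rule was established to").then((answer) =>
  console.log(answer)
);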
4

Create Metrics

Now that we have a system that answers questions from the MMLU dataset, let’s build a metric to understand how relevant the system’s responses are to the user query. Go to the Metrics page and select “New Metric”:

Scorecard UI: New Metric

From here, let’s create a metric for answer relevancy:

Scorecard UI: Metric Definition

You can evaluate your LLM system with one or more metrics. A good practice is to routinely test the LLM system with the same metrics for a specific use case. For this, Scorecard lets you define Scoring Configs: collections of metrics used to evaluate an LLM use case consistently. For this quickstart, we’ll create a Scoring Config containing only the Answer Relevancy metric created above. Head over to the “Scoring Config” tab in the Scoring Lab, create the Scoring Config, and grab its ID for later:

Scorecard UI: Scoring Config
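A convenient way to keep track of the IDs collected so far is to define them once at the top of the script we’ll write in the next step. The values below are placeholders; substitute your own project, Testset, and Scoring Config IDs:

// Placeholder IDs: replace with your own values from the Scorecard UI
const projectId = 1; // the project used in step 2
const testsetId = 123; // printed by create_testset.js
const scoringConfigId = 456; // from the Scoring Config page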

5

Execute Test System

Now let’s run our Testset against the system above, using the Testset ID from step 2, the Scoring Config ID from step 4, and your project ID:

async function runTests() {
  const run = await scorecard.run.create({
    testsetId, // Use the actual Testset ID
    scoringConfigId, // Use the actual Scoring Config ID
  });

  // Run all Testcases in parallel
  const testcases = await scorecard.testset.getTestcases(testsetId);
  await Promise.all(
    testcases.results.map(async (testcase) => {
      // Run model on the given Testcase
      const modelResponse = await answerQuery(testcase.userQuery);
      // Record model response in Scorecard
      await scorecard.testrecord.create(run.id, {
        testcaseId: testcase.id,
        testsetId: run.testsetId,
        userQuery: testcase.userQuery,
        context: testcase.context,
        ideal: testcase.ideal,
        response: modelResponse,
      });
    })
  );

  // Set run status to waiting for metric scoring
  await scorecard.run.updateStatus(run.id, {
    status: "awaiting_scoring",
  });

  console.log("Visit the Scorecard UI to score the run:");

  // Use your Testset's project ID here:
  console.log(
    `https://app.getscorecard.ai/projects/${projectId}/runs/grades/${run.id}`
  );

  return run;
}

// Kick off the run when the script is executed
runTests().catch(console.error);
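Assuming the script is saved as run_tests.js (the file name is just an example), execute it with Node:

node run_tests.js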
6

Run Scoring

Now let’s review the outputs of our run in Scorecard and start scoring by clicking the “Run Scoring” button.

Scorecard UI: Run Scoring

7

View Results

Finally, let’s review the results in the Scorecard UI. Here you can view and understand the performance of your LLM system:

Scorecard UI: Viewing Results