In this quickstart we will:

  • Set up Scorecard
  • Create a Testset
  • Create an example LLM app with OpenAI
  • Define the evaluation setup
  • Score the LLM app with the Testset
  • Review evaluation results in the Scorecard UI

Follow along in the Google Colab notebook.

Steps

1. Setup

First, create a Scorecard account and find your SCORECARD_API_KEY in the account settings. Since this example builds a simple LLM application with OpenAI, you'll also need an OpenAI API key. Set both API keys as environment variables as shown below, and install the Scorecard and OpenAI Python libraries:

Python Setup
export SCORECARD_API_KEY="SCORECARD_API_KEY"
export OPENAI_API_KEY="OPENAI_API_KEY"
pip install --pre scorecard-ai
pip install openai
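
If you're following along in the Colab notebook rather than a shell, a minimal alternative is to set the same environment variables from Python before creating any clients (replace the placeholder strings with your real keys):

import os

os.environ["SCORECARD_API_KEY"] = "SCORECARD_API_KEY"
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"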

2. Set up constants

Configure the Scorecard client and enter your Project’s ID. We’ll need that to know which Project to place our Testsets and Runs in.

import os

from scorecard_ai import Scorecard

# The client authenticates with the SCORECARD_API_KEY we set earlier.
scorecard = Scorecard(bearer_token=os.environ["SCORECARD_API_KEY"])

PROJECT_ID = "310"  # Replace with your Project ID

3. Create a Testset and Add Testcases

Now, let’s create a Testset in a specific project and add some Testcases using the SDK. A Testset is a collection of Testcases used to evaluate the performance of an LLM application across a variety of inputs and scenarios. A Testcase is a single input to an LLM that is used for scoring. After we create a Testset, we’ll grab its ID to use later.

# Create a Testset with a schema matching our use case
testset = scorecard.testsets.create(
    project_id=PROJECT_ID,
    name="Tone rewriter testset",
    description="Testcases about rewriting messages in a different tone.",
    field_mapping={
        # Inputs are fields that represent the input to the AI system.
        "inputs": ["original", "recipient", "tone"],
        # Labels are fields that represent the expected output of the AI system.
        "labels": ["idealRewritten"],
        # Metadata fields are used for grouping Testcases, but not seen by the AI system.
        "metadata": [],
    },
    json_schema={
        "type": "object",
        "properties": {
            # The original message.
            "original": {"type": "string"},
            # The recipient of the message.
            "recipient": {"type": "string"},
            # The tone that the message should be rewritten in.
            "tone": {"type": "string"},
            # The ideal AI-generated rewritten message.
            "idealRewritten": {"type": "string"},
        },
        "required": ["original", "tone", "idealRewritten"],
    },
)

# Add Testcases matching the Testset's schema to the Testset
scorecard.testcases.create(
    testset_id=testset.id,
    items=[
        {
            "json_data": {
                "original": "We need your feedback on the new designs ASAP.",
                "tone": "polite",
                "recipient": "Darius",
                "idealRewritten": "Hi Darius, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
            },
        },
        {
            "json_data": {
                "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
                "tone": "funny",
                "recipient": "team",
                "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
            },
        },
    ],
)

print("Visit the Scorecard UI to view your Testset:")
print(f"https://app.getscorecard.ai/projects/{PROJECT_ID}/testsets/{testset.id}")

4. Create a Simple LLM App

Next, let's create the simple LLM application that we'll evaluate with Scorecard. It's represented by the following function, which uses OpenAI's GPT-4o-mini to rewrite the user's message in a specified tone.

from openai import OpenAI

# The OpenAI client reads OPENAI_API_KEY from the environment by default.
openai = OpenAI()

# The "system under test" -- the AI system that you want to evaluate.
def run_system(system_input: dict) -> dict:
    response = openai.responses.create(
        model="gpt-4o-mini",
        instructions=f"You are a tone translator that converts a user's message to a different tone ({system_input['tone']}). Address the recipient: {system_input.get('recipient')}",
        input=system_input['original'],
    )
    return {
        "rewritten": response.output_text
    }
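
To see what the system produces, you can call it directly on one of the inputs from the Testset above. This is just a quick local check; the exact wording of the rewritten message will vary from run to run:

example_output = run_system({
    "original": "We need your feedback on the new designs ASAP.",
    "recipient": "Darius",
    "tone": "polite",
})
print(example_output["rewritten"])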

5. Create Metrics

Now that we have a system that rewrites messages in a different tone, let's build a metric to understand how relevant the rewritten messages are to the original message and the requested tone. Let's go to the Scoring Lab and select “New Metric”.

Scorecard UI: New Metric

From here, let's create a metric for answer relevancy:

Scorecard UI: Metric Definition

You can evaluate your LLM systems with one or multiple metrics. For this quickstart, let's just use the Answer Relevancy metric and grab its Metric ID for later.

METRIC_IDS = ["987"]  # Replace with your Metric ID(s)

6. Create a Run

Now let's create a Run and execute the system on each Testcase, using the Testset from before and the Metric IDs from above:

# Create a new Run on the Testset with the given Metrics.
run = scorecard.runs.create(project_id=PROJECT_ID, testset_id=testset.id, metric_ids=METRIC_IDS)

# Run the system on each Testcase.
for testcase in scorecard.testcases.list(testset.id):
    outputs = run_system(testcase.inputs)
    scorecard.records.create(
        run_id=run.id,
        testcase_id=testcase.id,
        inputs=testcase.inputs,
        labels=testcase.labels,
        outputs=outputs,
    )

# Mark the Run as done with execution and ready for scoring.
scorecard.runs.update(run_id=run.id, status="awaiting_scoring")

run_url = f"https://app.getscorecard.ai/projects/{PROJECT_ID}/runs/grades/{run.id}"
print(f"Go to {run_url} and click \"Run Scoring\" to grade your Records.")

7. Run Scoring

Now let’s review the outputs of our execution in the Scorecard UI and run scoring by clicking on the “Run Scoring” button.

Scorecard UI: Run Scoring

8. View Results

Finally, let's review the results in the Scorecard UI. Here you can view and understand the performance of your LLM system:

Scorecard UI: Viewing Results