Scorecard has Python, JavaScript, and Go SDKs. If you’re using Python, you can follow along in Google Colab.

Steps

1. Set up accounts

Create a Scorecard account and an OpenAI account, then set your Scorecard API key and OpenAI API key as environment variables.

export SCORECARD_API_KEY="your_scorecard_api_key"
export OPENAI_API_KEY="your_openai_api_key"

2. Install SDKs

Install the Scorecard and OpenAI libraries:

pip install --pre scorecard-ai
pip install openai

3. Create a simple LLM system

Create a simple LLM system to evaluate. Scorecard treats a system as a function that takes a dictionary as input and returns a dictionary as output.

For this quickstart, the LLM system is run_system(), which rewrites the user's message in a different tone.

input["original"] is the user's message and input["tone"] is the target tone. The output is a dictionary whose "rewritten" key holds the translated message.

from openai import OpenAI

# By default, the API key is taken from the environment variable.
openai = OpenAI()

# Example:
# run_system({"original": "Hello, world!", "tone": "formal"})
# -> {"rewritten": "Greetings, world."}
def run_system(system_input: dict) -> dict:
    response = openai.responses.create(
        model="gpt-4o-mini",
        instructions=f"You are a tone translator. Convert the user's message to the tone: {system_input['tone']}",
        input=system_input['original'],
    )
    return {
        "rewritten": response.output_text
    }
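
To sanity-check the system before wiring it into Scorecard, call it directly, mirroring the example in the comment above (the exact wording of the model's output will vary):

result = run_system({"original": "Hello, world!", "tone": "formal"})
print(result["rewritten"])  # e.g. "Greetings, world."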

4. Set up Scorecard

Initialize the Scorecard client:

from scorecard_ai import Scorecard

# By default, the API key is taken from the environment variable.
scorecard = Scorecard()
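
If you prefer not to rely on the environment variable, you can pass the key explicitly; the sketch below assumes the client constructor accepts an api_key argument:

import os

from scorecard_ai import Scorecard

# Assumption: Scorecard(api_key=...) overrides the environment variable.
scorecard = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])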

5. Create a Project

Create a Project in the Scorecard UI; this is where your Testsets and Runs will be stored. Copy the Project ID for later.

PROJECT_ID = "310"  # Replace with your project ID

6. Create a Testset with Testcases

Create a Testset in the Project and add some Testcases. A Testset is a collection of Testcases used to evaluate the performance of an LLM system. A Testcase is a single input and ideal output pair that is used for scoring. Copy the Testset ID for later.

Run this code to create a Testset with a schema matching our tone translator app.

# Create a Testset with a schema matching our use case
testset = scorecard.testsets.create(
    project_id=PROJECT_ID,
    name="Tone rewriter testset",
    description="Testcases about rewriting messages in a different tone.",
    field_mapping={
        # Inputs represent the input to the system.
        "inputs": ["original", "tone"],
        # Labels represent the expected output of the system.
        "labels": ["idealRewritten"],
        # Metadata fields are used for grouping Testcases, but not seen by the system.
        "metadata": [],
    },
    json_schema={
        "type": "object",
        "properties": {
            # The original message.
            "original": {"type": "string"},
            # The tone that the message should be rewritten in.
            "tone": {"type": "string"},
            # The ideal AI-generated rewritten message.
            "idealRewritten": {"type": "string"},
        },
        "required": ["original", "tone", "idealRewritten"],
    },
)

# Add Testcases matching the Testset's schema to the Testset
scorecard.testcases.create(
    testset_id=testset.id,
    items=[
        {
            "json_data": {
                "original": "We need your feedback on the new designs ASAP.",
                "tone": "polite",
                "idealRewritten": "Hi, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
            },
        },
        {
            "json_data": {
                "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
                "tone": "funny",
                "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
            },
        },
        {
            "json_data": {
                "original": "Schedule a meeting to discuss this project.",
                "tone": "casual",
                "idealRewritten": "Let's find a time to chat about the project. Coffee or boba?",
            },
        },
    ],
)

TESTSET_ID = testset.id
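
A note on the two schema arguments above: json_schema validates the shape of each Testcase, while field_mapping tells Scorecard which fields are system inputs, which are expected outputs (labels), and which are metadata used only for grouping.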

7. Create a Metric

In the Scorecard UI, create an AI-graded Metric named “Tone accuracy” to evaluate the tone translator against. Copy the Metric ID for later.

(Screenshot: creating a Metric in the Scorecard UI.)

METRIC_IDS = ["987"]  # Replace with your Metric ID

8. Evaluate the system

Run the system against the Testset and Metrics you’ve created and record the results in Scorecard.

from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testset_id=TESTSET_ID,
    metric_ids=METRIC_IDS,
    system=lambda input: run_system(input)
)

print(f'Go to {run.url} and click "Run Scoring" to grade your Run.')
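
Since run_system already takes a single dict and returns a dict, the lambda above is optional; passing system=run_system directly is equivalent.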

9. Run scoring

Click the link in the output above, or find the Run in the Scorecard UI. On the Run page, click the “Run Scoring” button to score your system using the Metric you created.

10. Analyze results

Finally, review the results in Scorecard to understand the performance of the tone translator system.

(Screenshot: viewing results in the Scorecard UI.)