Scorecard has Python and JavaScript SDKs. If you’re using Python, you can follow along in Google Colab.

Steps

1. Set up accounts

Create a Scorecard account and an OpenAI account, then set your Scorecard API key and OpenAI API key as environment variables.

export SCORECARD_API_KEY="your_scorecard_api_key"
export OPENAI_API_KEY="your_openai_api_key"
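
If you are following along in a notebook such as Google Colab, you can set the same variables from Python instead of the shell:

import os

# Equivalent to the shell exports above, for notebook environments.
os.environ["SCORECARD_API_KEY"] = "your_scorecard_api_key"
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"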
2. Install SDKs

Install the Scorecard and OpenAI libraries:

pip install scorecard-ai
pip install openai
3. Create a simple LLM system

Create a simple LLM system to evaluate. In Scorecard, a system takes a dictionary as its input and returns a dictionary as its output.

For the quickstart, the LLM system is run_system(), which translates the user’s message to a different tone.

input["original"] is the user’s message and input["tone"] is the target tone. The output is a dictionary containing the rewritten message under the "rewritten" key.

from openai import OpenAI

# By default, the API key is taken from the environment variable.
openai = OpenAI()

# Example:
# run_system({"original": "Hello, world!", "tone": "formal"})
# -> {"rewritten": "Greetings, world."}
def run_system(system_input: dict) -> dict:
    response = openai.responses.create(
        model="gpt-4o-mini",
        instructions=f"You are a tone translator. Convert the user's message to the tone: {system_input['tone']}",
        input=system_input['original'],
    )
    return {
        "rewritten": response.output_text
    }
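
To sanity-check the system before wiring it into Scorecard, you can call it directly (this makes a real OpenAI API call, so OPENAI_API_KEY must be set):

# Quick local check; the exact wording of the output will vary between runs.
print(run_system({"original": "Hello, world!", "tone": "formal"}))
# e.g. {"rewritten": "Greetings, world."}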
4. Set up Scorecard

Initialize the Scorecard client:

from scorecard_ai import Scorecard

# By default, the API key is taken from the environment variable.
scorecard = Scorecard()
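
If you would rather not rely on the environment variable, the key can be passed explicitly when constructing the client. This is a minimal sketch, assuming the constructor accepts an api_key argument:

import os

# Sketch: pass the API key explicitly instead of reading SCORECARD_API_KEY implicitly.
# Assumes the Scorecard constructor accepts an `api_key` argument.
scorecard = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])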
5. Create Project

Create a Project in Scorecard. This will be where your tests and runs will be stored. Copy the Project ID for later.

PROJECT_ID = "310"  # Replace with your project ID
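
Alternatively, the Project can be created from code. This is a sketch only; it assumes the SDK exposes a projects.create method mirroring the Create Project endpoint, and the name and description values are made up:

# Sketch (assumption): create the Project programmatically instead of in the UI.
project = scorecard.projects.create(
    name="Tone translator quickstart",
    description="Rewrites messages in a requested tone.",
)
PROJECT_ID = project.id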
6. Create Testset with Testcases

Create some testcases to represent the inputs to your system and the ideal (expected) outputs.

testcases = [
    {
        # `inputs` gets passed to the system.
        "inputs": {
            "original": "We need your feedback on the new designs ASAP.",
            "tone": "polite",
        },
        # `expected` is the ideal output of the system; the LLM-as-a-judge compares the system's actual output against it.
        "expected": {
            "idealRewritten": "Hi, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
        },
    },
    {
        "inputs": {
            "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
            "tone": "funny",
        },
        "expected": {
            "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
        },
    },
]
7. Create Metrics

Create an LLM-as-a-judge Metric to evaluate the tone accuracy of your system.

The Metric’s prompt template uses Jinja syntax. For each Testcase, the prompt template is rendered (for example, {{inputs.tone}} is replaced with the Testcase’s tone value) and sent to the judge; an illustrative rendering is shown after the code below.

metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Tone accuracy",
    eval_type="ai",
    output_type="int",
    prompt_template="You are a tone evaluator. Grade the response on how well it matches the intended tone {{inputs.tone}} and the tone of the ideal response. Use a score of 1 if the tones are very different and 5 if they are the exact same.\n\nResponse: {{output.rewritten}}\n\nIdeal response: {{expected.idealRewritten}}",
)
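
To see how the template variables are filled in, here is an illustrative rendering for the first Testcase using the jinja2 package. Scorecard performs this substitution for you; the abridged template and the sample output value below are made up for illustration:

from jinja2 import Template

# Illustration only: render an abridged version of the metric's prompt template
# for the first testcase. The `output` value is a made-up system response.
example_template = Template(
    "Intended tone: {{inputs.tone}}\n"
    "Response: {{output.rewritten}}\n"
    "Ideal response: {{expected.idealRewritten}}"
)
print(example_template.render(
    inputs=testcases[0]["inputs"],
    output={"rewritten": "Could you please share your feedback on the new designs when you have a moment?"},
    expected=testcases[0]["expected"],
))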
8. Evaluate system

Run the system on each Testcase, score its outputs with the Metric you created, and record the results in Scorecard.

from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    metric_ids=[metric.id],
    system=lambda input: run_system(input),
    testcases=testcases,
)
print(f'Go to {run["url"]} to view your results.')
9. Analyze results

Finally, review the results in Scorecard to understand the performance of the tone translator system.

[Screenshot: Viewing results in the Scorecard UI]