SDK Quickstart - Scorecard Docs

Scorecard has Python (v3.2.1) and TypeScript (v2.2.0) SDKs. If you’re using Python, you can follow along in Google Colab.

Steps

Setup accounts

Create a Scorecard account, then set your Scorecard API key as an environment variable.

export SCORECARD_API_KEY="your_scorecard_api_key"

Install Scorecard SDK

Install the Scorecard library:

pip install scorecard-ai

npm install scorecard-ai

Create simple LLM system to evaluate

For the quickstart, the LLM system is run_system(), a simple function that takes an input and returns an output.In Scorecard, system inputs and outputs are dictionaries. The function receives system_input and returns a dictionary.

Without OpenAI key
With OpenAI key

Here’s a simple system that does not require an OpenAI API key:

# Example:
# run_system({"original": "How are you?", "recipient": "team", "tone": "formal"})
# -> {"rewritten": "Hello team,\nHOW ARE YOU?"}
def run_system(system_input: dict) -> dict:
    recipient = system_input.get("recipient", "")
    return {
      "rewritten": f"Hello {recipient} in tone {system_input['tone']},\n{system_input['original'].upper()}"
    }

// Example:
// runSystem({"original": "How are you?", "recipient": "team", "tone": "formal"})
// -> {"rewritten": "Hello team,\nHOW ARE YOU?"}
async function runSystem(systemInput) {
  const recipient = systemInput.recipient ?? "";
  return {
    rewritten: `Hello ${recipient} in tone ${systemInput.tone},\n${systemInput.original.toUpperCase()}`
  };
}

If you have an OpenAI API key, you can create a realistic system.Install the OpenAI library:

pip install openai

npm install openai

Then, create the system:

from openai import OpenAI

# Find your API key at https://platform.openai.com/api-keys
openai = OpenAI(api_key="your_openai_api_key")

# Example:
# run_system({"original": "Hello, world!", "tone": "formal concise", "recipient": "team"})
# -> {"rewritten": "Greetings, team."}
def run_system(system_input: dict) -> dict:
    recipient = system_input.get('recipient','')
    response = openai.responses.create(
        model="gpt-4o-mini",
        instructions=f"You are a tone translator. Convert the user's message to the tone: {system_input['tone']}." + (f" Address the recipient: {recipient}" if recipient else ""),
        input=system_input['original'],
    )
    return {
        "rewritten": response.output_text
    }

import OpenAI from 'openai';

// By default, the API key is taken from the environment variable.
const openai = new OpenAI();

// Example:
// runSystem({"original": "Hello, world!", "tone": "formal concise", "recipient": "team"})
// -> {"rewritten": "Greetings, team."}
async function runSystem(systemInput) {
  const recipient = systemInput.recipient ?? "";
  const response = await openai.responses.create({
    model: 'gpt-4o-mini',
    instructions: `You are a tone translator. Convert the user's message to the tone: ${systemInput.tone}. ` +
      (systemInput.recipient ? `Address the recipient: ${systemInput.recipient}` : ""),
    input: systemInput.original,
  });
  return {
    rewritten: response.output_text
  };
}

Setup Scorecard

from scorecard_ai import Scorecard

# By default, the API key is taken from the environment variable.
scorecard = Scorecard()

import Scorecard from 'scorecard-ai';

// By default, the API key is taken from the environment variable.
const scorecard = new Scorecard();

Specify Project

Create a new Project in Scorecard, or use the existing default Project. This is where your testsets, metrics, and runs are stored.Set the Project ID for later:

PROJECT_ID = "123"  # Replace with your project ID

const PROJECT_ID = "123"; // Replace with your Project ID

Create test cases

Create some test cases to represent the inputs and the ideal (expected) outputs of your system.

testcases = [
    {
        # `inputs` gets passed to the system as a dictionary.
        "inputs": {
            "original": "We need your feedback on the new designs ASAP.",
            "tone": "polite",
        },
        # `expected` is the ideal output of the system used by the LLM-as-a-judge to evaluate the system.
        "expected": {
            "idealRewritten": "Hi, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
        },
    },
    {
        "inputs": {
            "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
            "tone": "funny",
        },
        "expected": {
            "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
        },
    },
]

const testcases = [
  {
    // `inputs` gets passed to the system as an object.
    inputs: {
      original: 'We need your feedback on the new designs ASAP.',
      tone: 'polite',
    },
    // `expected` is the ideal output of the system used by the LLM-as-a-judge to evaluate the system.
    expected: {
      idealRewritten: 'Hi! your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.',
    },
  },
  {
    inputs: {
      original: "I'll be late to the office because my cat is sleeping on my keyboard.",
      tone: 'funny',
    },
    expected: {
      idealRewritten: "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
    },
  },
];

Create Metrics

Create two LLM-as-a-judge Metrics to evaluate whether your system uses the correct tone and addresses the recipient.The Metric’s prompt template uses Jinja syntax. For each Testcase, we will send the prompt template to the judge and replace {{inputs.tone}} with the test case’s tone value.

import textwrap

tone_accuracy_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Tone accuracy",
    description="How well does it match the intended tone?",
    eval_type="ai",
    output_type="int",
    prompt_template=textwrap.dedent("""
      You are a tone evaluator. Grade the response on how well it matches the
      intended tone: "{{inputs.tone}}". Use a score of 1 if the tones are very
      different and 5 if they are the exact same.
      
      Response: {{outputs.rewritten}}
      
      Ideal response: {{expected.idealRewritten}}
      
      {{ gradingInstructionsAndExamples }}"""),
)

recipient_address_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Recipient address",
    description="Does it address the recipient only if specified?",
    eval_type="ai",
    output_type="boolean",
    prompt_template=textwrap.dedent("""
      {% if inputs.recipient %}
        Does the response refer to the correct recipient: {{inputs.recipient}}?
        Response: {{outputs.rewritten}}
      {% else %}
        The response should avoid referring to any specific recipient.
        Response: {{outputs.rewritten}}
      {% endif %}
      
      {{ gradingInstructionsAndExamples }}"""),
)

const toneAccuracyMetric = await scorecard.metrics.create({
  projectId: PROJECT_ID,
  name: "Tone accuracy",
  evalType: "ai",
  outputType: "int",
  promptTemplate: "You are a tone evaluator. Grade the response on how well it matches the intended tone {{inputs.tone}} and the tone of the ideal response. Use a score of 1 if the tones are very different and 5 if they are the exact same.\n\nResponse: {{outputs.rewritten}}\n\nIdeal response: {{expected.idealRewritten}}\n\n{{ gradingInstructionsAndExamples }}",
});

const recipientAddressMetric = await scorecard.metrics.create({
  projectId: PROJECT_ID,
  name: "Recipient address",
  evalType: "ai",
  outputType: "boolean",
  promptTemplate: "Does the response refer to the correct recipient: {{inputs.recipient}}? Response: {{outputs.rewritten}}\n\n{{ gradingInstructionsAndExamples }}",
});

Evaluate system

Call run_system() against the test cases and record the scored results in Scorecard.

from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testcases=testcases,
    metric_ids=[tone_accuracy_metric.id, recipient_address_metric.id],
    system=lambda input, _system_version: run_system(input),
)

print(f'Go to {run["url"]} to view your scored results.')

import Scorecard, { runAndEvaluate } from 'scorecard-ai';

const run = await runAndEvaluate(scorecard, {
  projectId: PROJECT_ID,
  testcases: testcases,
  metricIds: [toneAccuracyMetric.id, recipientAddressMetric.id],
  system: runSystem,
});
console.log(`Go to ${run.url} to view your scored results.`);

Analyze results

Finally, review the results in Scorecard to understand the performance of your system.

Where to go next

Tracing Quickstart — Add observability to your LLM calls
SDK + Tracing — Combine evaluation data with trace observability in a single record

​Steps

​Where to go next

Steps

Where to go next