Scorecard has Python and JavaScript SDKs. If you’re using Python, you can follow along in Google Colab.

Steps

1. Set up accounts

Create a Scorecard account, then set your Scorecard API key as an environment variable.

export SCORECARD_API_KEY="your_scorecard_api_key"
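If you're following along in Google Colab (or anywhere shell exports are awkward), you can set the key from Python instead:

import os

os.environ["SCORECARD_API_KEY"] = "your_scorecard_api_key"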
2. Install Scorecard SDK

Install the Scorecard library:

pip install scorecard-ai
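If you're installing from a notebook cell rather than a shell, the %pip magic installs into the active kernel:

%pip install scorecard-ai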
3. Create a simple LLM system to evaluate

For this quickstart, the LLM system is run_system(), which rewrites the user’s message in a different tone and optionally addresses a recipient.

In Scorecard, system inputs and outputs are dictionaries, so:

  • system_input["original"] is the user’s message.
  • system_input["tone"] is the tone to translate to.
  • system_input["recipient"] (optional) is the recipient to address.
  • The output contains the rewritten message, e.g. { "rewritten": "..." }

Here’s a simple system that does not require an OpenAI API key:


# Example:
# run_system({"original": "How are you?", "recipient": "team", "tone": "formal"})
# -> {"rewritten": "Hello team in tone formal,\nHOW ARE YOU?"}
def run_system(system_input: dict) -> dict:
    recipient = system_input.get("recipient", "")
    return {
        "rewritten": f"Hello {recipient} in tone {system_input['tone']},\n{system_input['original'].upper()}"
    }
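If you'd rather have a real model do the rewriting, here's a minimal sketch of an LLM-backed variant. It assumes the openai package is installed and OPENAI_API_KEY is set; the model name and prompt wording are illustrative, not part of the quickstart:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_llm_system(system_input: dict) -> dict:
    recipient = system_input.get("recipient")
    recipient_clause = f", addressed to {recipient}" if recipient else ""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the following message in a {system_input['tone']} tone"
                f"{recipient_clause}:\n\n{system_input['original']}"
            ),
        }],
    )
    return {"rewritten": response.choices[0].message.content}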
4. Set up Scorecard

from scorecard_ai import Scorecard

# By default, the API key is read from the SCORECARD_API_KEY environment variable.
scorecard = Scorecard()
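If you prefer to pass the key explicitly rather than rely on the environment, a sketch (assuming your SDK version accepts an api_key argument):

import os
from scorecard_ai import Scorecard

scorecard = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])  # assumed `api_key` kwarg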
5. Specify Project

Create a new Project in Scorecard, or use the existing default Project. This is where your testsets, metrics, and runs are stored.

Set the Project ID for later:

PROJECT_ID = "123"  # Replace with your project ID
6. Create test cases

Create some test cases to represent the inputs and the ideal (expected) outputs of your tone translator system.

testcases = [
    {
        # `inputs` gets passed to the system as a dictionary.
        "inputs": {
            "original": "We need your feedback on the new designs ASAP.",
            "tone": "polite",
        },
        # `expected` is the ideal output; the LLM-as-a-judge compares the system's actual output against it.
        "expected": {
            "idealRewritten": "Hi, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
        },
    },
    {
        "inputs": {
            "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
            "tone": "funny",
        },
        "expected": {
            "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
        },
    },
]
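Before recording anything in Scorecard, you can optionally smoke-test the system over these inputs locally:

for testcase in testcases:
    print(run_system(testcase["inputs"])["rewritten"])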
7. Create Metrics

Create two LLM-as-a-judge Metrics: one to evaluate whether your system matches the intended tone, and one to check that it addresses the recipient only when one is specified.

The Metric’s prompt template uses Jinja syntax. For each Testcase, Scorecard fills in the template (for example, replacing {{inputs.tone}} with the test case’s tone value) and sends the rendered prompt to the judge.
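Scorecard renders these templates server-side; purely for intuition, here's roughly what the substitution looks like if you reproduce it locally with the jinja2 package (illustrative only):

from jinja2 import Template

template = Template('How well does the response match the intended tone: "{{inputs.tone}}"?')
print(template.render(inputs={"tone": "polite"}))
# How well does the response match the intended tone: "polite"?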

import textwrap

tone_accuracy_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Tone accuracy",
    description="How well does it match the intended tone?",
    eval_type="ai",
    output_type="int",
    prompt_template=textwrap.dedent("""
      You are a tone evaluator. Grade the response on how well it matches the
      intended tone: "{{inputs.tone}}". Use a score of 1 if the tones are very
      different and 5 if they are exactly the same.
      
      Response: {{outputs.rewritten}}
      
      Ideal response: {{expected.idealRewritten}}
      
      {{ gradingInstructionsAndExamples }}"""),
)

recipient_address_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Recipient address",
    description="Does it address the recipient only if specified?",
    eval_type="ai",
    output_type="boolean",
    prompt_template=textwrap.dedent("""
      {% if inputs.recipient %}
        Does the response refer to the correct recipient: {{inputs.recipient}}?
        Response: {{outputs.rewritten}}
      {% else %}
        The response should avoid referring to any specific recipient.
        Response: {{outputs.rewritten}}
      {% endif %}
      
      {{ gradingInstructionsAndExamples }}"""),
)
8. Evaluate system

Call run_system() against the test cases and record the scored results in Scorecard.

from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testcases=testcases,
    metric_ids=[tone_accuracy_metric.id, recipient_address_metric.id],
    system=lambda system_input, _system_version: run_system(system_input),
)

print(f'Go to {run["url"]} to view your scored results.')
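The system callable receives the test case inputs plus a second argument (named _system_version above) that this quickstart ignores; an equivalent named function makes the signature explicit:

def system_under_test(system_input: dict, _system_version) -> dict:
    # The second argument is unused here, just as the lambda above ignores it.
    return run_system(system_input)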
9. Analyze results

Finally, review the results in Scorecard to understand the performance of the tone translator system.

Viewing results in the Scorecard UI.