For this quickstart, the LLM system is run_system(), a simple function that takes an input and returns an output. In Scorecard, system inputs and outputs are dictionaries: the function receives a system_input dictionary and returns an output dictionary.
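For example, the tone-translator system built below exchanges dictionaries shaped like this (values taken from the examples further down):

# Shape of the dictionaries the system receives and returns.
system_input = {"original": "Hello, world!", "tone": "formal concise", "recipient": "team"}
system_output = {"rewritten": "Greetings, team."}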
Without OpenAI key

Here’s a simple system that does not require an OpenAI API key:
# Example:
# run_system({"original": "How are you?", "recipient": "team", "tone": "formal"})
# -> {"rewritten": "Hello team in tone formal,\nHOW ARE YOU?"}
def run_system(system_input: dict) -> dict:
    recipient = system_input.get("recipient", "")
    return {
        "rewritten": f"Hello {recipient} in tone {system_input['tone']},\n{system_input['original'].upper()}"
    }
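You can sanity-check the function locally before wiring it into Scorecard:

print(run_system({"original": "How are you?", "recipient": "team", "tone": "formal"}))
# {'rewritten': 'Hello team in tone formal,\nHOW ARE YOU?'}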
With OpenAI key

If you have an OpenAI API key, you can create a more realistic system. First, install the OpenAI library:
pip install openai
Then, create the system:
from openai import OpenAI

# Find your API key at https://platform.openai.com/api-keys
openai = OpenAI(api_key="your_openai_api_key")

# Example:
# run_system({"original": "Hello, world!", "tone": "formal concise", "recipient": "team"})
# -> {"rewritten": "Greetings, team."}
def run_system(system_input: dict) -> dict:
    recipient = system_input.get("recipient", "")
    response = openai.responses.create(
        model="gpt-4o-mini",
        instructions=(
            f"You are a tone translator. Convert the user's message to the tone: {system_input['tone']}."
            + (f" Address the recipient: {recipient}" if recipient else "")
        ),
        input=system_input["original"],
    )
    return {
        "rewritten": response.output_text
    }
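As before, you can try the function locally. This version makes a real request to OpenAI, so it assumes a valid API key; the exact wording of the output will vary:

print(run_system({"original": "Hello, world!", "tone": "formal concise", "recipient": "team"}))
# e.g. {'rewritten': 'Greetings, team.'}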
4. Set up Scorecard
from scorecard_ai import Scorecard

# By default, the API key is taken from the environment variable.
scorecard = Scorecard()
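If you prefer to pass the key explicitly, here is a minimal sketch; it assumes the client constructor accepts an api_key argument and that SCORECARD_API_KEY is the environment variable where you stored your key:

import os

# Assumption: the Scorecard client accepts an explicit `api_key` argument;
# SCORECARD_API_KEY is an example name for wherever you keep the key.
scorecard = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])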
5. Specify Project
Create a new Project in Scorecard, or use the existing default Project. This is where your testsets, metrics, and runs are stored. Set the Project ID for later:
PROJECT_ID = "123" # Replace with your project ID
6. Create test cases
Create some test cases to represent the inputs and the ideal (expected) outputs of your system.
testcases = [
    {
        # `inputs` gets passed to the system as a dictionary.
        "inputs": {
            "original": "We need your feedback on the new designs ASAP.",
            "tone": "polite",
        },
        # `expected` is the ideal output of the system used by the LLM-as-a-judge to evaluate the system.
        "expected": {
            "idealRewritten": "Hi, your feedback is crucial to the success of the new designs. Please share your thoughts as soon as possible.",
        },
    },
    {
        "inputs": {
            "original": "I'll be late to the office because my cat is sleeping on my keyboard.",
            "tone": "funny",
        },
        "expected": {
            "idealRewritten": "Hey team! My cat's napping on my keyboard and I'm just waiting for her to give me permission to leave. I'll be a bit late!",
        },
    },
]
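Before creating any Metrics, you can spot-check how either version of run_system() handles a test case by calling it on the inputs directly:

# Preview the system's output for the first test case.
print(run_system(testcases[0]["inputs"]))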
7. Create Metrics
Create two LLM-as-a-judge Metrics to evaluate whether your system uses the correct tone and addresses the recipient. The Metric’s prompt template uses Jinja syntax: for each Testcase, the prompt template is sent to the judge with {{inputs.tone}} replaced by that test case’s tone value.
import textwrap

tone_accuracy_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Tone accuracy",
    description="How well does it match the intended tone?",
    eval_type="ai",
    output_type="int",
    prompt_template=textwrap.dedent("""
        You are a tone evaluator. Grade the response on how well it matches the intended tone: "{{inputs.tone}}".
        Use a score of 1 if the tones are very different and 5 if they are the exact same.
        Response: {{outputs.rewritten}}
        Ideal response: {{expected.idealRewritten}}
        {{ gradingInstructionsAndExamples }}"""),
)

recipient_address_metric = scorecard.metrics.create(
    project_id=PROJECT_ID,
    name="Recipient address",
    description="Does it address the recipient only if specified?",
    eval_type="ai",
    output_type="boolean",
    prompt_template=textwrap.dedent("""
        {% if inputs.recipient %}
        Does the response refer to the correct recipient: {{inputs.recipient}}?
        Response: {{outputs.rewritten}}
        {% else %}
        The response should avoid referring to any specific recipient.
        Response: {{outputs.rewritten}}
        {% endif %}
        {{ gradingInstructionsAndExamples }}"""),
)
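To get a feel for what the judge receives, here is a rough local illustration of the template substitution using the jinja2 package (an extra install; Scorecard renders these templates server-side, so this snippet only mimics the {{inputs.tone}} replacement for the first test case):

from jinja2 import Template

# Render one line of the tone-accuracy template with the first test case's inputs.
line = 'Grade the response on how well it matches the intended tone: "{{inputs.tone}}".'
print(Template(line).render(inputs=testcases[0]["inputs"]))
# Grade the response on how well it matches the intended tone: "polite".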
8. Evaluate system
Call run_system() against the test cases and record the scored results in Scorecard.
from scorecard_ai.lib import run_and_evaluate

run = run_and_evaluate(
    client=scorecard,
    project_id=PROJECT_ID,
    testcases=testcases,
    metric_ids=[tone_accuracy_metric.id, recipient_address_metric.id],
    system=lambda input, _system_version: run_system(input),
)
print(f'Go to {run["url"]} to view your scored results.')
9. Analyze results
Finally, open the run URL and review the results in Scorecard to see how your system scored on each Metric for every test case.