Evaluate a simple LLM system with Scorecard in minutes.
Scorecard has Python and JavaScript SDKs. If you’re using Python, you can follow along in Google Colab.
Set up accounts
Create a Scorecard account, then set your Scorecard API key as an environment variable.
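As a minimal sketch, a script can read the key from the environment and fail early if it is missing. The variable name `SCORECARD_API_KEY` and the helper `get_api_key` are assumptions for illustration; check the SDK documentation for the name it actually reads.

```python
import os

# Assumption: the key is stored under SCORECARD_API_KEY; confirm the exact
# variable name in the Scorecard SDK documentation.
API_KEY_VAR = "SCORECARD_API_KEY"


def get_api_key(env=os.environ) -> str:
    """Return the API key from the given environment mapping, or raise."""
    key = env.get(API_KEY_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {API_KEY_VAR} environment variable first.")
    return key
```

Failing at startup with a clear message is usually easier to debug than an authentication error in the middle of an evaluation run.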
Install Scorecard SDK
Install the Scorecard library:
Create a simple LLM system to evaluate
For the quickstart, the LLM system is run_system(), which rewrites the user’s message into a different tone and optionally addresses the recipient.
In Scorecard, system inputs and outputs are dictionaries, so:
system_input["original"] is the user’s message.
system_input["tone"] is the tone to translate to.
system_input["recipient"] (optional) is the recipient to address.
The output has the form { "rewritten": "..." }.
Here’s a simple system that does not require an OpenAI API key:
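A minimal sketch of such a system, assuming simple rule-based rewriting stands in for a real LLM call. The tone rules and the greeting format are hypothetical and exist only to produce an output of the documented shape.

```python
def run_system(system_input: dict) -> dict:
    """Rewrite a message in the requested tone without calling an LLM.

    Hypothetical rule-based stand-in: a real system would prompt a model.
    """
    original = system_input["original"]
    tone = system_input["tone"]
    recipient = system_input.get("recipient")  # optional key

    # Illustrative tone rules only; an LLM would rewrite the wording itself.
    if tone == "formal":
        body = original.rstrip("!.") + "."
    elif tone == "excited":
        body = original.rstrip("!.") + "!!!"
    else:
        body = original

    # Address the recipient only when one was provided.
    greeting = f"Dear {recipient}, " if recipient else ""
    return {"rewritten": greeting + body}


if __name__ == "__main__":
    print(run_system({"original": "See you at noon.", "tone": "excited", "recipient": "Sam"}))
```

Because the function takes a dictionary and returns a dictionary, swapping in a real model later only changes the body, not the interface used during evaluation.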
Set up Scorecard
Specify Project
Create a new Project in Scorecard, or use the existing default Project. This is where your testsets, metrics, and runs are stored.
Set the Project ID for later:
Create test cases
Create some test cases to represent the inputs and the ideal (expected) outputs of your tone translator system.
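For illustration, test cases can be drafted locally as plain dictionaries before they are uploaded. The `inputs` and `expected` key names, the sample messages, and the `validate_testcase` helper are assumptions; the fields Scorecard stores may be named differently.

```python
# Assumption: each test case keeps system inputs under "inputs" and the ideal
# output under "expected". The messages below are made-up examples.
testcases = [
    {
        "inputs": {"original": "Send me the report.", "tone": "polite", "recipient": "Dana"},
        "expected": {"rewritten": "Hi Dana, could you please send me the report?"},
    },
    {
        # "recipient" is optional, so this case omits it.
        "inputs": {"original": "The meeting is moved to 3pm.", "tone": "casual"},
        "expected": {"rewritten": "Heads up, the meeting's now at 3pm."},
    },
]

REQUIRED_INPUT_KEYS = {"original", "tone"}


def validate_testcase(testcase: dict) -> None:
    """Raise ValueError if a test case lacks the keys the system relies on."""
    missing = REQUIRED_INPUT_KEYS - set(testcase["inputs"])
    if missing:
        raise ValueError(f"test case is missing input keys: {sorted(missing)}")
    if "rewritten" not in testcase["expected"]:
        raise ValueError("expected output must contain 'rewritten'")


for testcase in testcases:
    validate_testcase(testcase)
```

Including at least one case without a recipient exercises the optional path of the system.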
Create Metrics
Create two LLM-as-a-judge Metrics to evaluate whether your system uses the correct tone and addresses the recipient.
The Metric’s prompt template uses Jinja syntax. For each Testcase, we replace {{inputs.tone}} with the test case’s tone value and send the resulting prompt to the judge.
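The substitution step can be sketched with a small standard-library function. This is a simplified stand-in that handles only dotted `{{ ... }}` placeholders, not real Jinja rendering; the `render` helper and the template wording are hypothetical.

```python
import re

# Matches {{ dotted.path }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{ dotted.path }} with the value found in context."""

    def lookup(match: re.Match) -> str:
        value = context
        # Walk nested dictionaries one path segment at a time.
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical judge prompt; only {{inputs.tone}} comes from the guide.
template = "Is the following message written in a {{inputs.tone}} tone?"

if __name__ == "__main__":
    print(render(template, {"inputs": {"tone": "polite"}}))
```

A missing key raises `KeyError` here, which surfaces a mismatch between the template and the test case fields early.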
Evaluate system
Call run_system() against the test cases and record the scored results in Scorecard.
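The run loop itself can be sketched locally, under the assumption that each test case is a dictionary with `inputs` and `expected` keys. The `evaluate_locally` helper is hypothetical and only collects outputs; uploading and scoring the records is left to the Scorecard SDK.

```python
from typing import Callable


def evaluate_locally(testcases: list[dict], system: Callable[[dict], dict]) -> list[dict]:
    """Run the system on every test case and pair each output with its expected value."""
    records = []
    for testcase in testcases:
        outputs = system(testcase["inputs"])
        records.append(
            {
                "inputs": testcase["inputs"],
                "expected": testcase.get("expected"),
                "outputs": outputs,
            }
        )
    return records


if __name__ == "__main__":
    # Stand-in system for demonstration; pass run_system in practice.
    def shout(system_input: dict) -> dict:
        return {"rewritten": system_input["original"].upper()}

    cases = [{"inputs": {"original": "hi", "tone": "excited"}, "expected": {"rewritten": "Hi!"}}]
    for record in evaluate_locally(cases, shout):
        print(record)
```

Passing the system in as a callable keeps the loop reusable when the rule-based stand-in is later replaced with a real model call.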
Analyze results
Finally, review the results in Scorecard to understand the performance of the tone translator system.
Viewing results in the Scorecard UI.