Quickstart
Evaluate a simple LLM system with Scorecard in minutes.
Scorecard has Python and JavaScript SDKs. If you’re using Python, you can follow along in Google Colab.
Steps
Setup accounts
Create a Scorecard account and an OpenAI account, then set your Scorecard API key and OpenAI API key as environment variables.
Install SDKs
Install the Scorecard and OpenAI libraries:
Create simple LLM system
Create a simple LLM system to evaluate. When using Scorecard, systems use dictionaries as input and output.
For the quickstart, the LLM system is run_system()
, which translates the user’s message to a different tone.
input["original"]
is the user’s message and input["tone"]
is the tone to translate to. The output is a dictionary containing the translated message (rewritten
).
Setup Scorecard
Create Project
Create a Project in Scorecard. This will be where your tests and runs will be stored. Copy the Project ID for later.
Create Testset with Testcases
Create some testcases to represent the inputs to your system and the ideal (expected
) outputs.
Create Metrics
Create an LLM-as-a-judge Metric to evaluate the tone accuracy of your system.
The Metric’s prompt template uses Jinja syntax. For each Testcase, we will send the prompt template to the judge and replace {{inputs.tone}}
with the Testcase’s tone
value.
Evaluate system
Run the system against the Metrics you’ve created and record the results in Scorecard.
Analyze results
Finally, review the results in Scorecard to understand the performance of the tone translator system.
Viewing results in the Scorecard UI.