Quickstart
Evaluate a simple LLM system with Scorecard in minutes.
Scorecard has Python, JavaScript, and Go SDKs. If you’re using Python, you can follow along in Google Colab.
Steps
Set up accounts
Create a Scorecard account and an OpenAI account, then set your Scorecard API key and OpenAI API key as environment variables.
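On macOS or Linux, for example, you might export both keys in your shell before running anything (the placeholder values below are yours to replace):

```shell
# Replace the placeholders with your actual keys.
export SCORECARD_API_KEY="your-scorecard-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```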
Install SDKs
Install the Scorecard and OpenAI libraries:
Create simple LLM system
Create a simple LLM system to evaluate. Scorecard systems take a dictionary as input and return a dictionary as output.
For the quickstart, the LLM system is run_system(), which translates the user's message to a different tone. input["original"] is the user's message and input["tone"] is the tone to translate to. The output is a dictionary containing the translated message (rewritten).
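A minimal sketch of such a system is shown below, using the OpenAI chat completions API. The model name and prompt wording are illustrative choices, not requirements, and the optional client parameter is added here for testability:

```python
def run_system(system_input: dict, client=None) -> dict:
    """Rewrite system_input['original'] in the tone given by system_input['tone']."""
    if client is None:
        # Requires the openai package and an OPENAI_API_KEY environment variable.
        from openai import OpenAI
        client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": f"Rewrite the user's message in a {system_input['tone']} tone.",
            },
            {"role": "user", "content": system_input["original"]},
        ],
    )
    # Scorecard systems return a dictionary; "rewritten" holds the translated message.
    return {"rewritten": response.choices[0].message.content}
```

For example, run_system({"original": "Hey, the demo broke again.", "tone": "formal"}) returns a dictionary with a single "rewritten" key.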
Set up Scorecard
Create Project
Create a Project in Scorecard; this is where your Testsets and runs are stored. Copy the Project ID for later.
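With the Python SDK, creating a Project might look like the sketch below. The client and method names (Scorecard, client.projects.create) are assumptions about the scorecard-ai SDK, not confirmed API; check the Scorecard API reference for the exact calls.

```python
def create_project():
    # Assumed scorecard-ai SDK usage; verify names against the Scorecard docs.
    from scorecard_ai import Scorecard

    client = Scorecard()  # reads SCORECARD_API_KEY from the environment
    project = client.projects.create(
        name="Tone Translator",
        description="Quickstart project for the tone translator system.",
    )
    print(f"Project ID: {project.id}")  # copy this ID for later steps
    return project
```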
Create Testset with Testcases
Create a Testset in the Project and add some Testcases. A Testset is a collection of Testcases used to evaluate the performance of an LLM system. A Testcase is a single input and ideal output pair that is used for scoring. Copy the Testset ID for later.
Run this code to create a Testset with a schema matching our tone translator app.
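A hedged sketch of that code is below. All method and field names (client.testsets.create, field_mapping, json_schema, client.testcases.create) are assumptions about the scorecard-ai SDK, and the two Testcases are made-up examples; adapt both against the Scorecard API reference.

```python
def create_testset(project_id: str):
    # Assumed scorecard-ai SDK usage; method and field names are assumptions.
    from scorecard_ai import Scorecard

    client = Scorecard()
    testset = client.testsets.create(
        project_id=project_id,
        name="Tone translations",
        description="Messages with target tones and ideal rewrites.",
        field_mapping={
            "inputs": ["original", "tone"],  # passed to run_system()
            "labels": ["idealRewritten"],    # ideal output used for scoring
            "metadata": [],
        },
        json_schema={
            "type": "object",
            "properties": {
                "original": {"type": "string"},
                "tone": {"type": "string"},
                "idealRewritten": {"type": "string"},
            },
            "required": ["original", "tone", "idealRewritten"],
        },
    )
    client.testcases.create(
        testset_id=testset.id,
        items=[
            {"json_data": {
                "original": "We need your feedback by EOD.",
                "tone": "polite",
                "idealRewritten": "When you have a moment today, we'd love your feedback.",
            }},
            {"json_data": {
                "original": "The meeting is cancelled.",
                "tone": "cheerful",
                "idealRewritten": "Good news: the meeting is off, so your afternoon is free!",
            }},
        ],
    )
    print(f"Testset ID: {testset.id}")  # copy this ID for later steps
    return testset
```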
Create Metrics
From the Scorecard UI, create an AI-graded Metric named “Tone accuracy” for evaluating the tone translator system. Copy the Metric ID for later.
Creating a Metric in the Scorecard UI.
Evaluate system
Run the system against the Testset and Metrics you’ve created and record the results in Scorecard.
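The step above might be sketched as follows: create a run, execute the system on each Testcase, record the outputs, then mark the run ready for scoring. Every SDK call here (client.runs.create, client.testcases.list, client.records.create, client.runs.update) and the "awaiting_scoring" status string are assumptions about the scorecard-ai SDK; confirm them in the Scorecard API reference.

```python
def evaluate(project_id: str, testset_id: str, metric_id: str, run_system):
    # Assumed scorecard-ai SDK usage; method and field names are assumptions.
    from scorecard_ai import Scorecard

    client = Scorecard()
    run = client.runs.create(
        project_id=project_id,
        testset_id=testset_id,
        metric_ids=[metric_id],
    )
    for testcase in client.testcases.list(testset_id=testset_id):
        outputs = run_system(testcase.inputs)  # e.g. {"rewritten": "..."}
        client.records.create(
            run_id=run.id,
            testcase_id=testcase.id,
            inputs=testcase.inputs,
            outputs=outputs,
        )
    # Mark the run ready for scoring in the Scorecard UI.
    client.runs.update(run_id=run.id, status="awaiting_scoring")
    print(f"Run ID: {run.id}")
    return run
```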
Run scoring
Click the link in the output above, or find the Run in the Scorecard UI. On the Run page, click the “Run Scoring” button to score your system using the Metric you created.
Analyze results
Finally, review the results in Scorecard to understand the performance of the tone translator system.
Viewing results in the Scorecard UI.