Scorecard helps you evaluate the performance of your LLM app so you can ship faster with more confidence!
In this quickstart we will:
- Set up Scorecard
- Create a Testset
- Create an example LLM app with OpenAI
- Define the evaluation setup
- Score the LLM app with the Testset
- Review evaluation results in the Scorecard UI
Follow along in the Google Colab notebook.
Steps
Setup
First let’s create a Scorecard account and find the SCORECARD_API_KEY
in the settings. Since this example creates a simple LLM application using OpenAI, get an OpenAI API key. Set both API keys as environment variables as shown below. Additionally, install the Scorecard and OpenAI Python libraries:
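Assuming the Scorecard SDK is published on PyPI as scorecard-ai (install with `pip install scorecard-ai openai`), a minimal setup sketch looks like this; the key values are placeholders:

```python
import os

# Placeholders -- paste your real keys here, or export them in your shell instead.
os.environ["SCORECARD_API_KEY"] = "your-scorecard-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```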
Set Up Constants
Configure the Scorecard client and enter your Project’s ID. We’ll need that to know which Project to place our Testsets and Runs in.
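A minimal sketch of the client setup is below; the import path and constructor are assumptions about the Scorecard Python SDK, and the Project ID is a placeholder you should copy from the Scorecard UI:

```python
import os
from scorecard_ai import Scorecard  # assumed import path for the Scorecard SDK

# Placeholder -- copy your Project's ID from the Scorecard UI.
PROJECT_ID = "your-project-id"

# Assumed constructor; the client may also read SCORECARD_API_KEY from the environment.
client = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])
```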
Create a Testset and Add Testcases
Now, let’s create a Testset in a specific project and add some Testcases using the SDK. A Testset is a collection of Testcases used to evaluate the performance of an LLM application across a variety of inputs and scenarios. A Testcase is a single input to an LLM that is used for scoring. After we create a Testset, we’ll grab its ID to use later.
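A sketch of this step is below; the resource-style method names (testsets.create, testcases.create) and their fields are assumptions, so check the Scorecard SDK reference for the exact calls in your version:

```python
# Create a Testset inside the Project (method name and fields are assumptions).
testset = client.testsets.create(
    project_id=PROJECT_ID,
    name="Tone rewriting",
    description="Messages to rewrite into a specified tone",
)

# Add a few Testcases -- each one is a single input to the LLM app.
examples = [
    {"original": "We need your feedback on the new designs ASAP.", "tone": "polite"},
    {"original": "I can't make the meeting tomorrow.", "tone": "formal"},
]
for example in examples:
    client.testcases.create(testset_id=testset.id, json_data=example)

# Grab the Testset ID to use later.
TESTSET_ID = testset.id
```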
Create a Simple LLM App
Next, let’s create a simple LLM application that we will evaluate using Scorecard. The application is represented by the following function, which uses OpenAI’s GPT-4o mini to translate the user’s message into a specified tone.
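A sketch of such a function is below; the function name run_system and its parameters are illustrative, while the calls follow the standard openai Python SDK:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_system(original: str, tone: str) -> str:
    """Translate the user's message into the specified tone using GPT-4o mini."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Rewrite the user's message in a {tone} tone."},
            {"role": "user", "content": original},
        ],
    )
    return response.choices[0].message.content
```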
Create Metrics
Now that we have a system that rewrites user messages, let’s build a metric to understand how relevant the system responses are to the user’s query. Let’s go to the Scoring Lab and select “New Metric”.
Scorecard UI: New Metric
From here, let’s create a metric for answer relevancy:
Scorecard UI: Metric Definition
You can evaluate your LLM systems with one or multiple metrics. For this quickstart, let’s just use the Answer Relevancy metric and grab its Metric ID for later.
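For example, as a placeholder constant (the value is hypothetical; copy the real ID from the Scoring Lab):

```python
# Placeholder -- replace with the Answer Relevancy Metric ID from the Scoring Lab.
METRIC_ID = "your-metric-id"
```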
Create Test System
Now let’s run our Testset against the LLM system we just built, replacing the Testset ID below with the one from before and the Metric ID with the one from above:
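A hedged sketch of this step is below; the run- and record-related method names (runs.create, testcases.list, records.create) and their parameters are assumptions, so consult the Scorecard SDK reference for the exact calls in your version:

```python
# Create a Run that ties the Testset and Metric together (assumed method and fields).
run = client.runs.create(
    project_id=PROJECT_ID,
    testset_id=TESTSET_ID,
    metric_ids=[METRIC_ID],
)

# Execute the LLM app on every Testcase and record each output against the Run.
for testcase in client.testcases.list(testset_id=TESTSET_ID):
    output = run_system(testcase.json_data["original"], testcase.json_data["tone"])
    client.records.create(
        run_id=run.id,
        testcase_id=testcase.id,
        outputs={"response": output},
    )
```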
Run Scoring
Now let’s review the outputs of our execution in the Scorecard UI and run scoring by clicking on the “Run Scoring” button.
Scorecard UI: Run Scoring
View Results
Finally, let’s review the results in the Scorecard UI. Here you can view and understand the performance of your LLM system:
Scorecard UI: Viewing Results