Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.
When starting out with an LLM application, you begin with a “vibe check” by running a few prompts manually to see if the responses make sense. Then, you take it a step further by manually testing a set of your favorite prompts after each LLM iteration. However, you might wonder how to quantify and consistently test the quality of LLM responses. This is when you need metrics. A metric is a standard of measurement. In LLM evaluation, a metric serves as a benchmark for assessing the quality of LLM responses.
Scorecard helps you define metrics for LLM evaluation in several ways. Let’s look at some of the advantages Scorecard offers!
The Scorecard team consists of experts in LLM evaluation, with extensive experience in assessing and deploying large-scale AI applications at some of the world’s leading companies. Our Scorecard Core Metrics, validated by our team, represent industry-standard benchmarks for LLM performance. You can explore these Core Metrics in the Scoring Lab, under “Metric Templates,” to select the ones best suited for your LLM evaluation needs.
Overview of Scorecard Core Metrics in Metric Templates
If you would like to use a specific Core Metric, simply make a copy of it and, if needed, edit its details before importing it into your metrics section.
Metric Template to Copy
Scorecard supports running MLflow metrics directly within the Scorecard platform and adds complementary capabilities such as aggregation, A/B comparison, iteration, and more. Several metrics from MLflow’s genai package are available in the Scorecard metrics library. These metrics include:
You can view the full prompts for these metrics in the Scorecard Scoring Lab or in the MLflow GitHub repository.
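For reference, this is roughly what invoking these judges looks like with MLflow’s own Python API. It is a minimal sketch of a static-dataset evaluation: the judge model URI, the example data, and the column mapping are assumptions you would adapt to your setup (the default judge requires an OpenAI API key).

```python
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_relevance, faithfulness

# Static evaluation data: model inputs, model outputs, and the retrieved context.
eval_df = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "predictions": ["MLflow is an open-source platform for managing the ML lifecycle."],
    "context": ["MLflow is an open source platform for the machine learning lifecycle."],
})

# LLM-judged metrics; the judge model URI is an assumption, substitute your own.
relevance_metric = answer_relevance(model="openai:/gpt-4o")
faithfulness_metric = faithfulness(model="openai:/gpt-4o")

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",  # column holding the LLM outputs
    extra_metrics=[relevance_metric, faithfulness_metric],
    # Point the faithfulness judge at the context column (identity mapping here).
    evaluator_config={"col_mapping": {"context": "context"}},
)
print(results.metrics)
```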
You can also use metrics from the RAGAS framework, which specializes in evaluating Retrieval Augmented Generation (RAG) pipelines. As with the MLflow metrics, Scorecard provides complementary capabilities such as aggregation, A/B comparison, iteration, and more.
To assess the performance of individual components within a RAG pipeline, you can leverage metrics such as:
To assess the end-to-end performance of a RAG pipeline, you can leverage metrics such as:
You can view the full prompts for these metrics in the Scorecard Scoring Lab or in the RAGAS GitHub repository.
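For reference, the sketch below shows how these metrics are typically invoked with the RAGAS library itself (0.1-style API). The column names and the default OpenAI judge are assumptions that vary by RAGAS version; adapt them to your pipeline.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# A tiny evaluation dataset in the column format RAGAS expects (assumed names).
eval_data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["Our warranty covers manufacturing defects for a period of two years."]],
    "ground_truth": ["Manufacturing defects, for two years."],
})

# Component-level metrics (context_precision, context_recall) alongside
# end-to-end metrics (faithfulness, answer_relevancy).
result = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)
```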
Adding a New Custom Metric in Scoring Lab
Scorecard also offers the flexibility to define any custom metric unique to your LLM application. At the top of the metrics section in the Scoring Lab, click the “+ New Metric” button and fill out the details of your metric; an illustrative example follows the list and screenshot below. Information needed includes:
Metric Name: a human-readable name for your metric.
Metric Guidelines: natural language instructions to define how a metric should be computed.
Evaluation Type: how your metric will be evaluated.
Output Type: the output type of your metric.
New Custom Metric
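For illustration, here is how those fields might be filled in for a hypothetical “Conciseness” metric. This is only a sketch of the form’s contents, not a Scorecard SDK call, and the example values for Evaluation Type and Output Type are assumptions.

```python
# Hypothetical values for the "+ New Metric" form (illustrative only, not an API payload).
conciseness_metric = {
    "metric_name": "Conciseness",
    "metric_guidelines": (
        "Rate how concise the response is on a 1-5 scale. "
        "5 = no redundant or off-topic content; 1 = mostly filler or repetition."
    ),
    "evaluation_type": "AI-graded",  # assumption: other evaluation types may be available
    "output_type": "integer (1-5)",  # assumption: other output types may be available
}
```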
Once you have defined all the metrics you need to properly assess the quality of your LLM application, you can group them together to form a Scoring Config. A Scorecard Scoring Config is a collection of metrics that are used together to score an LLM application. This Scoring Config can be run routinely against your LLM application to yield a consistent measure of the performance and quality of its responses. Instead of manually selecting the same metrics every time you score an LLM application for a particular use case, defining this set once makes it easy to repeatedly score in the same way with multiple metrics. Different Scoring Configs can serve different purposes while ensuring consistency in testing, e.g. a Scoring Config for RAG applications, a Scoring Config for translation use cases, etc.
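Conceptually, a Scoring Config is simply a named, reusable bundle of metrics. The sketch below is purely illustrative, with hypothetical names rather than the Scorecard API:

```python
# Illustrative only: one Scoring Config per use case keeps scoring consistent across runs.
rag_scoring_config = {
    "name": "RAG quality",
    "description": "Run after every change to the retrieval pipeline.",
    "metrics": ["Faithfulness", "Answer Relevancy", "Context Precision", "Conciseness"],
}

translation_scoring_config = {
    "name": "Translation quality",
    "description": "Run for the translation use case.",
    "metrics": ["Fluency", "Adequacy", "Terminology Consistency"],
}
```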
Overview of Scoring Configs in Scoring Lab
You can get an overview of your existing Scoring Configs in the Scoring Lab under “Scoring Configs”. In this tab, you can also define a new Scoring Config by providing the following information:
Defining a New Scoring Config
After specifying a name and description, select metrics by clicking on “Select Metrics”. On the next page, select one or more metrics to use for the Scoring Config.
Selecting Metrics for a New Scoring Config