Questions? Contact Us

Here you'll find guides and resources to help you evaluate your LLM applications with Scorecard.

Overview

Scorecard AI Docs

Evaluate a simple LLM system with Scorecard in minutes.

Quickstart

Systems: Beyond Simple Prompts

Scorecard is a platform that helps product teams and engineers build, evaluate, and deploy Large Language Model (LLM) applications to production.

What Is Scorecard?

Let’s walk through the journey of an LLM Developer, from start to finish. Buckle up! 🤠

The LLM Developer's Journey

Technical System Overview

Projects

The Playground is where you test and refine your AI prompts using real data. It's a three-panel interface that lets you select test data, edit prompts with Jinja templating, configure AI models, and see results in real time.

Playground

Prompt Management

Tracing

Preserving the privacy of our clients and ensuring secure processes are top priorities at Scorecard.

Privacy By Design

Automated Scoring

Best In Class Metrics

Imagine these scenarios: The performance of your LLM application drops suddenly, you lose track of which LLM calls you have made, or you do not notice potential security leaks. 

Logging

A Scorecard Testset is a collection of Testcases used to evaluate the performance of an LLM application across a variety of inputs and scenarios.

Testset Management

During the development of LLM applications, it is common practice to iteratively adjust the system to find the optimal setup that produces the best results. 

A/B Comparison
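As a rough illustration of what comparing two setups involves, the sketch below scores two configurations on the same testcases and reports the mean score and per-testcase wins. The score lists are made-up illustrative values and the function name `compare` is hypothetical; this is not Scorecard's comparison logic.

```python
from statistics import mean
from typing import Dict, List

def compare(scores_a: List[float], scores_b: List[float]) -> Dict[str, float]:
    """Compare two setups scored on the same testcases, in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both setups must be scored on the same testcases")
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        # Count testcases where one setup strictly beats the other.
        "wins_a": sum(a > b for a, b in zip(scores_a, scores_b)),
        "wins_b": sum(b > a for a, b in zip(scores_a, scores_b)),
    }

# Illustrative scores only (e.g. a 1-5 rating per testcase).
summary = compare([4, 5, 3, 4], [5, 5, 4, 4])
```

Comparing per-testcase wins alongside the mean helps show whether an improvement is broad or driven by a few testcases.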

In order to develop controllable and safe LLM applications, you can integrate Scorecard with the open-source toolkits [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) and [Guardrails AI](https://github.com/guardrails-ai/guardrails). 

AI Guardrails

Run Inspection

Scorecard uses the open-source AI proxy LiteLLM to let you manage multiple LLM APIs and easily swap out different AI models.

AI Proxy

Welcome to the Scorecard Cookbooks. Here you’ll find code and guides for accomplishing common tasks with the Scorecard API and SDKs.

Cookbooks

Retrieval Augmented Generation (RAG) is the use of retrieval methods (e.g. via search and vector stores) to provide generative models with additional context, or *grounding*.

RAG: Retrieval Augmented Generation
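To make the idea of grounding concrete, here is a minimal sketch that retrieves the most relevant documents by naive keyword overlap and places them in the prompt as context. Real systems typically use search or vector stores instead; the functions `retrieve` and `build_prompt` and the sample documents are hypothetical and for illustration only.

```python
from typing import List

def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, documents: List[str]) -> str:
    """Prepend the retrieved context to the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up sample documents.
docs = [
    "Refunds are processed within 5 days.",
    "Our office is in Berlin.",
    "Refunds require a receipt.",
]
prompt = build_prompt("how are refunds processed", docs)
```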

CI/CD with GitHub Actions

Browse the API documentation and integrate with Scorecard's API endpoints.

API Reference

Create Project

Retrieve a paginated list of all Projects. Projects are ordered by creation date, with oldest Projects first.

List Projects
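Paginated list endpoints such as this one are usually consumed by requesting pages until none remain. The sketch below assumes cursor-based paging; the response shape (`data`, `next_cursor`) and the `fetch_page` callable are hypothetical stand-ins, not Scorecard's documented API, so check the API Reference for the actual parameters.

```python
from typing import Callable, Dict, Iterator, Optional

def iter_all(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every item by following cursors until the last page."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["data"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# Stand-in for a real HTTP call so the sketch runs offline.
_FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "next_cursor": "p2"},
    "p2": {"data": [{"id": "3"}], "next_cursor": None},
}

def fake_fetch(cursor: Optional[str]) -> Dict:
    return _FAKE_PAGES[cursor]

project_ids = [project["id"] for project in iter_all(fake_fetch)]
```

The same helper would apply to the other paginated list endpoints, given a matching `fetch_page` implementation.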

Create a new Testset for a Project. The Testset will be created in the Project specified in the path.

Create Testset

Retrieve a paginated list of Testsets belonging to a Project.

List Testsets in Project

Get Testset

Update a Testset. Only the fields provided in the request body will be updated.
If a field is provided, the new content will replace the existing content.
If a field is not provided, the existing content will remain unchanged.

When updating the schema:
- If field mappings are not provided and existing mappings reference fields that no longer exist, those mappings will be automatically removed
- To preserve all existing mappings, ensure all referenced fields remain in the updated schema
- For complete control, provide both schema and fieldMapping when updating the schema

Update Testset
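The update rules above can be modeled locally as a shallow merge plus pruning of stale mappings. This is a sketch of the described semantics, not the server's implementation: the shape of `fieldMapping` (a role mapped to a list of field names) and the use of a JSON Schema `properties` object are assumptions.

```python
from typing import Dict

def apply_testset_update(existing: Dict, patch: Dict) -> Dict:
    """Model a partial update: provided fields replace, others are kept."""
    updated = {**existing, **patch}
    # If the schema changed and no mapping was supplied, drop mappings
    # that reference fields missing from the new schema.
    if "schema" in patch and "fieldMapping" not in patch:
        fields = set(updated["schema"].get("properties", {}))
        updated["fieldMapping"] = {
            role: [name for name in names if name in fields]
            for role, names in existing.get("fieldMapping", {}).items()
        }
    return updated

testset = {
    "name": "Support QA",
    "schema": {"properties": {"question": {}, "answer": {}}},
    "fieldMapping": {"inputs": ["question"], "expected": ["answer"]},
}
# Only the schema is provided; "answer" no longer exists in it.
result = apply_testset_update(testset, {"schema": {"properties": {"question": {}}}})
```

Here `name` is left unchanged, and the mapping to the removed `answer` field is dropped while the mapping to `question` survives.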

Delete Testset

Create multiple Testcases in the specified Testset.

Create multiple Testcases

Retrieve a paginated list of Testcases belonging to a Testset.

List Testcases in Testset

Get Testcase

Replace the data of an existing Testcase while keeping its ID.

Update Testcase
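Unlike a partial update, replacing a Testcase's data discards keys that are absent from the new data. A minimal sketch of that behavior, assuming a Testcase is represented as a dict with `id` and `data` keys (a hypothetical shape):

```python
from typing import Dict

def replace_testcase_data(testcase: Dict, new_data: Dict) -> Dict:
    """Swap in new data wholesale while keeping the Testcase ID."""
    return {**testcase, "data": dict(new_data)}

before = {"id": "tc_1", "data": {"question": "Hi?", "answer": "Hello"}}
after = replace_testcase_data(before, {"question": "How are you?"})
```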

Delete multiple Testcases

Create a new Metric for evaluating system outputs. The structure of a metric depends on the evalType and outputType of the metric.

Create Metric

Update an existing Metric. You must specify the evalType and outputType of the metric. The structure of a metric depends on the evalType and outputType of the metric.

Update Metric

Create Run

Create Record

Create or update a Score for a given Record and MetricConfig. If a Score with the specified Record ID and MetricConfig ID already exists, it will be updated. Otherwise, a new Score will be created. The score provided should conform to the schema defined by the MetricConfig; otherwise, validation errors will be reported.

Upsert Score
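The create-or-update behavior can be pictured as a store keyed by the pair of Record ID and MetricConfig ID. The in-memory sketch below illustrates only that keying; the function name, ID values, and score payloads are hypothetical, and schema validation is omitted.

```python
from typing import Dict, Tuple

# One Score per (record_id, metric_config_id) pair.
scores: Dict[Tuple[str, str], Dict] = {}

def upsert_score(record_id: str, metric_config_id: str, score: Dict) -> str:
    """Create the Score if the pair is new, otherwise overwrite it."""
    key = (record_id, metric_config_id)
    outcome = "updated" if key in scores else "created"
    scores[key] = score
    return outcome

first = upsert_score("rec_1", "mc_1", {"value": 3})
second = upsert_score("rec_1", "mc_1", {"value": 5})
```

Because the pair is the key, repeating the call is safe: it never produces a duplicate Score for the same Record and MetricConfig.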

Create a new system definition that specifies the interface contracts for a component you want to evaluate.

A system acts as a template that defines three key contracts through JSON Schemas:
1. Input Schema: What data your system accepts (e.g., user queries, context documents)
2. Output Schema: What data your system produces (e.g., responses, confidence scores)
3. Config Schema: What parameters can be adjusted (e.g., model selection, temperature)

This separation lets you evaluate any system as a black box, focusing on its interface rather than implementation details.

Create system
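For illustration, a system definition with its three contracts might be assembled like this. The key names `inputSchema` and `outputSchema` are assumed by analogy with `configSchema`, and the system name and schema contents are made up; consult the endpoint's reference page for the actual request body.

```python
import json

system_definition = {
    "name": "support-chatbot",  # hypothetical name
    # What data the system accepts.
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    # What data the system produces.
    "outputSchema": {
        "type": "object",
        "properties": {"response": {"type": "string"}},
        "required": ["response"],
    },
    # What parameters can be adjusted.
    "configSchema": {
        "type": "object",
        "properties": {
            "model": {"type": "string"},
            "temperature": {"type": "number"},
        },
    },
}

# Serialized form, as it might be sent in a request body.
body = json.dumps(system_definition, indent=2)
```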

Retrieve a paginated list of all systems. Systems are ordered by creation date.

List systems

Get system

Update an existing system definition. Only the fields provided in the request body will be updated.
If a field is provided, the new content will replace the existing content.
If a field is not provided, the existing content will remain unchanged.

When updating schemas:
- The system will accept your changes regardless of compatibility with existing configurations
- Schema updates won't invalidate existing evaluations or configurations
- For significant redesigns, creating a new system definition provides a cleaner separation

Update system

Delete a system definition by ID. This will not delete associated system versions.

Delete system

Create a new version for a system.

Each version contains specific parameter values that match the system's `configSchema` - things like model parameters, thresholds, or processing options.
Once created, versions cannot be modified, ensuring stable reference points for evaluations.

When creating a system version:
- The `config` object is validated against the parent system's `configSchema`.
- System versions with validation errors are still stored, with errors included in the response.
- Validation errors indicate fields that don't match the schema but don't prevent creation.
- Having validation errors may affect how some evaluation metrics are calculated.

Create system version
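The validation behavior described above, where errors are reported but do not block creation, can be sketched as follows. The checker handles only `required` and a few primitive `type` keywords from JSON Schema, and the `validationErrors` key and function names are hypothetical; the service's real validator and response shape may differ.

```python
from typing import Any, Dict, List

_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def _matches(value: Any, expected: str) -> bool:
    # bool is a subclass of int in Python, so reject it for numeric types.
    if expected in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, _TYPES[expected])

def validate_config(config: Dict, config_schema: Dict) -> List[str]:
    """Return human-readable mismatches between config and schema."""
    errors = []
    for name in config_schema.get("required", []):
        if name not in config:
            errors.append(f"{name}: required field is missing")
    properties = config_schema.get("properties", {})
    for name, value in config.items():
        expected = properties.get(name, {}).get("type")
        if expected in _TYPES and not _matches(value, expected):
            errors.append(f"{name}: expected {expected}")
    return errors

def create_system_version(config: Dict, config_schema: Dict) -> Dict:
    """Always store the version; attach any validation errors to it."""
    return {"config": config, "validationErrors": validate_config(config, config_schema)}

schema = {"properties": {"model": {"type": "string"}, "temperature": {"type": "number"}}}
version = create_system_version({"model": "example-model", "temperature": "hot"}, schema)
```

The version is returned even though `temperature` fails validation, mirroring the rule that errors are reported without preventing creation.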

Retrieve a paginated list of system versions for a specific system.

System versions provide concrete parameter values for a System Under Test, defining exactly how the system should be configured during an evaluation run.

List system versions

Retrieve a specific system version by ID.

Introduction

Features

How To Use Scorecard

Overview

Quickstart

API Reference

Cookbooks

Guides

Scorecard Overview

Features

Privacy & Security

Service Status
