Review Your Evaluation Runs and Analyze Results
You have defined individual Testcases and grouped them into a Testset. You have also created several metrics that evaluate your LLM application along different dimensions, either selecting them individually for scoring or grouping them into a Scoring Config. Now that automated scoring has run, the question is: what’s next? Let’s review our runs and analyze the results!
Inspect Scoring Results in Scorecard
You can find an overview of all your past runs in the “Runs & Results” tab of the Scorecard UI. The following information is displayed for each run:
- Run ID
- Timestamp of Run Creation
- Run Status
  - Awaiting Scoring: The model has generated a response for each Testcase, but the responses have not yet been scored.
  - Awaiting Human Scoring: AI-powered scoring is complete, and subject-matter experts still need to score the manually scored metrics.
  - Completed: All Testcases have been scored for all metrics.
- Testset used
- Model parameters used
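For a concrete mental model, one row of this overview could be represented as a record like the one below. This is an illustrative sketch only; the class, field, and status names are assumptions, not the Scorecard API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class RunStatus(Enum):
    """Assumed run statuses, mirroring the list above."""
    AWAITING_SCORING = "Awaiting Scoring"
    AWAITING_HUMAN_SCORING = "Awaiting Human Scoring"
    COMPLETED = "Completed"


@dataclass
class RunOverview:
    """Illustrative shape of one row in the "Runs & Results" overview."""
    run_id: str
    created_at: datetime
    status: RunStatus
    testset_name: str
    model_params: dict


run = RunOverview(
    run_id="run_123",                                   # hypothetical run ID
    created_at=datetime.now(timezone.utc),              # timestamp of run creation
    status=RunStatus.AWAITING_HUMAN_SCORING,
    testset_name="customer-support-v2",                 # hypothetical Testset name
    model_params={"model": "gpt-4", "temperature": 0.2},
)
print(run.run_id, run.status.value)  # run_123 Awaiting Human Scoring
```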
Inspect Metric Results
Clicking the “Results” button of a run displays the results of the scored metrics in individual metric visualizations. In addition to bar charts showing the distribution of scores, out-of-the-box statistics such as the mean and median are calculated for each metric.
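To make these statistics concrete, the short sketch below computes the same kind of summary (mean, median, and the score distribution behind a bar chart) from a plain list of metric scores. The scores are invented for illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical 1-5 scores for one metric across the Testcases of a run.
scores = [4, 5, 3, 4, 2, 5, 4, 1, 3, 4]

print(f"mean:   {mean(scores):.2f}")  # 3.50
print(f"median: {median(scores)}")    # 4.0

# The distribution behind the bar chart: Testcase count per score value.
distribution = Counter(scores)
for value in sorted(distribution):
    count = distribution[value]
    print(f"score {value}: {'#' * count} ({count})")
```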
Filter Metric Scores
If you want to examine Testcases that performed particularly poorly or particularly well, click a bar in the bar chart to filter the results down to the Testcases in that score bucket.
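Conceptually, clicking a bar is equivalent to filtering the results to the Testcases whose score falls in that bucket. A rough equivalent in code, using made-up result records rather than the Scorecard API:

```python
# Hypothetical per-Testcase results for one metric.
results = [
    {"testcase_id": "tc_01", "score": 1},
    {"testcase_id": "tc_02", "score": 5},
    {"testcase_id": "tc_03", "score": 2},
    {"testcase_id": "tc_04", "score": 5},
]

# "Click" on the bar for score 5: keep only Testcases in that bucket.
top_scorers = [r for r in results if r["score"] == 5]

# Or look at the low end to find Testcases that need attention.
low_scorers = [r for r in results if r["score"] <= 2]

print([r["testcase_id"] for r in top_scorers])  # ['tc_02', 'tc_04']
print([r["testcase_id"] for r in low_scorers])  # ['tc_01', 'tc_03']
```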
Inspect Individual Testcase Results
The individual Testcase results show each input and output, the score for each metric, and model debug information such as latency and cost.
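As a rough mental model, an individual Testcase result bundles the input, the generated output, the per-metric scores, and the debug information mentioned above. The shape and field names below are assumptions for illustration, not the Scorecard API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TestcaseResult:
    """Illustrative shape of one Testcase result."""
    testcase_id: str
    input_text: str
    output_text: str
    scores: dict[str, float] = field(default_factory=dict)  # metric name -> score
    latency_ms: float | None = None   # model debug info: response latency
    cost_usd: float | None = None     # model debug info: inference cost


result = TestcaseResult(
    testcase_id="tc_01",
    input_text="How do I reset my password?",
    output_text="You can reset it from the account settings page ...",
    scores={"helpfulness": 4.0, "tone": 5.0},
    latency_ms=820.0,
    cost_usd=0.0031,
)
print(result.scores["helpfulness"], result.latency_ms)  # 4.0 820.0
```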
Inspect Run Performance
In addition to the per-metric results, the “Run performance” tab provides a visual overview of each run’s overall performance.
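As a rough sketch of what a run-level view aggregates, the example below reduces each run to one mean score per metric, which also makes two runs easy to compare. The run IDs, metric names, and scores are invented.

```python
from statistics import mean

# Hypothetical per-metric score lists for two runs over the same Testset.
runs = {
    "run_122": {"helpfulness": [3, 4, 3, 2, 4], "tone": [4, 4, 5, 3, 4]},
    "run_123": {"helpfulness": [4, 5, 4, 3, 4], "tone": [4, 5, 5, 4, 4]},
}

# Collapse each run into one mean score per metric for a side-by-side view.
for run_id, metrics in runs.items():
    summary = {name: round(mean(values), 2) for name, values in metrics.items()}
    print(run_id, summary)
# run_122 {'helpfulness': 3.2, 'tone': 4.0}
# run_123 {'helpfulness': 4.0, 'tone': 4.4}
```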
View Run Details
Click the “Show Details” button to inspect the details of the selected run.