June 13, 2025

🎯 Metric API

We’ve launched new API endpoints for programmatic metric management, enabling teams to create and update metrics directly through our SDK. Create metrics with full control over evaluation type, output format, and prompt templates:

# Assumes an initialized SDK client, e.g. client = Scorecard(api_key=...)
client.metrics.create(
    project_id="314",
    name="Response Accuracy",
    eval_type="ai",          # LLM-evaluated metric
    output_type="boolean",   # pass/fail score
    prompt_template="Evaluate if the following response is factually accurate: {{outputs.response}}",
)
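
Updating an existing metric works through the same endpoints. The sketch below is illustrative only; the method name and accepted fields are assumptions, so check the SDK reference for the exact signature:

# Hypothetical sketch: method and parameter names may differ from the actual SDK.
client.metrics.update(
    metric_id="123",  # placeholder ID of the metric to update
    name="Response Accuracy (strict)",
    prompt_template="Is the response factually accurate and free of unsupported claims? {{outputs.response}}",
)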

These endpoints are supported in SDK versions 1.1.0 (JS) and 2.1.0 (Python).

🎯 Improved API key format

We switched to a new API key format that supports multiple API keys, key revocation, and expiry dates. The new keys are far more concise (94% shorter!) than the previous unwieldy JWT-based keys. We’ve deprecated the old format, but will continue to support existing keys until July 20, 2025.

Bug fixes and improvements

  • [Metrics] Improved metrics management UI with tabbed sections for better organization
  • [Metrics] Added metric selector with searchable dropdown - no more manually typing metric IDs

  • [Runs] On the test record details page, you can now see the full text of the compiled metric prompts sent to the LLM for evaluation

June 10, 2025

🚀 SDK v2 Stable Release

Following our May 30th launch, SDK v2 is now fully stable and battle-tested. The new SDK features simplified APIs with ergonomic helper methods like runAndEvaluate (JS/TS) and run_and_evaluate (Python), making it easier than ever to integrate evaluations into your workflow.

We’ve made the SDK even more flexible: testset_id is now optional in our helper methods, so you can run evaluations with custom inputs without requiring a pre-defined testset (see the sketch below). Additional improvements include:

  • System configurations for experimenting with different model settings
  • Enhanced error handling and debugging capabilities
  • Full TypeScript support with comprehensive type definitions

Available now on npm and PyPI!
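
As a rough illustration of running an evaluation without a testset, here is a minimal Python sketch. The helper name run_and_evaluate comes from the SDK, but the import paths and parameter names below are assumptions rather than the exact signature:

# Hypothetical sketch: import paths and parameters are assumed, not the exact SDK signature.
from scorecard_ai import Scorecard              # assumed import path
from scorecard_ai.lib import run_and_evaluate   # assumed helper location

client = Scorecard()  # assumes SCORECARD_API_KEY is set in the environment

def my_system(inputs: dict) -> dict:
    # Replace with a call into your own application or model pipeline.
    return {"response": f"You asked: {inputs['question']}"}

result = run_and_evaluate(
    client=client,
    project_id="314",
    metric_ids=["123"],
    testcases=[{"inputs": {"question": "What is the capital of France?"}}],  # ad-hoc inputs, no testset_id
    system=my_system,
)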

🎯 Basic and Advanced Metric Modes

We’ve introduced a two-mode system for creating and editing metrics, making Scorecard accessible to users at every technical level.

Basic mode simplifies metric creation by focusing solely on the evaluation guidelines - just describe what you want to measure in plain language without worrying about prompt templates or variables. Advanced mode gives power users full control over the entire prompt template, including variable handling and custom formatting.

Bug fixes and improvements

  • [Documentation] Streamlined quickstart guide for faster onboarding
  • [Playground] More detailed error messages and prominent results table button
  • [Runs] Delete runs directly from the runs list page
  • [Runs] Run status now updates automatically based on scoring progress
  • [Platform] Fixed data invalidation bugs when managing metrics and testcases

May 30, 2025

🔧 Metric Groups

We’ve introduced metric groups (formerly scoring configs) to streamline how you manage and organize your evaluation metrics. Create custom groups of metrics that can be applied to runs with a single selection, making it easier to maintain consistent evaluation standards across your projects. Manage metric groups through an intuitive UI and apply them directly when triggering runs.

🧪 New Playground Experience

The new Scorecard Playground is here! Built for product teams including prompt engineers, product managers, and subject matter experts, it provides a powerful environment to iterate on prompts and test them against multiple inputs simultaneously. Experience smart variable detection with autocomplete support for Jinja syntax, making template creation effortless. Watch your templates come to life with live preview that shows compiled output as you type. Key capabilities include:

  • Batch testing - run prompts against entire testsets with one click
  • Model configuration with customizable temperature and parameters
  • Persistent state - your playground configuration is saved in the URL

Access the playground from any project and start experimenting with your prompts immediately.

May 23, 2025

🎉 New Homepage Launch!

We’re excited to announce the launch of our new Scorecard homepage! Explore the refreshed website, with quick access to our docs and product and a new visual design that shows how Scorecard can support you as you build your AI product.

⚙️ Fully Configurable Metrics

We’ve launched two powerful enhancements for your metrics:

  • Fully customize your evaluation model: choose GPT-4o, Claude-3, Gemini, and more, or bring your own hosted model
  • Metrics now also support structured outputs, significantly increasing the reliability of your scores (illustrated below)
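
For context, structured outputs mean the evaluation model returns its score as machine-readable JSON instead of free text, which is what makes the scores more reliable to parse. A purely hypothetical boolean metric result might be shaped like this (field names are illustrative, not Scorecard’s actual format):

# Illustrative only: the real field names in Scorecard's structured scores may differ.
structured_score = {
    "score": True,  # the boolean metric result
    "reasoning": "The response matches the reference answer and contains no unsupported claims.",
}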

May 2, 2025

🔌 Scorecard MCP Server

We’ve published an MCP server for Scorecard, enabling powerful new integration possibilities! MCP is an open protocol that standardizes how applications provide context to LLMs. With it, your Scorecard evaluations can connect seamlessly to AI systems, letting you integrate your evaluation data with a variety of AI tools through a single protocol rather than maintaining a separate integration for each service.

💯 Scorecard Eval Day!

We recently held our first Scorecard Evaluation Day with a cohort of inception-stage founders who are deeply invested in their AI agents’ quality. During the event, we exchanged valuable ideas around evaluation goals and methodologies, had thoughtful discussions about defining meaningful metrics, and explored approaches to evaluation (including the Scorecard method) and integrating evaluation into CI/CD pipelines. Thanks to everyone who attended! We greatly appreciate the insightful feedback from these teams and have already implemented several improvements to our UI flows and SDK integration patterns based on your input.

🔍 In-App Docs Search

We’ve launched in-app documentation search, making it easier to find exactly what you need without leaving the platform. You can now quickly search all of Scorecard’s documentation directly within the application. Open it by pressing Cmd+K or clicking the Search docs button in the top right.

👷 SDK v2 Beta

We’ve added ergonomic methods to our SDKs to make integration even more seamless. The helper functions runAndEvaluate in the JS/TS SDK and run_and_evaluate in the Python SDK let you easily evaluate systems against testcases and metrics.

🔧 Scorecard Playground Scheduled Maintenance

We are taking the Scorecard Cloud playground offline for maintenance and upgrades starting May 7th. Note this will not affect custom integrations (e.g. GitHub kickoff). Please reach out to the team at team@scorecard.io or via direct message if you have a workflow that will be affected by this!

Bug fixes and improvements

  • [UI] Fixed record rendering in the UI when using the new SDK.
  • [Navigation] Repaired broken links in the Run grades table to Testcase and Record pages.
  • [Terminology] Standardized terminology by renaming “test record” to “record” throughout the UI.
  • [Documentation] Streamlined our quickstart documentation so you can get started with Scorecard in just 5 minutes.

April 25, 2025

New API and SDK

We’ve released the alpha of our new Scorecard SDKs, featuring streamlined API endpoints for creating, listing, and updating system configurations, as well as programmatic experiment execution. With this alpha, you can integrate scoring runs directly into your development and CI workflows, configure systems as code, and fully automate your evaluation pipeline without manual steps.
Our pre-release Python SDK (2.0.0-alpha.0) and JavaScript SDK (1.0.0-alpha.1) are now available.

New Quickstarts and Documentation

To support the SDK alpha, we’ve launched comprehensive SDK reference docs and concise quickstart guides that show you how to:

  • Install and initialize the TypeScript or Python SDK.

  • Create and manage system configurations in code.

  • Run your first experiment programmatically.

  • Retrieve and interpret run results within your applications.

Follow these step-by-step walkthroughs to get your first experiment up and running in under five minutes.
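
For orientation, a minimal setup might look like the sketch below. The package names, import path, and list call are assumptions here, so follow the quickstart guides for the exact commands:

# Hypothetical sketch: package names, import path, and the list call are assumptions.
#   pip install scorecard-ai       (Python pre-release)
#   npm install scorecard-ai      (JavaScript pre-release)

from scorecard_ai import Scorecard  # assumed import path

client = Scorecard(api_key="YOUR_SCORECARD_API_KEY")  # placeholder key
testsets = client.testsets.list(project_id="314")     # assumed call, used as a quick connectivity check
print(testsets)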

Bug fixes and improvements

  • [Performance] Reduced page load times and improved responsiveness when handling large run results for the run history table.

  • [UI] Removed metric-specific scoring progress and the scoring/execution start and end times, and improved how project names wrap across all screen sizes.

  • [Testsets] Resolved a bug in the new CSV upload flow

  • [Testsets] Added back the “Move testset to project” action

  • [Testsets] Archived testsets are hidden correctly, keeping your workspace clutter‑free.

  • [Reliability] Improved API reliability and workflow robustness: fixed run creation schema errors, streamlined testcase creation/duplication/deletion flows, and added inline schema validation to prevent submission errors

  • [Evals] Migrated from gpt-4-1106-preview (Nov 2023) to gpt-4o for scoring metrics

April 18, 2025

Run insights

We’ve added a new Run History chart on Runs & Results that visualizes your performance trends over time, so you can spot regressions or sustained improvements at a glance (up and to the right!). The x-axis is the run date, the y-axis is the mean score, and each metric gets its own colored line. You can view this by clicking the ‘All Runs’ tab of Runs & Results.

🧪 Testsets 2.0 for Easier Creation and Iteration

Testsets got a full upgrade. We reworked the creation flow, added AI-powered example generation, and streamlined testcase iteration. Filtering, sorting, editing, and bulk actions are now faster and more intuitive—so you can ship better tests, faster.

  • You can now create testsets with a simplified modal and generate relevant example testcases based on title and description.

  • Bulk editing tools make it easier to manage and update multiple testcases at once.

  • You can edit large JSON blobs inline in the testcase detail view, with improved scroll and copy behavior.

  • The testset detail page now shows the associated schema in context for easier debugging and review.

  • Navigation has improved with linked testset titles and run/testcase summaries directly accessible from the cards.

🗂️ Improved schema management

Schemas are now defined and managed per testset, rather than at the project level—giving teams more flexibility and control.

  • The schema editor has been redesigned, allowing teams to update schemas independently for each testset.
  • Schema changes now reflect immediately in the testcase table to help users see their impact in real time.
  • Users can view and copy raw schema JSON for integration with their own tools or SDKs.
  • We’ve also improved messaging in the schema editor to clarify the distinction between inputs and labels (illustrated in the sketch below).
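
To make the inputs-versus-labels distinction concrete, a testset schema separates the fields your system receives from the reference fields used during grading. The sketch below is purely illustrative and does not reflect Scorecard’s actual schema JSON:

# Hypothetical example: Scorecard's real schema format may use different keys and types.
testset_schema = {
    "inputs": {  # fields passed to your system under test
        "question": {"type": "string"},
    },
    "labels": {  # expected/reference fields used during evaluation
        "ideal_answer": {"type": "string"},
    },
}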

Bug fixes and improvements

  • [Platform] Internal tech stack upgrades to support faster product iteration
  • [Testsets] Friendlier zero-state, faster load, cards with quick actions and live counts
  • [Testsets] Tag propagation fix — updates now apply across all testsets and views
  • [Testsets] Improved sorting behavior, including reliable default and column sorting
  • [Testsets] Filter testcases by keyword, searching across the full dataset
  • [Testsets] Updated page actions with visible bulk tools for managing multiple testcases
  • [Testsets] Testset cards now link to runs/testcases, and support fast schema editing, duplication, or deletion
  • [Testsets] Titles now link directly to the testset detail page
  • [Testcases] Detail page supports editing and copying large JSON blobs
  • [Testcases] Schema panel added for better context while reviewing or editing testcases
  • [Projects] Enhanced cards with summaries for testsets, metrics, and runs, all linked for easier navigation
  • [Projects] Improved sorting with more intuitive labels and default order
  • [Projects] Faster load performance across the project overview page
  • [Schemas] Improved editor messaging to clarify the difference between input fields and labels
  • [Toast Messages] Now deep-link to newly created items (testsets, testcases, projects)
  • [Performance] Faster page loads, filtering, sorting, and table actions, powered by new APIs and backend improvements

April 11, 2025

Projects

We simplified project creation by adding a create project modal to the projects page and project detail pages.

SDKs

We’re working on overhauling our API and SDKs. We switched to using Stainless for SDK generation and released version 1.0.0-alpha.0 of our Node SDK. Over the next few weeks, we will stabilize the new API and Node and Python SDKs.

Bug fixes and improvements

  • [Playground] Filtering and searching by testcase now work properly. Pagination was also improved and now works as expected; previously it could show inconsistent items in some cases.
  • [Testsets] We fixed a bug where we exported empty testcase values as the string “null” instead of an empty string.
  • [Testcases] We fixed a bug that broke the Generate testcases feature.
  • [Settings] We added some text on the API keys page to clarify that your Scorecard API key is personal, but model API keys are scoped to the organization.

April 4, 2025

Tracing

We’ve significantly improved our trace management system by relocating traces within the project hierarchy for better organization. Users can now leverage robust search capabilities with full-text search across trace data, complete with highlighted match previews. The new date range filtering system offers multiple time range options from 30 minutes to all time, while project scope filtering allows viewing traces from either the current project or across all projects. We’ve enhanced data visualization with dynamic activity charts and improved trace tables for better insights. Our library support now focuses specifically on Traceloop, OpenLLMetry, and OpenTelemetry for optimal integration.

In addition, the trace system now includes intelligent AI span detection that automatically recognizes AI operations across different providers. Visual AI indicators with special badges clearly show model information at a glance. We’ve added test case generation capabilities that extract prompts and completions to easily create test cases. For better resource monitoring, token usage tracking provides detailed metrics for LLM consumption.

Examples repository

We’ve published comprehensive integration examples demonstrating OpenTelemetry configuration with Scorecard, including a Python Flask implementation with LLM tracing for OpenAI and a Node.js Express implementation with similar capabilities. A new setup wizard provides clear configuration instructions for popular telemetry libraries to help users get started quickly.
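
As a rough sketch of what the Python side of such an integration can look like with OpenLLMetry (Traceloop’s SDK) and the OpenAI client, note that the endpoint URL and auth header below are placeholders rather than confirmed Scorecard values; the setup wizard provides the real configuration:

# Minimal sketch, assuming OpenLLMetry (traceloop-sdk) and the OpenAI Python client.
# The endpoint and header values are placeholders, not confirmed Scorecard settings.
from openai import OpenAI
from traceloop.sdk import Traceloop

Traceloop.init(
    app_name="my-flask-app",
    api_endpoint="https://telemetry.example-scorecard-endpoint.io",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_SCORECARD_API_KEY"},      # placeholder auth header
)

openai_client = OpenAI()
completion = openai_client.chat.completions.create(  # traced automatically once Traceloop is initialized
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from a traced request!"}],
)
print(completion.choices[0].message.content)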

We also updated our quickstart documentation to be more comprehensive.

Bug fixes and improvements

  • [Scoring] When a run metric has not yet been scored, we now display N/A instead of NaN, making it clearer that it has no data.
  • [Prompt Management] We made stability and performance improvements to prompt management workflows.
  • [Projects] All resources now belong to projects, including those created before Scorecard Projects were introduced.
  • [Exports] Custom fields in CSV exports of run results are handled more reliably.
  • [Organizations] When a user switches organizations, we now redirect them to the organization’s projects page.
  • [Testsets] On the testcase page, we fixed the link back to the testset.
  • [Metrics] We added a new autosize textarea component that lets you keep typing the metric description without running out of space.
  • [Playground] The “Prompt manager”, “Update”, and “Delete prompt” buttons are now disabled for default prompts. When selecting metrics, the “Select and score now” button is now the primary button rather than the “Select” button.
  • [API] When a user does not include their Scorecard API key, we now return a friendlier 401 error: “Missing API key” rather than “malformed token”.
  • [Scoring] The human scoring panel now collapses on the run details page, allowing users to see model responses while scoring.
  • [Platform] We enhanced platform stability and increased test coverage.

March 14, 2025

New Project Overview Page

We redesigned our project overview page, added useful information to the new sidebar, and made it possible to edit a project’s name and description in the same place.

February 24, 2025

Download Testset CSVs

You can now download an entire testset as a CSV using the ‘Export as CSV’ option.

Bug fixes and improvements

When viewing a run’s detailed results, the popover for inspecting a cell’s contents would randomly disappear. This issue is now fixed.

February 13, 2025

Docs Site Revamp

We’re excited to announce we’ve moved to a completely revamped documentation site! Key improvements include:

  • Improved navigation structure
  • Better search functionality
  • Enhanced API documentation
  • New updates section to track changes
  • Modern, cleaner design

This change will help us better serve our users with clearer, more organized documentation.