New updates and improvements
Managing systems in Scorecard just got easier and more powerful:
We’ve updated the Scorecard UI to match our vibrant new brand colors, moving from purple to a fresh orange theme. We hope you love the refreshed look; please share your feedback with us!
We’ve launched new API endpoints for programmatic metric management, enabling teams to create and update metrics directly through our SDK. Create metrics with full control over evaluation type, output format, and prompt templates:
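Below is a minimal Python sketch of that flow. The client name and the metrics.create / metrics.update methods and parameters are assumptions inferred from this description, not confirmed SDK signatures; check the SDK reference docs for the exact calls.

```python
# Hypothetical sketch; the client, method, and parameter names are assumptions.
from scorecard_ai import Scorecard  # assumed package/client name

client = Scorecard()  # assumed to read SCORECARD_API_KEY from the environment

# Create a metric with control over evaluation type, output format, and prompt template.
metric = client.metrics.create(
    name="Answer relevance",
    eval_type="ai",        # assumed value: LLM-judged metric (vs. human or heuristic)
    output_type="int",     # assumed value: integer score (vs. boolean)
    prompt_template=(
        "Rate how relevant {{ outputs.response }} is to "
        "{{ inputs.question }} on a scale of 1 to 5."
    ),
)

# Update the metric later, e.g. to refine its guidelines.
client.metrics.update(
    metric.id,
    prompt_template="Rate relevance from 1 (off-topic) to 5 (fully on-topic).",
)
```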
These endpoints are supported in SDK versions 1.1.0 (JS) and 2.1.0 (Python).
We switched to a new API key format that supports multiple API keys, key revocation, and expiry dates. The new keys are more concise (94% shorter!) than the previously unwieldy JWT-based keys. We’ve deprecated the old API key format but will continue to support existing keys until July 20, 2025.
Following our May 30th launch, SDK v2 is now fully stable and battle-tested. The new SDK features simplified APIs with ergonomic helper methods like runAndEvaluate (JS/TS) and run_and_evaluate (Python), making it easier than ever to integrate evaluations into your workflow.
We’ve made the SDK even more flexible: testset_id is now optional in our helper methods, allowing you to run evaluations with custom inputs without requiring a pre-defined testset (see the sketch below). Additional improvements include:
Available now on npm and PyPI!
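As a rough Python sketch of the testset-free flow. The helper location and parameter names below (inline testcases, metric_ids, a system callable) are assumptions based on this description, not confirmed signatures.

```python
# Hypothetical sketch; imports and parameter names are assumptions.
from scorecard_ai import Scorecard              # assumed client import
from scorecard_ai.lib import run_and_evaluate   # assumed helper location

def my_system(inputs: dict) -> dict:
    # Stand-in for the AI system you want to evaluate.
    return {"response": f"You asked: {inputs['question']}"}

client = Scorecard()  # assumed to read SCORECARD_API_KEY from the environment

run = run_and_evaluate(
    client=client,
    project_id="YOUR_PROJECT_ID",
    # testset_id is now optional: pass ad-hoc testcases inline instead.
    testcases=[{"inputs": {"question": "How do I rotate my API key?"}}],
    metric_ids=["YOUR_METRIC_ID"],
    system=my_system,
)
print(run)
```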
We’ve introduced a two-mode system for creating and editing metrics, making Scorecard accessible to users at every technical level.
Basic mode simplifies metric creation by focusing solely on the evaluation guidelines - just describe what you want to measure in plain language without worrying about prompt templates or variables. Advanced mode gives power users full control over the entire prompt template, including variable handling and custom formatting.
We’ve introduced metric groups (formerly scoring configs) to streamline how you manage and organize your evaluation metrics. Create custom groups of metrics that can be applied to runs with a single selection, making it easier to maintain consistent evaluation standards across your projects. Manage metric groups through an intuitive UI and apply them directly when triggering runs.
The new Scorecard Playground is here! Built for product teams including prompt engineers, product managers, and subject matter experts, it provides a powerful environment to iterate on prompts and test them against multiple inputs simultaneously. Experience smart variable detection with autocomplete support for Jinja syntax, making template creation effortless. Watch your templates come to life with live preview that shows compiled output as you type. Key capabilities include:
Access the playground from any project and start experimenting with your prompts immediately.
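For illustration, here is roughly what the Playground’s live compilation does with a Jinja-style template. This example renders locally with the open-source jinja2 library and is not the Playground’s internal implementation.

```python
from jinja2 import Template

# A prompt template with two variables; the Playground detects `tone` and
# `question` automatically and offers autocomplete for the Jinja syntax.
template = Template(
    "You are a support assistant. Answer in a {{ tone }} tone.\n"
    "Customer question: {{ question }}"
)

# Live preview: the compiled output for one set of sample inputs.
print(template.render(tone="friendly", question="How do I rotate my API key?"))
```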
We’re excited to announce the launch of our new Scorecard homepage! Explore our refreshed website with quick access to docs, product, and a new visual design to share how Scorecard can support you as you’re building your AI product.
We’ve launched a new capability that enables two powerful enhancements for your metrics:
We’ve published an MCP server for Scorecard, enabling powerful new integration possibilities! MCP is an open protocol that standardizes how applications provide context to LLMs, which allows your Scorecard evaluations to connect seamlessly with AI systems. You can now integrate your evaluation data with various AI tools through a single protocol rather than maintaining separate integrations for each service.
We recently held our first Scorecard Evaluation Day with a cohort of inception-stage founders who are deeply invested in their AI agents’ quality. During the event, we exchanged valuable ideas around evaluation goals and methodologies, had thoughtful discussions about defining meaningful metrics, and explored approaches to evaluation (including the Scorecard method) and integrating evaluation into CI/CD pipelines. Thanks to everyone who attended! We greatly appreciate the insightful feedback from these teams and have already implemented several improvements to our UI flows and SDK integration patterns based on your input.
We’ve launched in-app documentation search, making it easier to find exactly what you need without leaving the platform. Now you can quickly search through all of Scorecard’s documentation directly within the application. You can access it by pressing Cmd K or clicking the search docs button in the top right.
We’ve added ergonomic methods to our SDKs to make integration even more seamless. The helper functions runAndEvaluate in the JS/TS SDK and run_and_evaluate in the Python SDK let you easily evaluate systems against testcases and metrics.
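A short Python sketch of the testset-based flow; the helper location and parameter names are assumptions based on this description rather than confirmed signatures.

```python
# Hypothetical sketch; imports and parameter names are assumptions.
from scorecard_ai import Scorecard              # assumed client import
from scorecard_ai.lib import run_and_evaluate   # assumed helper location

client = Scorecard()

run = run_and_evaluate(
    client=client,
    project_id="YOUR_PROJECT_ID",
    testset_id="YOUR_TESTSET_ID",               # evaluate against an existing testset
    metric_ids=["YOUR_METRIC_ID"],
    system=lambda inputs: {"response": "..."},  # your system under test
)
print(run)
```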
We are taking the Scorecard Cloud playground offline for maintenance and upgrades starting May 7th. Note this will not affect custom integrations (e.g. GitHub kickoff). Please reach out to the team at team@scorecard.io or via direct message if you have a workflow that will be affected by this!
We’ve released the alpha of our new Scorecard SDKs, featuring streamlined API endpoints for creating, listing, and updating system configurations, as well as programmatic experiment execution. With this alpha, you can integrate scoring runs directly into your development and CI workflows, configure systems as code, and fully automate your evaluation pipeline without manual steps.
Our pre-release Python SDK (2.0.0-alpha.0) and JavaScript SDK (1.0.0-alpha.1) are now available.
To support the SDK alpha, we’ve launched comprehensive SDK reference docs and concise quickstart guides that show you how to:
Install and initialize the TypeScript or Python SDK.
Create and manage system configurations in code.
Run your first experiment programmatically.
Retrieve and interpret run results within your applications.
Follow these step-by-step walkthroughs to get your first experiment up and running in under five minutes.
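As an end-to-end sketch of those steps in Python. The client, method, and parameter names below are assumptions, not confirmed alpha-SDK signatures; the quickstart guides have the exact calls.

```python
# Hypothetical sketch of the quickstart flow; names are assumptions.
from scorecard_ai import Scorecard  # assumed import after installing the Python SDK

client = Scorecard(api_key="YOUR_API_KEY")

# 1. Create and manage a system configuration in code.
system = client.systems.create(
    name="support-bot",
    config={"model": "gpt-4o", "temperature": 0.2},
)

# 2. Run your first experiment programmatically.
run = client.runs.create(system_id=system.id, testset_id="YOUR_TESTSET_ID")

# 3. Retrieve and interpret the run results within your application or CI job.
results = client.runs.retrieve(run.id)
print(results.status)
```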
[Performance] Reduced page load times and improved responsiveness when handling large run results for the run history table.
[UI] Removed metric-specific scoring progress and the scoring/execution start and end times, and improved how project names wrap across all screen sizes.
[Testsets] Resolved a bug in the new CSV upload flow.
[Testsets] Added back the “Move testset to project” action.
[Testsets] Archived testsets are hidden correctly, keeping your workspace clutter‑free.
[Reliability] Improved API reliability and workflow robustness: fixed run creation schema errors, streamlined testcase creation/duplication/deletion flows, and added inline schema validation to prevent submission errors
[Evals] Migrated from gpt-4-1106-preview (Nov 2023) to gpt-4o for scoring metrics.
We’ve added a new Run History chart on Runs & Results that visualizes your performance trends over time so you can spot regressions or sustained improvements at a glance (up and to the right!). The x‑axis is the run date, the y‑axis is the mean score, and each metric gets its own colored line. You can view this by clicking on the ‘All Runs’ tab of Runs & Results.
Testsets got a full upgrade. We reworked the creation flow, added AI-powered example generation, and streamlined testcase iteration. Filtering, sorting, editing, and bulk actions are now faster and more intuitive—so you can ship better tests, faster.
You can now create testsets with a simplified modal and generate relevant example testcases based on title and description.
Bulk editing tools make it easier to manage and update multiple testcases at once.
You can edit large JSON blobs inline in the testcase detail view, with improved scroll and copy behavior.
The testset detail page now shows the associated schema in context for easier debugging and review.
Navigation has improved with linked testset titles and run/testcase summaries directly accessible from the cards.
Schemas are now defined and managed per testset, rather than at the project level—giving teams more flexibility and control.
We simplified project creation by adding a create project modal to the projects page and project detail pages.
We’re working on overhauling our API and SDKs. We switched to using Stainless for SDK generation and released version 1.0.0-alpha.0 of our Node SDK. Over the next few weeks, we will stabilize the new API and Node and Python SDKs.
We’ve significantly improved our trace management system by relocating traces within the project hierarchy for better organization. Users can now leverage robust search capabilities with full-text search across trace data, complete with highlighted match previews. The new date range filtering system offers multiple time range options from 30 minutes to all time, while project scope filtering allows viewing traces from either the current project or across all projects. We’ve enhanced data visualization with dynamic activity charts and improved trace tables for better insights. Our library support now focuses specifically on Traceloop, OpenLLMetry, and OpenTelemetry for optimal integration.
In addition, the trace system now includes intelligent AI span detection that automatically recognizes AI operations across different providers. Visual AI indicators with special badges clearly show model information at a glance. We’ve added test case generation capabilities that extract prompts and completions to easily create test cases. For better resource monitoring, token usage tracking provides detailed metrics for LLM consumption.
We’ve published comprehensive integration examples demonstrating OpenTelemetry configuration with Scorecard, including Python Flask implementation with LLM tracing for OpenAI and Node.js Express implementation with similar capabilities. A new setup wizard provides clear configuration instructions for popular telemetry libraries to help users get started quickly.
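As a rough Python sketch of the tracing setup: the Scorecard collector endpoint and authorization header below are placeholders, and the setup wizard provides the real values for your project.

```python
# Minimal sketch: send OpenAI traces to Scorecard via Traceloop/OpenLLMetry.
# The endpoint URL and header scheme are placeholders/assumptions; use the
# in-app setup wizard for the exact configuration.
import os

from openai import OpenAI
from traceloop.sdk import Traceloop

Traceloop.init(
    app_name="my-flask-app",
    api_endpoint="https://YOUR_SCORECARD_OTEL_ENDPOINT",  # placeholder
    headers={"Authorization": f"Bearer {os.environ['SCORECARD_API_KEY']}"},  # assumed auth scheme
)

# Once initialized, OpenAI calls are auto-instrumented as AI spans,
# with model information and token usage attached.
client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello from a traced request!"}],
)
print(completion.choices[0].message.content)
```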
We also updated our quickstart documentation to be more comprehensive.
We redesigned our project overview page, including some useful information in the new sidebar and made it possible to edit the name and description of a project in the same place.
We’re excited to announce we’ve moved to a completely revamped documentation site! Key improvements include:
This change will help us better serve our users with clearer, more organized documentation.