August 29, 2025

🔌 Custom Endpoints with Multi-Turn Support

Test any HTTP API endpoint directly in Scorecard. Configure once, use everywhere - from production APIs to local development servers.
Screenshot of the Edit Endpoint modal
Screenshot of the Kickoff Run modal with the Endpoint tab
  • Universal API Testing: Support for all HTTP methods with custom headers and bodies
  • Multi-Turn Conversations: Enable simulated conversations with configurable AI personas
  • Tabbed Run Interface: Switch between Scorecard, GitHub, and Endpoints in one modal
  • Response Path Extraction: Extract specific values from JSON responses for evaluation
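To make the response path idea concrete, here is a purely illustrative Python sketch. The dotted-path syntax shown is not necessarily Scorecard's exact syntax; it simply shows how a path can pull one value out of a nested JSON response so that value can be scored.
# Purely illustrative: shows what a "response path" refers to, not Scorecard's exact syntax.
# Suppose your endpoint returns an OpenAI-style JSON body:
response_body = {
    "choices": [
        {"message": {"role": "assistant", "content": "Paris is the capital of France."}}
    ]
}

# A dotted path such as "choices.0.message.content" identifies the nested value
# that should be extracted and handed to your metrics for evaluation.
def extract(body, path):
    value = body
    for key in path.split("."):
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value

print(extract(response_body, "choices.0.message.content"))
# -> Paris is the capital of France.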

🎯 Streamlined Onboarding

New users now start with a pre-configured example project and an automatic first run, allowing them to see evaluation results immediately without any manual setup.
Screenshot of the onboarding banner
  • Auto-generated Project: Complete with testsets, metrics, and sample data
  • Instant First Run: Results ready on first login
  • Guided Actions: Clear prompts to view results or create custom runs

📚 Expanded Documentation

We've published new comprehensive guides expanding our feature documentation.
August 22, 2025

📋 Trace to Testcase Creation

Create testcases directly from production traces with our new trace-to-testcase feature. Turn real user interactions into structured test data with a single click, making it easy to build datasets from production traffic that actually matter.
Screenshot of trace overview with create testcase button
The workflow is simple: select a span, choose your testset, and Scorecard auto-extracts the prompt and completion fields for you.
Screenshot of testset selection modal
Screenshot of testcase fields auto-populated from trace
Key capabilities:
  • One-click creation: Convert any trace into a testcase directly from the trace details page
  • Smart field detection: Automatically populate prompt and completion values from trace data
  • Schema compatibility: Automatically detects target testset schema and maps fields correctly
  • Production-grounded datasets: Build testsets from real user interactions instead of synthetic data
Perfect for creating “golden datasets” from your best production examples and edge cases you want to regression test.

🏷️ Metadata Fields in Testset Schemas

You can now mark fields as “metadata” in your testset schema management UI. Metadata fields are stored with your testcases but won’t be used during evaluation runs, giving you more flexibility to store contextual information alongside your test data.
Screenshot of add testcase modal showing metadata field
  • Flexible data storage: Store additional context without affecting evaluation logic
  • Schema management: Easy toggle in the testset schema editor
  • Future SDK support: Full SDK integration coming soon for programmatic access

🛠️ Bug Fixes and Improvements

  • [Tracing] Fixed pagination issues with traces containing many spans - improved parent span detection
  • [Testcases] Fixed refresh bug where editing a field could overwrite with old data
  • [Projects] Fixed refresh issue after creating new projects - list now updates immediately
  • [Scoring] Better error messages when OpenAI API keys are missing instead of generic “Internal service error”
  • [UI] Improved copy testcase, move metric, and move testset dialogs with better scrollability and selection
  • [Documentation] Added comprehensive guides for monitoring quickstart, monitoring features, and trace-to-testcase
August 8, 2025

🤖 Sim Agents for Multi-Turn Conversation Testing

We’ve launched Sim Agents, a powerful new capability for testing multi-turn conversations with your AI systems. Create configurable AI personas that interact with your system during testing, simulating real user behaviors from polite customers to escalation scenarios.
Screenshot of Sim Agent configuration
Key capabilities:
  • Persona configuration: Define user behaviors, goals, and interaction patterns with Jinja2 templating
  • SDK integration: Run simulations programmatically with the multi_turn_simulation() method (see the sketch after this list)
  • Conversation control: Set stop conditions, max turns, and timeout limits for realistic testing
  • Chat visualization: View full conversation history in beautiful chat bubble format
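A rough sketch of what programmatic use might look like is below. Only the multi_turn_simulation() name comes from this announcement; the client handle, parameter names, and persona template are illustrative assumptions, so consult the SDK docs for the real signature.
# Hypothetical sketch, not the SDK's exact signature: only the multi_turn_simulation()
# name comes from the announcement above; everything else here is an illustrative guess.
# `client` is assumed to be an initialized Scorecard SDK client.
result = client.multi_turn_simulation(
    system_id="sys_123",                        # the system under test (assumed parameter)
    persona=(
        "You are {{ name }}, a {{ mood }} customer. "   # Jinja2-templated persona, as announced
        "Your goal is to {{ goal }}."
    ),
    persona_vars={"name": "Dana", "mood": "frustrated", "goal": "get a refund"},
    max_turns=6,                                # conversation control: cap the dialogue length
    stop_condition="goal_reached",              # illustrative stop condition
    timeout_seconds=120,
)
# The full conversation history is then viewable in the chat-bubble UI.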

📊 Online Evaluations (Beta)

Monitor and evaluate your AI systems in production with our new online evaluation infrastructure. Configure monitoring rules per project, automatically score production traces, and get real-time insights into model performance.
Screenshot of online evaluation scores in trace details
  • Monitoring configuration: Set up project-specific rules with selected metrics and scheduling
  • Cost visibility: View token costs per span and per evaluation directly in the UI
  • Trace integration: Link test records to their source traces for full observability
  • Real-time scoring: Automatically score production traces as they are ingested

🛠️ Bug Fixes and Improvements

  • [Performance] Major Prisma upgrade delivering 2-5x faster load times - run lists now load in under 2 seconds (down from 3-4s)
  • [Monitoring] Added token cost tracking per span on the traces page
Screenshot showing cost tracking in the traces list
  • [Monitoring] Display scoring costs for LLM-as-a-judge evaluations
  • [Infrastructure] Increased collector object size limit for larger traces
August 1, 2025

📊 Enhanced Tracing Experience

We’ve completely reimagined our tracing interface to provide deeper insights into your AI system’s performance. The new tracing UI features flame graph visualizations that make it instantly clear which operations take the longest, helping you identify and optimize bottlenecks in your LLM pipelines.
Screenshot of the new tracing UI with flame graph visualization
Key improvements include:
  • Flame graph visualization: See span durations at a glance to quickly identify performance bottlenecks
  • Smart defaults: Tracing page now defaults to “All projects” view so you never miss traces
  • Simplified authentication: Use your standard Scorecard API key for tracing - no more separate JWT tokens

🚀 Smarter Run Status Tracking

We’ve significantly enhanced how run statuses are calculated and displayed throughout the platform. You can now see exactly why a run is in its current state with detailed hover explanations, making it easier to understand and debug your evaluation pipelines.
Screenshot of improved run status with hover explanations
  • Intelligent status calculation: Database now tracks expected vs. actual test records for accurate progress reporting
  • Hover explanations: See detailed progress like “0 testrecords created / 35 testcases from testset 1234” to understand exactly what’s happening
  • Real-time updates: Run status automatically updates as scoring progresses

🛠️ Bug Fixes and Improvements

  • [Tracing] Fixed crash when navigating back to page 2 in traces list
  • [Tracing] Time range filter now correctly updates chart data
  • [Documentation] Published tracing quickstart guide with integrations for Vercel AI SDK and OpenLLMetry
July 25, 2025

📊 Run History

We’ve added a new Run History visualization to help you track evaluation performance over time. The chart displays aggregate scores per run, making it easy to spot trends and regressions across your metrics. You can view this directly on the run list page to see how your model’s performance evolves with each iteration.
Screenshot of the Run History chart

🔍 Tracing improvements

We’ve upgraded our tracing infrastructure to provide better observability into your AI systems:
  • Open-source collector architecture: Migrated to the upstream OpenTelemetry standard collector on Railway for improved reliability and performance.
  • Unified API key authentication: You can now use your existing Scorecard API key for tracing so you no longer need to manage separate authentication tokens.
  • New tracing integrations: Published official integrations for the Vercel AI SDK and OpenLLMetry, with examples in both Python and Node.js. Our new step-by-step guide gets you from zero to traced LLM calls in minutes.
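As a rough illustration (not the official quickstart), an OpenLLMetry setup in Python might look like the sketch below. The base URL and auth header are placeholders; check the tracing quickstart for the exact values and header names.
# Rough illustration of an OpenLLMetry (Traceloop SDK) setup; the base URL and header
# value below are placeholders — see the Scorecard tracing quickstart for exact values.
import os

os.environ["TRACELOOP_BASE_URL"] = "https://<scorecard-collector-endpoint>"           # placeholder
os.environ["TRACELOOP_HEADERS"] = "<auth header containing your Scorecard API key>"   # placeholder

from traceloop.sdk import Traceloop
from openai import OpenAI

Traceloop.init(app_name="my-llm-app")  # auto-instruments supported LLM clients

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
# The call above is exported as an OpenTelemetry trace that Scorecard can display.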

🛠️ Bug Fixes and Improvements

  • [Run kickoff] Fixed model scrolling in Run Kickoff modal UI.
  • [Run details] Metrics cards and score columns are now consistently sorted alphabetically.
  • [Run details] Sorting a column now displays the sort direction.
July 18, 2025

Separate Score Columns

On the run details page, each metric now has its own column of scores, rather than a single column for all metrics. This makes it easier to compare the scores across records for a particular metric. It also enables sorting records by a metric’s score.
Screenshot of the separate score columns in the UI.

🛠️ Bug Fixes and Improvements

  • [Performance] Improved loading time of the Runs list page.
  • [Runs] Added a new “Run Again” button to the Run details page, allowing you to re-run the same testset/system combo with the same metrics and model parameters.
  • [Runs] Added “System Version” link to the Run details page.
July 11, 2025

🚀 Onboarding Improvements

We’ve streamlined the onboarding process for new organizations with several key improvements:
  • Automatic defaults: New organizations now receive default projects, testsets, metrics, and prompts automatically, significantly reducing initial setup time
  • Free API key included: Every new organization gets a default free API key (Gemini Flash), eliminating the need for initial user configuration. Users without their own API key will be clearly notified that the system is defaulting to the free key.

🏃 Kickoff Runs from Playground

You can now trigger runs directly from the Playground and view them in the Runs & Results section. We’re considering adding scoring capabilities from the Playground as well - let us know if this is important to your team!
Trigger runs directly from the Playground

🛠️ Bug Fixes and Improvements

  • [Performance] Improved page load performance on testcases, testsets, metrics, and trigger run pages
  • [Projects] Added search functionality for projects
  • [Metrics] Added search functionality for metrics
  • [Testsets] Added search functionality for testsets
  • [Systems] All systems are now required to have a production version
  • [Quickstart] Both main and systems quickstarts now reflect the latest SDK with clearer steps and updated code, enabling new users to go from install to first run in minutes
  • [API Keys] Every member can view existing keys, but only admins can create or revoke them, providing teams with transparency while maintaining control
  • [Docs Search] Rebuilt documentation search pipeline reducing typical response times from ~15s to just 3s, helping you find answers more quickly
June 20, 2025

🖥️ Systems Enhancements

Managing systems in Scorecard just got easier and more powerful:
  • You can now directly trigger a run from the trigger-run page, making it quicker and simpler to execute tests.
  • All your systems are now clearly visible and easily manageable from a single, user-friendly interface. Update configurations, manage versions, and maintain systems effortlessly.
Systems overview interface showing all systems in one place

🎨 New Color Palette

We’ve updated the Scorecard UI to match our vibrant new brand colors, moving from purple to a fresh orange theme. We hope you love the refreshed look - please share your feedback with us!
New orange color palette in the Scorecard UI

🛠️ Bug Fixes and Improvements

  • [Runs] We’ve improved pagination for test records in runs, making load times and navigation quicker and setting the stage for upcoming filtering enhancements.
June 13, 2025

🎯 Metric API

We’ve launched new API endpoints for programmatic metric management, enabling teams to create and update metrics directly through our SDK. Create metrics with full control over evaluation type, output format, and prompt templates:
# `client` is an initialized Scorecard Python SDK client (see supported SDK versions below)
client.metrics.create(
    project_id="314",
    name="Response Accuracy",
    eval_type="ai",
    output_type="boolean",
    prompt_template="Evaluate if the following response is factually accurate: {{outputs.response}}",
)
These endpoints are supported in SDK versions 1.1.0 (JS) and 2.1.0 (Python).

🎯 Improved API key format

We switched to a new API key format, which supports having multiple API keys, revoking them, and setting expiry dates. Our new API keys are more concise (94% shorter!) than the previously unwieldy JWT-based keys. We’ve deprecated the old API key format but will continue to support it until July 20, 2025.
New API key management interface showing multiple keys and expiry options

Bug fixes and improvements

  • [Metrics] Improved metrics management UI with tabbed sections for better organization
  • [Metrics] Added a metric selector with a searchable dropdown - no more manually typing metric IDs
Metric selector with searchable dropdown
  • [Runs] On the test record details page you can now see the text of the fully compiled metrics sent to the LLM for evaluation
Compiled metrics text view in run details
June 10, 2025

🚀 SDK v2 Stable Release

Following our May 30th launch, SDK v2 is now fully stable and battle-tested. The new SDK features simplified APIs with ergonomic helper methods like runAndEvaluate (JS/TS) and run_and_evaluate (Python), making it easier than ever to integrate evaluations into your workflow.
We’ve made the SDK even more flexible - testset_id is now optional in our helper methods, allowing you to run evaluations with custom inputs without requiring a pre-defined testset (a sketch of the Python helper follows below). Additional improvements include:
  • System configurations for experimenting with different model settings
  • Enhanced error handling and debugging capabilities
  • Full TypeScript support with comprehensive type definitions
Available now on npm and PyPI!
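For a rough idea of the Python helper in action, here is a hedged sketch: run_and_evaluate is the name given above, but the import path, parameter names, and the exact way a system callable and custom inputs are passed are assumptions, so see the SDK reference for the real signature.
# Hedged sketch, not the exact SDK API: only the run_and_evaluate name comes from this
# announcement. The import path, parameter names, and input shapes are assumptions.
from scorecard_ai import Scorecard   # assumed import path for the Python SDK

client = Scorecard()  # assumed to read SCORECARD_API_KEY from the environment

def my_system(inputs):
    # Replace with a call to your own model or application; returns outputs to be scored.
    return {"response": f"You asked: {inputs.get('question', '')}"}

result = client.run_and_evaluate(        # attaching the helper to the client is an assumption
    project_id="314",
    metric_ids=["812"],
    # testset_id is now optional: pass ad-hoc inputs instead of a pre-defined testset
    inputs=[{"question": "What is the capital of France?"}],
    system=my_system,
)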

🎯 Basic and Advanced Metric Modes

We’ve introduced a two-mode system for creating and editing metrics, making Scorecard accessible to users at every technical level. Basic mode simplifies metric creation by focusing solely on the evaluation guidelines - just describe what you want to measure in plain language without worrying about prompt templates or variables. Advanced mode gives power users full control over the entire prompt template, including variable handling and custom formatting.
Metric editor interface showing Basic and Advanced mode options

Bug fixes and improvements

  • [Documentation] Streamlined quickstart guide for faster onboarding
  • [Playground] More detailed error messages and prominent results table button
  • [Runs] Delete runs directly from the runs list page
  • [Runs] Run status now updates automatically based on scoring progress
  • [Platform] Fixed data invalidation bugs when managing metrics and testcases
May 30, 2025

🔧 Metric Groups

We’ve introduced metric groups (formerly scoring configs) to streamline how you manage and organize your evaluation metrics. Create custom groups of metrics that can be applied to runs with a single selection, making it easier to maintain consistent evaluation standards across your projects. Manage metric groups through an intuitive UI and apply them directly when triggering runs.

🧪 New Playground Experience

The new Scorecard Playground is here! Built for product teams including prompt engineers, product managers, and subject matter experts, it provides a powerful environment to iterate on prompts and test them against multiple inputs simultaneously. Experience smart variable detection with autocomplete support for Jinja syntax, making template creation effortless. Watch your templates come to life with live preview that shows compiled output as you type. Key capabilities include:
  • Batch testing - run prompts against entire testsets with one click
  • Model configuration with customizable temperature and parameters
  • Persistent state - your playground configuration is saved in the URL
Access the playground from any project and start experimenting with your prompts immediately.
May 23, 2025

🎉 New Homepage Launch!

We’re excited to announce the launch of our new Scorecard homepage! Explore our refreshed website with quick access to docs, product, and a new visual design to share how Scorecard can support you as you’re building your AI product.
Screenshot of the new Scorecard homepage

⚙️ Fully Configurable Metrics

We’ve launched a new capability that brings two powerful enhancements to your metrics:
  • Fully customize your evaluation model (GPT-4o, Claude-3, Gemini, etc.) or use your own selected hosted model
  • Metrics now also support structured outputs, significantly increasing the reliability of your scores.
Screenshot of the Custom Metrics configuration interface
May 2, 2025

🔌 Scorecard MCP Server

We’ve published an MCP server for Scorecard, enabling powerful new integration possibilities! MCP is an open protocol that standardizes how applications provide context to LLMs, allowing your Scorecard evaluations to connect seamlessly with AI systems. You can integrate your evaluation data with various AI tools through a single protocol rather than maintaining separate integrations for each service.

💯 Scorecard Eval Day!

We recently held our first Scorecard Evaluation Day with a cohort of inception-stage founders who are deeply invested in their AI agents’ quality. During the event, we exchanged valuable ideas around evaluation goals and methodologies, had thoughtful discussions about defining meaningful metrics, and explored approaches to evaluation (including the Scorecard method) and integrating evaluation into CI/CD pipelines. Thanks to everyone who attended! We greatly appreciate the insightful feedback from these teams and have already implemented several improvements to our UI flows and SDK integration patterns based on your input.
Scorecard Eval Day event poster
We’ve also launched in-app documentation search, making it easier to find exactly what you need without leaving the platform. Now you can quickly search through all of Scorecard’s documentation directly within the application. Access it by pressing Cmd+K or clicking the search docs button in the top right.
Screenshot of the in-app documentation search feature (QA Docsbot)

👷 SDK v2 Beta

We’ve added ergonomic methods to our SDKs to make integration even more seamless. The helper functions runAndEvaluate in the JS/TS SDK and run_and_evaluate in the Python SDK let you easily evaluate systems against testcases and metrics.

🔧 Scorecard Playground Scheduled Maintenance

We are taking the Scorecard Cloud playground offline for maintenance and upgrades starting May 7th. Note this will not affect custom integrations (e.g. GitHub kickoff). Please reach out to the team at team@scorecard.io or via direct message if you have a workflow that will be affected by this!

Bug fixes and improvements

  • [UI] Fixed record rendering in the UI when using the new SDK.
  • [Navigation] Repaired broken links in the Run grades table to Testcase and Record pages.
  • [Terminology] Standardized terminology by renaming “test record” to “record” throughout the UI.
  • [Documentation] Streamlined our quickstart documentation so you can get started with Scorecard in just 5 minutes.
April 25, 2025

New API and SDK

We’ve released the alpha of our new Scorecard SDKs, featuring streamlined API endpoints for creating, listing, and updating system configurations, as well as programmatic experiment execution. With this alpha, you can integrate scoring runs directly into your development and CI workflows, configure systems as code, and fully automate your evaluation pipeline without manual steps.
Our pre-release Python SDK (2.0.0-alpha.0) and JavaScript SDK (1.0.0-alpha.1) are now available on PyPI and npm.

New Quickstarts and Documentation

To support the SDK alpha, we’ve launched comprehensive SDK reference docs and concise quickstart guides that show you how to:
  • Install and initialize the TypeScript or Python SDK.
  • Create and manage system configurations in code.
  • Run your first experiment programmatically.
  • Retrieve and interpret run results within your applications.
Follow these step-by-step walkthroughs to get your first experiment up and running in under five minutes.

Bug fixes and improvements

  • [Performance] Reduced page load times and improved responsiveness when handling large run results for the run history table.
  • [UI] Removed metric-specific scoring progress and scoring/execution start and end times, and improved how project names wrap across all screen sizes.
  • [Testsets] Resolved a bug in the new CSV upload flow
  • [Testsets] Added back the “Move testset to project” action
  • [Testsets] Archived testsets are hidden correctly, keeping your workspace clutter‑free.
  • [Reliability] Improved API reliability and workflow robustness: fixed run creation schema errors, streamlined testcase creation/duplication/deletion flows, and added inline schema validation to prevent submission errors
  • [Evals] Migrated from gpt-4-1106-preview (Nov 2023) to gpt-4o for scoring metrics
April 18, 2025

Run insights

We’ve added a new Run History chart on Runs & Results that visualizes your performance trends over time to spot regressions or sustained improvements at a glance (up and to the right!). The x‑axis is the run date, the y‑axis is the mean score, and each metric gets its own colored line. You can view this by clicking on the ‘All Runs’ tab of Runs & Results.

🧪 Testsets 2.0 for Easier Creation and Iteration

Testsets got a full upgrade. We reworked the creation flow, added AI-powered example generation, and streamlined testcase iteration. Filtering, sorting, editing, and bulk actions are now faster and more intuitive—so you can ship better tests, faster.
  • You can now create testsets with a simplified modal and generate relevant example testcases based on title and description.
  • Bulk editing tools make it easier to manage and update multiple testcases at once.
  • You can edit large JSON blobs inline in the testcase detail view, with improved scroll and copy behavior.
  • The testset detail page now shows the associated schema in context for easier debugging and review.
  • Navigation has improved with linked testset titles and run/testcase summaries directly accessible from the cards.

🗂️ Improved schema management

Schemas are now defined and managed per testset, rather than at the project level—giving teams more flexibility and control.
  • The schema editor has been redesigned, allowing teams to update schemas independently for each testset.
  • Schema changes now reflect immediately in the testcase table to help users see their impact in real time.
  • Users can view and copy raw schema JSON for integration with their own tools or SDKs.
  • We’ve also improved messaging in the schema editor to clarify the distinction between inputs and labels.

Bug fixes and improvements

  • [Platform] Internal tech stack upgrades to support faster product iteration
  • [Testsets] Friendlier zero-state, faster load, cards with quick actions and live counts
  • [Testsets] Tag propagation fix — updates now apply across all testsets and views
  • [Testsets] Improved sorting behavior, including reliable default and column sorting
  • [Testsets] Filter testcases by keyword, searching across the full dataset
  • [Testsets] Updated page actions with visible bulk tools for managing multiple testcases
  • [Testsets] Testset cards now link to runs/testcases, and support fast schema editing, duplication, or deletion
  • [Testsets] Titles now link directly to the testset detail page
  • [Testcases] Detail page supports editing and copying large JSON blobs
  • [Testcases] Schema panel added for better context while reviewing or editing testcases
  • [Projects] Enhanced cards with summaries for testsets, metrics, and runs, all linked for easier navigation
  • [Projects] Improved sorting with more intuitive labels and default order
  • [Projects] Faster load performance across the project overview page
  • [Schemas] Improved editor messaging to clarify the difference between input fields and labels
  • [Toast Messages] Now deep-link to newly created items (testsets, testcases, projects)
  • [Performance] Faster page loads, filtering, sorting, and table actions, powered by new APIs and backend improvements
April 11, 2025

Projects

We simplified project creation by adding a create project modal to the projects page and project detail pages.

SDKs

We’re working on overhauling our API and SDKs. We switched to using Stainless for SDK generation and released version 1.0.0-alpha.0 of our Node SDK. Over the next few weeks, we will stabilize the new API and Node and Python SDKs.

Bug fixes and improvements

  • [Playground] Filtering and searching by testcase now work properly. Pagination was also improved and now works as expected; previously it showed inconsistent items in some cases.
  • [Testsets] We fixed a bug where we exported empty testcase values as the string “null” instead of an empty string.
  • [Projects] We simplified project creation by adding a create project modal to the projects page and project detail pages.
  • [Testcases] We fixed a bug that broke the Generate testcases feature.
  • [Settings] We added some text on the API keys page to clarify that your Scorecard API key is personal, but model API keys are scoped to the organization.
April 4, 2025

Tracing

We’ve significantly improved our trace management system by relocating traces within the project hierarchy for better organization. Users can now leverage robust search capabilities with full-text search across trace data, complete with highlighted match previews. The new date range filtering system offers multiple time range options from 30 minutes to all time, while project scope filtering allows viewing traces from either the current project or across all projects. We’ve enhanced data visualization with dynamic activity charts and improved trace tables for better insights. Our library support now focuses specifically on Traceloop, OpenLLMetry, and OpenTelemetry for optimal integration.
In addition, the trace system now includes intelligent AI span detection that automatically recognizes AI operations across different providers. Visual AI indicators with special badges clearly show model information at a glance. We’ve added test case generation capabilities that extract prompts and completions to easily create test cases. For better resource monitoring, token usage tracking provides detailed metrics for LLM consumption.

Examples repository

We’ve published comprehensive integration examples demonstrating OpenTelemetry configuration with Scorecard, including a Python Flask implementation with LLM tracing for OpenAI and a Node.js Express implementation with similar capabilities. A new setup wizard provides clear configuration instructions for popular telemetry libraries to help users get started quickly. We also updated our quickstart documentation to be more comprehensive.

Bug fixes and improvements

  • [Scoring] When a run metric has not yet been scored, we now display N/A instead of NaN, making it clearer that it has no data.
  • [Prompt Management] We made stability and performance improvements to prompt management workflows.
  • [Projects] All resources now belong to projects, including those created before Scorecard Projects were introduced.
  • [Exports] Custom fields in CSV exports of run results are handled more reliably.
  • [Organizations] When a user switches organizations, we now redirect them to the organization’s projects page.
  • [Testsets] On the testcase page, we fixed the link back to the testset.
  • [Metrics] We added a new autosize textarea component that lets you keep typing the metric description without running out of space.
  • [Playground] The “Prompt manager”, “Update”, and “Delete prompt” buttons are now disabled for default prompts. When selecting metrics, the “Select and score now” button is now the primary button rather than the “Select” button.
  • [API] When a user does not include their Scorecard API key, we now return a friendlier 401 error: “Missing API key” rather than “malformed token”.
  • [Scoring] The human scoring panel on the run details page can now be collapsed, allowing users to see model responses while scoring.
  • [Platform] We enhanced platform stability and increased test coverage.
March 14, 2025

New Project Overview Page

We redesigned our project overview page, adding useful information to the new sidebar, and made it possible to edit a project’s name and description in the same place.
February 24, 2025

Download Testset CSVs

You can now download an entire testset as a CSV using the ‘Export as CSV’ option.

Bug fixes and improvements

When viewing a run’s detailed results, the popover for inspecting a cell’s contents would randomly disappear. This issue is now fixed.
February 13, 2025

Docs Site Revamp

We’re excited to announce we’ve moved to a completely revamped documentation site! Key improvements include:
  • Improved navigation structure
  • Better search functionality
  • Enhanced API documentation
  • New updates section to track changes
  • Modern, cleaner design
This change will help us better serve our users with clearer, more organized documentation.