
Overview

Scorecard’s MCP (Model Context Protocol) server transforms AI assistants like Claude and Cursor into conversational AI evaluation companions. With natural language commands, you can manage projects, create testsets, configure metrics, run evaluations, and analyze results—all through your favorite AI assistant’s interface.

Setting Up the MCP Server

Prerequisites

  • An MCP-compatible client (Claude Desktop, Cursor, or other MCP clients)
  • A Scorecard account with API access
You can connect to Scorecard’s remote MCP server without installing any dependencies.

Claude Desktop

Go to the Claude Desktop settings page and open the “Connectors” tab. Click “Add custom connector” and paste the following URL: https://mcp.scorecard.io/mcp. Click “Add” in the modal, then click “Connect” to log in to Scorecard.

Screenshot of the Claude Desktop MCP connector

Local configuration

You can directly run the MCP Server locally via npx:
export SCORECARD_API_KEY="My API Key"
npx -y scorecard-ai-mcp@latest
If you already have a client, consult their documentation to install the MCP server. For clients with a configuration JSON, it might look something like this:
{
  "mcpServers": {
    "scorecard_ai": {
      "command": "npx",
      "args": ["-y", "scorecard-ai-mcp", "--client=claude", "--tools=dynamic"],
      "env": {
        "SCORECARD_API_KEY": "ak_MyAPIKey",
      }
    }
  }
}
The MCP server uses Clerk OAuth authentication and JWT tokens to securely connect to your Scorecard account. The configuration is identical across all MCP clients—simply add it to your client’s MCP settings.
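
If your MCP client cannot connect to remote servers directly, one common workaround is to bridge the hosted endpoint through the open-source mcp-remote package, which proxies a remote MCP server over stdio and walks you through the OAuth login in your browser. A sketch of that configuration (the "scorecard_ai" key is just a label) might look like:
{
  "mcpServers": {
    "scorecard_ai": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.scorecard.io/mcp"]
    }
  }
}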

Core Capabilities

The MCP server provides natural language access to Scorecard’s core functionality (a sketch for listing the underlying tools programmatically follows this list):
Create and manage evaluation projects for your AI systems.
Example Commands:
  • “Create a new project for evaluating my customer service chatbot”
  • “Show me all my current projects”
  • “Set up a project for testing my RAG pipeline”
Build comprehensive testsets with various scenarios and edge cases.
Example Commands:
  • “Create a testset for customer service scenarios”
  • “Add 20 testcases covering product returns and refunds”
  • “Import testcases from my CSV file”
Organize and categorize your testcases for systematic evaluation.
Example Commands:
  • “Group testcases by difficulty level”
  • “Add tags for ‘edge cases’ and ‘common queries’”
  • “Show me all testcases related to billing issues”
Define metrics that matter for your specific use case.
Example Commands:
  • “Configure accuracy and helpfulness metrics”
  • “Add a custom metric for response relevance”
  • “Set up hallucination detection scoring”
Track different versions of your AI systems and models.
Example Commands:
  • “Register my GPT-4 based assistant as version 1.0”
  • “Create a new version for my updated prompt template”
  • “Compare versions 1.0 and 2.0 of my chatbot”
Execute evaluations and analyze performance results.
Example Commands:
  • “Run an evaluation against my latest model”
  • “Show me the performance results from yesterday’s run”
  • “Compare accuracy across the last 5 evaluation runs”
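
Each of the capabilities above is exposed to your assistant as an MCP tool. If you want to inspect those tools outside of an assistant, a minimal sketch using the official MCP TypeScript SDK (@modelcontextprotocol/sdk) to spawn the local server and enumerate its tools might look like the following; the client name and version are arbitrary placeholders, and it assumes SCORECARD_API_KEY is exported in your shell.

// Hedged sketch: spawn the local Scorecard MCP server over stdio and list its tools.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the same command shown in the local configuration above.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "scorecard-ai-mcp@latest"],
  env: {
    SCORECARD_API_KEY: process.env.SCORECARD_API_KEY ?? "",
    PATH: process.env.PATH ?? "",
  },
});

const client = new Client({ name: "scorecard-tools-check", version: "0.0.1" });
await client.connect(transport);

// Print the names of the tools the server exposes (projects, testsets, metrics, runs, and so on).
const { tools } = await client.listTools();
console.log(tools.map((tool) => tool.name));

await client.close();

Run it with a TypeScript runner such as tsx; the tool names are defined by the server, so treat its output rather than this sketch as the source of truth.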

Example Workflows

Complete Evaluation Setup

Here’s how you might set up a complete evaluation workflow using natural language in any MCP client:
  1. Create a Project: “Create a new project called ‘Customer Support Bot v2’ for evaluating my updated support assistant”
  2. Define Testcases: “Create a testset with 50 diverse customer support scenarios including billing, technical issues, and product inquiries”
  3. Configure Metrics: “Set up metrics for accuracy, response helpfulness, hallucination rate, and response time”
  4. Register Your Model: “Register my current GPT-4 based assistant with custom prompts as version 2.0”
  5. Run Evaluation: “Run a full evaluation of version 2.0 against all testcases”
  6. Analyze Results: “Show me areas where the model is underperforming and suggest improvements”

Continuous Testing Workflow

The MCP server supports recurring workflows such as daily testing, A/B testing, and regression testing.
Example Commands:
  • “Run daily evaluation on production model”
  • “Alert me if accuracy drops below 85%”
  • “Generate weekly performance report”

Advanced Use Cases

Multi-Model Comparison

Use your AI assistant to orchestrate complex multi-model evaluations.
Example Commands:
  • “Compare GPT-4, Claude 3, and Llama 3 on my customer service testset”
  • “Evaluate cost-performance tradeoffs between models”
  • “Recommend the best model for my use case”

Automated Test Generation

Leverage your AI assistant’s understanding to create comprehensive test suites.
Example Commands:
  • “Generate 100 edge cases for my medical diagnosis assistant”
  • “Create adversarial testcases to test robustness”
  • “Build a testset from real user conversations”

Performance Optimization

Get insights and recommendations for improving your AI systems.
Example Commands:
  • “Analyze failure patterns in my evaluation results”
  • “Suggest prompt improvements based on errors”
  • “Identify which types of queries need more training data”

Technical Architecture

The MCP server is:
  • Built on the Model Context Protocol standard
  • Compatible with any MCP client (Claude Desktop, Cursor, and more)
  • Deployed on Vercel edge infrastructure for low latency
  • Secured with Clerk OAuth authentication
  • Open source and available on GitHub
The MCP server is continuously updated with new capabilities. Check the GitHub repository for the latest features and updates.

Getting Help

If you encounter issues or have questions about the MCP server:
  1. Check the GitHub repository for documentation
  2. Open an issue for bugs or feature requests
  3. Contact Scorecard support (support@scorecard.io) for account-related questions