Getting Started

An evaluation (or “eval”) is the systematic process of testing and measuring your AI system’s performance against defined criteria. In Scorecard, an evaluation consists of:
  • Testset: A collection of test cases with inputs and expected outputs
  • System: The AI agent, prompt, API endpoint, or workflow being evaluated
  • Metrics: Scoring criteria (accuracy, safety, tone, etc.) with AI, Human, or Code-based evaluation
  • Record: A single evaluation result capturing inputs, outputs, and scores
Evaluations help you quantitatively measure AI quality, track improvements over time, and catch regressions before production. Think of it like continuous integration for AI—defining what good looks like and measuring against it systematically.
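As a rough illustration of how the four pieces fit together, here is a minimal sketch in plain Python. It does not use the Scorecard SDK; the names `Case`, `Record`, `exact_match`, `evaluate`, and `echo_upper` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Case:
    """One Testset entry: an input plus the expected output."""
    input: str
    expected_output: str


@dataclass
class Record:
    """One evaluation result: input, output, and a score per metric."""
    input: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)


def exact_match(output: str, expected: str) -> float:
    """A code-based metric: 1.0 on an exact match, otherwise 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def evaluate(testset, system, metrics):
    """Run `system` on every case and score each output with every metric."""
    records = []
    for case in testset:
        output = system(case.input)
        scores = {name: fn(output, case.expected_output) for name, fn in metrics.items()}
        records.append(Record(case.input, output, scores))
    return records


def echo_upper(text: str) -> str:
    """Toy stand-in for the System (an agent, prompt, or endpoint)."""
    return text.upper()


testset = [Case("hello", "HELLO"), Case("bye", "Bye")]
records = evaluate(testset, echo_upper, {"exact_match": exact_match})
accuracy = mean(r.scores["exact_match"] for r in records)
print(f"accuracy = {accuracy:.2f}")
```

Re-running the same testset after each change to the system gives the regression check described above.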
Scorecard is the simulation platform for agent self-improvement. While most eval tools focus on observing and testing after deployment, Scorecard helps you proactively simulate thousands of realistic scenarios and apply expert judgment at scale, so you can ship frontier agent capabilities in days, not weeks.
Scorecard also brings together AI researchers, engineers, and subject-matter experts into a single workflow. SMEs encode their knowledge into metrics and reward models instead of manually labeling production logs, while AI researchers build improvement flywheels instead of running QA cycles.
Simulation is the process of running your AI agent through thousands of realistic scenarios to test and improve its capabilities, without relying on production traffic or manual review.
Traditional eval approaches treat quality as a QA problem: wait for production logs, manually review cases, and fix issues one at a time. This breaks down as agents get more complex because:
  • Coverage gaps: You miss edge cases until real users hit them
  • Slow feedback: Collecting enough production data takes days or weeks
  • Expert bottleneck: SMEs spend time labeling individual cases instead of teaching the system what “good” looks like
  • Ceiling on improvement: You’re limited to scenarios that happen to occur in production
Simulation flips this. Instead of reacting to problems in production, you proactively generate the messy, realistic interactions your agent will encounter and get feedback in minutes, not weeks.
Scorecard enables a self-improvement loop by combining realistic scenario simulation with scalable expert judgment:
  • Expert judgment at scale: Instead of waiting for SMEs to manually label production cases, encode their knowledge into Critic Agent Metrics that act as reward models — applying expert judgment consistently across every scenario automatically.
  • Fast feedback loops: Evaluate your agent through 10,000+ scenarios in minutes. Identify weaknesses, iterate, and validate improvements — all in the same day.
  • Broad scenario coverage: Test against tool-calling workflows, edge cases and adversarial inputs, multi-turn conversations (coming soon), and enterprise environments (coming soon).
The result: your team’s bottleneck moves from feedback (slow, getting slower) to building (fast, getting faster). Scorecard’s evaluation data also serves as high-quality training data for RLHF and reinforcement learning pipelines.
Scorecard provides SDKs for:
  • Python: Full-featured SDK with all capabilities
  • JavaScript/TypeScript: Complete Node.js and browser support
  • REST API: Universal HTTP access for any language
  • Framework integrations: Claude Agent SDK, LangChain, LlamaIndex, OpenAI, and more
Yes! Scorecard provides multiple pathways for improving AI agents:
Evaluation-Driven Improvement:
  • Identify failure patterns across test cases to focus improvement efforts
  • Track performance metrics over time to measure agent improvements
  • Use A/B comparison to validate that changes actually improve performance
  • Regression testing ensures new versions don’t break existing capabilities
Training Data Generation:
  • Evaluation results serve as high-quality feedback for RLHF workflows
  • Human-scored examples create preference pairs for fine-tuning
  • Score explanations provide detailed reasoning for model training
  • Export scored data for custom training pipelines
Continuous Feedback Loops:
  • Monitor production performance through tracing and observability
  • Create test cases from production failures for regression testing
  • Iterate on prompts, tools, and configurations with quantitative feedback
  • Multi-turn simulations test conversational improvements
See our Multi-turn Simulation and A/B Comparison docs for specific improvement workflows.

Limits and Constraints

Scorecard uses PostgreSQL, which supports up to 1GB per text field, effectively unlimited for evaluation use cases. Practical limits are more likely to come from your AI model’s context window than from database constraints.
For bulk imports and large datasets, file uploads support CSV, JSON, and JSONL formats. If you encounter limitations with particularly large datasets or need custom configurations, contact support@scorecard.io for assistance.
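If your data starts in one of these formats and you want another, the conversion is short. The sketch below turns CSV text into JSONL using only the Python standard library. The column names `input` and `expected_output` and the sample rows are illustrative, and `csv_to_jsonl` is a hypothetical helper, not part of any Scorecard SDK.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSONL, one JSON object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in reader)


# Illustrative rows only; replace with your own test data.
sample_csv = (
    "input,expected_output\n"
    "What is your refund policy?,Refunds are accepted within 30 days\n"
    "Do you ship overseas?,Yes\n"
)
jsonl = csv_to_jsonl(sample_csv)
print(jsonl)
```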
Scorecard implements rate limiting to ensure platform stability and fair usage across all customers. Rate limits vary by plan tier, with enterprise customers receiving custom limits based on their specific requirements.
Rate limit information is included in API response headers for monitoring usage. If you need higher limits for your use case, contact support to discuss your requirements.
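A client can use those headers to decide how long to wait before retrying. This page does not name the headers Scorecard sends, so the sketch below assumes the standard HTTP `Retry-After` header and otherwise falls back to capped exponential backoff with jitter. `seconds_to_wait` and its parameters are hypothetical; check the actual response headers before relying on it.

```python
import random


def seconds_to_wait(headers: dict, attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a delay in seconds before retrying a rate-limited request.

    Assumption: the server may send the standard Retry-After header.
    If it is missing or not a number, use capped exponential backoff with jitter.
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    retry_after = normalized.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall through to backoff.
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)


print(seconds_to_wait({"Retry-After": "7"}, attempt=0))
```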
Sim Agent Playbooks (persona instructions for multi-turn simulation) have the following limits:
  • Maximum playbook length: 50KB of text
  • Template variables: Up to 100 variables per playbook
  • Conversation turns: Maximum 100 turns per simulation (safety limit to prevent infinite loops)
  • Stop conditions: Multiple stop conditions can be combined (max turns, time, or content-based)
Playbooks support Jinja2 templating for dynamic content and can reference any field from your test cases. For example: {{item_to_return}}, {{customer_name}}, etc.
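Before uploading a playbook, you may want to check it locally against the limits above. The sketch below is a hypothetical helper, not Scorecard's own validation. It assumes 1KB means 1024 bytes and recognizes only simple `{{name}}` placeholders, which is a small subset of Jinja2 syntax.

```python
import re

MAX_PLAYBOOK_BYTES = 50 * 1024  # 50KB limit (assumes 1KB = 1024 bytes)
MAX_VARIABLES = 100             # template variable limit

# Matches simple {{ name }} placeholders only; full Jinja2 syntax is richer.
VARIABLE_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def check_playbook(playbook: str, test_case: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    size = len(playbook.encode("utf-8"))
    if size > MAX_PLAYBOOK_BYTES:
        problems.append(f"playbook is {size} bytes, over the {MAX_PLAYBOOK_BYTES}-byte limit")
    variables = set(VARIABLE_PATTERN.findall(playbook))
    if len(variables) > MAX_VARIABLES:
        problems.append(f"{len(variables)} variables, over the limit of {MAX_VARIABLES}")
    for name in sorted(variables - set(test_case)):
        problems.append(f"variable '{name}' is not a field in the test case")
    return problems


playbook = "You are {{customer_name}}. You want to return {{item_to_return}}."
print(check_playbook(playbook, {"customer_name": "Ada"}))
```

Here the check reports that `item_to_return` is referenced by the playbook but missing from the test case.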
Scorecard offers multiple ways to create test cases:
UI Creation:
  • Manually create test cases one at a time in the Testset editor
  • Define custom schemas with inputs, outputs, and metadata fields
  • Use the visual editor for quick iteration
Bulk Import:
  • Upload CSV, JSON, or JSONL files with test data
  • Automatic schema detection from imported data
  • Support for large datasets (thousands of test cases)
Programmatic Creation:
  • Use Python or Node.js SDK to create test cases via API
  • Generate synthetic test cases with LLMs
  • Import from production logs or existing datasets
From Production Data:
  • Convert traces from production into test cases
  • Create regression tests from production failures
  • Sample real user interactions for evaluation
See our Testsets documentation for detailed guides.
Yes! Scorecard supports using custom models for AI-based metric evaluation:
Model Options:
  • Scorecard-hosted models: GPT-4o, Claude 3.5 Sonnet, and other leading models
  • Custom endpoints: Point to your own model API for evaluation
  • Fine-tuned models: Use domain-specific evaluator models
  • Multiple models: Different metrics can use different evaluation models
Configuration:
  • Set model parameters (temperature, max tokens) per metric
  • Configure custom prompt templates in advanced mode
  • Control evaluation costs by selecting appropriate models
For enterprise customers, we can integrate with your on-premise models or private deployments.
Run:
  • A collection of test executions against a testset
  • Contains multiple records (one per test case)
  • Has aggregated metrics and statistics
  • Represents a snapshot of your system’s performance
Record:
  • A single test case execution with its results
  • Contains inputs, outputs, and scores from all metrics
  • Can have multiple scores (one per metric applied)
  • Represents one test case within a run
Example:
  • Run #42: Testing customer support agent v3.1
    • Record 1: Test case “refund request” → scored with 3 metrics = 3 scores
    • Record 2: Test case “product inquiry” → scored with 3 metrics = 3 scores
    • Record 3: Test case “complaint” → scored with 3 metrics = 3 scores
  • Total: 1 run, 3 records, 9 scores
See Runs & Results and Records for more details.
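The run, record, and score relationship from the example can be modeled in a few lines. This is a sketch of the concepts only, not the Scorecard data model. The `Run` and `Record` classes, the metric names, and the score values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One test case execution with one score per applied metric."""
    test_case: str
    scores: dict[str, float] = field(default_factory=dict)  # metric name -> score


@dataclass
class Run:
    """A collection of records produced against one testset."""
    name: str
    records: list[Record] = field(default_factory=list)

    def total_scores(self) -> int:
        return sum(len(record.scores) for record in self.records)

    def mean_by_metric(self) -> dict[str, float]:
        """Aggregate each metric across all records in the run."""
        grouped: dict[str, list[float]] = {}
        for record in self.records:
            for metric, score in record.scores.items():
                grouped.setdefault(metric, []).append(score)
        return {metric: sum(vals) / len(vals) for metric, vals in grouped.items()}


# Placeholder scores; three metrics applied to each of three test cases.
run = Run("Run #42: customer support agent v3.1", [
    Record("refund request", {"accuracy": 1.0, "safety": 1.0, "tone": 0.5}),
    Record("product inquiry", {"accuracy": 1.0, "safety": 1.0, "tone": 1.0}),
    Record("complaint", {"accuracy": 0.0, "safety": 1.0, "tone": 0.5}),
])
print(len(run.records), "records,", run.total_scores(), "scores")
print(run.mean_by_metric())
```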

Features and Capabilities

Metadata in Scorecard allows you to store additional context without affecting evaluations:
Testcase Metadata:
  • Mark fields as “metadata” in testset schemas
  • Stored with test cases but excluded from evaluation logic
  • Useful for tracking source, difficulty, categories, etc.
Trace Metadata:
  • Custom attributes on spans (user_id, session_id, etc.)
  • Model parameters and configuration data
  • Performance metrics and timing information
Run Metadata:
  • Git commit SHA, branch information
  • Environment details (staging, production)
  • Custom tags and labels for organization
Example usage:
{
  "input": "What's the weather?",
  "expected_output": "I'll help you check the weather",
  "source": "customer_support_logs",  // metadata
  "difficulty": "easy",              // metadata
  "created_by": "data_team"          // metadata
}
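To make the distinction concrete, the sketch below splits the test case above into the fields that take part in evaluation and the fields that are carried along as metadata. The `split_test_case` helper and the `METADATA_FIELDS` set are hypothetical; in Scorecard the metadata flag lives in the testset schema.

```python
# Fields marked as "metadata" in the testset schema (assumed for this example).
METADATA_FIELDS = {"source", "difficulty", "created_by"}


def split_test_case(test_case: dict, metadata_fields=METADATA_FIELDS):
    """Separate evaluated fields from metadata fields."""
    evaluated = {k: v for k, v in test_case.items() if k not in metadata_fields}
    metadata = {k: v for k, v in test_case.items() if k in metadata_fields}
    return evaluated, metadata


test_case = {
    "input": "What's the weather?",
    "expected_output": "I'll help you check the weather",
    "source": "customer_support_logs",
    "difficulty": "easy",
    "created_by": "data_team",
}
evaluated, metadata = split_test_case(test_case)
print(evaluated)
print(metadata)
```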
Scorecard automatically captures latency metrics across your AI pipeline:
Measurement Points:
  • End-to-end latency: Total request time from input to output
  • Model inference time: Time spent in model API calls
  • Processing time: Custom logic execution time
  • Network latency: Time spent in HTTP requests
Reporting:
  • Real-time dashboards: Live latency monitoring
  • Percentile analysis: P50, P90, P95, P99 latency breakdown
  • Trend analysis: Latency over time with alerting
  • Trace-level detail: Individual request timing breakdowns
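If you export latency samples and want to reproduce a percentile breakdown yourself, the standard library is enough. The sketch below uses `statistics.quantiles` with the inclusive method. Scorecard's own interpolation method is not stated here, so small differences from the dashboard are possible. The sample data is synthetic.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms, points=(50, 90, 95, 99)):
    """Return the requested percentiles from a list of latency samples."""
    # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {f"P{p}": cuts[p - 1] for p in points}


# Synthetic samples: 1 ms, 2 ms, ..., 101 ms.
samples = [float(ms) for ms in range(1, 102)]
print(latency_percentiles(samples))
```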
Yes! Scorecard is designed to test complex AI agents, not just simple prompts. Our platform supports:
Agentic Capabilities:
  • Multi-turn conversations: Test agents across realistic conversation flows using Sim Agents
  • Tool-calling agents: Evaluate agents that use function calling and API integrations
  • Multi-step workflows: Version complete agent configurations including prompts, tools, and routing logic
  • Agent APIs: Test deployed agent endpoints without code changes
How It Works:
  • Use Systems to version your complete agent configuration (prompts + tools + settings)
  • Use Multi-turn Simulation to test conversational agents with automated user personas (Sim Agents)
  • Use Custom Endpoints to evaluate agent APIs and HTTP endpoints
  • Apply metrics to agent outputs just like any other evaluation
  • Use Tracing to observe and debug multi-step agent executions
See our Tracing documentation for agent-specific features.
Scorecard supports evaluation of any AI agent or system accessible via API:
Model Types:
  • Large Language Models: GPT, Claude, Llama, Gemini, etc.
  • Embedding Models: OpenAI, Cohere, custom embeddings
  • Multimodal Models: Vision, audio, and text processing
  • Fine-tuned Models: Custom models hosted anywhere
Agent & System Types:
  • Conversational Agents: Multi-turn chatbots and virtual assistants
  • Tool-calling Agents: Function calling and API integrations
  • RAG Agents: Retrieval-augmented generation pipelines
  • Agentic Workflows: Multi-step reasoning and planning agents
  • Custom APIs: Any HTTP endpoint returning AI-generated content
Deployment Options:
  • Cloud APIs: OpenAI, Anthropic, Google, AWS Bedrock
  • Self-hosted: Models and agents running on your infrastructure
  • Hybrid: Combination of cloud and on-premise systems

Technical Details

Scorecard is built with privacy-by-design principles:
Data Security:
  • Encryption: All data encrypted in transit (TLS 1.3) and at rest (AES-256)
  • Access Controls: Role-based permissions and organization isolation
  • Audit Logging: Complete audit trail of all data access
  • Compliance: SOC 2 Type II, GDPR, and enterprise compliance
Privacy Controls:
  • Data Redaction: Automatic PII detection and masking
  • Data Retention: Configurable retention policies
  • Right to Delete: Complete data deletion capabilities
  • Data Residency: Control where your data is stored
See our Security and Privacy documentation for complete details.
Scorecard offers flexible deployment options:
Cloud Service (Recommended):
  • Fully managed service at app.scorecard.io
  • Automatic updates and maintenance
  • Global CDN and high availability
Enterprise On-Premise:
  • Self-hosted deployment in your infrastructure
  • Complete data sovereignty and control
  • Custom integrations with internal systems
  • Available for Enterprise customers
Hybrid Approach:
  • Evaluation logic runs on-premise
  • Results optionally synced to cloud dashboard
  • Best of both worlds for security-sensitive organizations
Contact enterprise@scorecard.io for on-premise deployment options.
Scorecard provides migration support for common evaluation platforms:
Data Migration:
  • Import existing test datasets (CSV, JSON, JSONL)
  • Convert evaluation metrics to Scorecard format
  • Migrate historical evaluation results
Common Migrations:
  • From custom scripts: Convert to Scorecard SDK calls
  • From academic benchmarks: Import MMLU, HellaSwag, etc.
  • From other platforms: Bulk export/import workflows
Migration Assistance:
  • Free migration consultation for Enterprise customers
  • Custom scripts for complex data transformations
  • Parallel running during transition period
Contact our support team at support@scorecard.io for personalized migration assistance.

Billing and Plans

Scorecard offers flexible pricing based on usage:
Starter (Free):
  • Unlimited users
  • 100,000 scores per month
  • Essential evaluation features for early-stage AI projects
Growth ($299/month):
  • Unlimited users
  • 1M scores per month included, then $1 per 5K additional scores
  • Test set management
  • Prompt playground access
  • Priority support
Enterprise (Custom pricing):
  • Custom solutions for large-scale AI deployments
  • SAML SSO and enterprise authentication
  • Dedicated support and customer success
  • Custom compliance and security features
Visit scorecard.io/pricing for the most current pricing information.
A score is counted each time Scorecard evaluates a single test case with a metric:
Examples:
  • 1 test case × 1 metric = 1 score
  • 1 test case × 3 metrics = 3 scores
  • 100 test cases × 2 metrics = 200 scores
Not Counted:
  • Viewing existing results
  • Creating/editing test cases
  • Monitoring and tracing (separate feature)
  • API calls for data retrieval
Bulk Discounts: Enterprise customers get volume discounts for large-scale evaluations.
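The counting rule, combined with the Growth plan figures listed earlier, makes it easy to estimate a monthly bill. The sketch below is an estimate only. It assumes partial 5K blocks of overage are rounded up, which this page does not state, and the function names are hypothetical. Check scorecard.io/pricing for current numbers.

```python
import math

GROWTH_BASE_USD = 299
GROWTH_INCLUDED_SCORES = 1_000_000
OVERAGE_BLOCK = 5_000          # overage is priced per 5K scores
OVERAGE_USD_PER_BLOCK = 1


def scores_used(test_cases: int, metrics: int) -> int:
    """One score per (test case, metric) pair."""
    return test_cases * metrics


def growth_monthly_cost(scores: int) -> int:
    """Estimated Growth plan cost in USD for a month's score count."""
    extra = max(0, scores - GROWTH_INCLUDED_SCORES)
    blocks = math.ceil(extra / OVERAGE_BLOCK)  # assumption: partial blocks round up
    return GROWTH_BASE_USD + blocks * OVERAGE_USD_PER_BLOCK


print(scores_used(100, 2))
print(growth_monthly_cost(1_010_000))
```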
