Getting Started
What is an eval/evaluation?
An eval (evaluation) is a structured test of an AI system against a set of test cases, scored by one or more metrics. The core concepts:
- Testset: A collection of test cases with inputs and expected outputs
- System: The AI agent, prompt, API endpoint, or workflow being evaluated
- Metrics: Scoring criteria (accuracy, safety, tone, etc.) with AI, Human, or Code-based evaluation
- Record: A single evaluation result capturing inputs, outputs, and scores
How does Scorecard differ from other AI evaluation tools?
What is simulation and how is it different from traditional evals?
Traditional evals rely on test cases drawn from production traffic, which creates several problems:
- Coverage gaps: You miss edge cases until real users hit them
- Slow feedback: Collecting enough production data takes days or weeks
- Expert bottleneck: SMEs spend time labeling individual cases instead of teaching the system what “good” looks like
- Ceiling on improvement: You’re limited to scenarios that happen to occur in production
How does Scorecard help agents self-improve?
- Expert judgment at scale: Instead of waiting for SMEs to manually label production cases, encode their knowledge into Critic Agent Metrics that act as reward models — applying expert judgment consistently across every scenario automatically.
- Fast feedback loops: Evaluate your agent through 10,000+ scenarios in minutes. Identify weaknesses, iterate, and validate improvements — all in the same day.
- Broad scenario coverage: Test against tool-calling workflows, edge cases and adversarial inputs, multi-turn conversations (coming soon), and enterprise environments (coming soon).
What programming languages does Scorecard support?
- Python: Full-featured SDK with all capabilities
- JavaScript/TypeScript: Complete Node.js and browser support
- REST API: Universal HTTP access for any language
- Framework integrations: Claude Agent SDK, LangChain, LlamaIndex, OpenAI, and more
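Because the REST API is plain HTTP, languages without an official SDK can still integrate. A minimal Python sketch of assembling such a request; the endpoint path and payload fields here are illustrative assumptions, not the documented API:

```python
def build_record_request(api_key: str, run_id: str, payload: dict) -> dict:
    """Assemble an HTTP request for creating an evaluation record.

    The URL path below is a hypothetical example, not the documented route --
    consult the API reference for real endpoints.
    """
    return {
        "method": "POST",
        "url": f"https://api.scorecard.io/v1/runs/{run_id}/records",
        "headers": {
            "Authorization": f"Bearer {api_key}",  # standard bearer-token auth
            "Content-Type": "application/json",
        },
        "json": payload,
    }
```

The resulting dict can be passed to any HTTP client (e.g. `requests.request(**req)`), which is what makes the REST surface language-agnostic.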
Can Scorecard be used for agent improvement and continuous learning?
- Identify failure patterns across test cases to focus improvement efforts
- Track performance metrics over time to measure agent improvements
- Use A/B comparison to validate that changes actually improve performance
- Regression testing ensures new versions don’t break existing capabilities
- Evaluation results serve as high-quality feedback for RLHF workflows
- Human-scored examples create preference pairs for fine-tuning
- Score explanations provide detailed reasoning for model training
- Export scored data for custom training pipelines
- Monitor production performance through tracing and observability
- Create test cases from production failures for regression testing
- Iterate on prompts, tools, and configurations with quantitative feedback
- Multi-turn simulations test conversational improvements
Limits and Constraints
What are the text limits in Scorecard?
Are there rate limits for API usage?
What is the playbook text limit?
- Maximum playbook length: 50KB of text
- Template variables: Up to 100 variables per playbook
- Conversation turns: Maximum 100 turns per simulation (safety limit to prevent infinite loops)
- Stop conditions: Multiple stop conditions can be combined (max turns, time, or content-based)
Example template variables: {{item_to_return}}, {{customer_name}}, etc.
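The limits above are easy to check before uploading. A small Python sketch, assuming the 50KB cap applies to UTF-8 bytes and that variables use `{{name}}` syntax; `check_playbook` is a hypothetical helper, not part of the SDK:

```python
import re

MAX_PLAYBOOK_BYTES = 50 * 1024   # 50KB of text per playbook
MAX_TEMPLATE_VARS = 100          # up to 100 variables per playbook

def check_playbook(text: str) -> list[str]:
    """Return a list of limit violations for a playbook (empty = OK)."""
    problems = []
    if len(text.encode("utf-8")) > MAX_PLAYBOOK_BYTES:
        problems.append("playbook exceeds 50KB")
    # Count distinct {{variable}} placeholders.
    variables = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", text))
    if len(variables) > MAX_TEMPLATE_VARS:
        problems.append(f"{len(variables)} template variables (max {MAX_TEMPLATE_VARS})")
    return problems
```

Running the check in CI keeps oversized playbooks from failing at simulation time.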
How do I create test cases?
- Manually create test cases one at a time in the Testset editor
- Define custom schemas with inputs, outputs, and metadata fields
- Use the visual editor for quick iteration
- Upload CSV, JSON, or JSONL files with test data
- Automatic schema detection from imported data
- Support for large datasets (thousands of test cases)
- Use Python or Node.js SDK to create test cases via API
- Generate synthetic test cases with LLMs
- Import from production logs or existing datasets
- Convert traces from production into test cases
- Create regression tests from production failures
- Sample real user interactions for evaluation
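For file-based import, JSONL is the simplest format to produce programmatically: one JSON object per line, using whatever schema your testset defines. A self-contained sketch (the field names are made up for illustration):

```python
import json
import os
import tempfile

# Hypothetical test cases with an input, expected output, and a metadata field.
test_cases = [
    {"input": "I want a refund for order 123", "expected": "refund flow", "source": "synthetic"},
    {"input": "Does the X200 support USB-C?", "expected": "product inquiry", "source": "production"},
]

path = os.path.join(tempfile.gettempdir(), "testset.jsonl")
with open(path, "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")  # one JSON object per line

# Read it back the same way an importer would.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```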
Can I use custom AI models for evaluation?
- Scorecard-hosted models: GPT-4o, Claude 3.5 Sonnet, and other leading models
- Custom endpoints: Point to your own model API for evaluation
- Fine-tuned models: Use domain-specific evaluator models
- Multiple models: Different metrics can use different evaluation models
- Set model parameters (temperature, max tokens) per metric
- Configure custom prompt templates in advanced mode
- Control evaluation costs by selecting appropriate models
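Per-metric evaluator settings can be kept in a plain config structure in your own code. A hedged sketch; the field names and model identifiers below are illustrative, not Scorecard's schema:

```python
# Hypothetical per-metric evaluator configuration: each metric can use a
# different model with its own parameters.
metrics_config = {
    "accuracy": {
        "model": "gpt-4o",
        "temperature": 0.0,   # deterministic grading for factual checks
        "max_tokens": 512,
    },
    "tone": {
        "model": "claude-3-5-sonnet",
        "temperature": 0.3,   # mild variability is acceptable for style judgments
        "max_tokens": 256,
    },
}

def evaluator_params(metric_name: str) -> dict:
    """Look up the evaluation-model settings for a given metric."""
    return metrics_config[metric_name]
```

Keeping these settings in one place makes it easy to swap in a cheaper model for high-volume metrics.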
What's the difference between a Run and a Record?
A Run is:
- A collection of test executions against a testset
- Contains multiple records (one per test case)
- Has aggregated metrics and statistics
- Represents a snapshot of your system’s performance
A Record is:
- A single test case execution with its results
- Contains inputs, outputs, and scores from all metrics
- Can have multiple scores (one per metric applied)
- Represents one test case within a run
- Run #42: Testing customer support agent v3.1
- Record 1: Test case “refund request” → scored with 3 metrics = 3 scores
- Record 2: Test case “product inquiry” → scored with 3 metrics = 3 scores
- Record 3: Test case “complaint” → scored with 3 metrics = 3 scores
- Total: 1 run, 3 records, 9 scores
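The bookkeeping above reduces to one multiplication: records × metrics = scores. A sketch using the Run #42 numbers:

```python
def count_scores(num_test_cases: int, num_metrics: int) -> int:
    # One record per test case; each record gets one score per metric.
    return num_test_cases * num_metrics

# Run #42 from the example: 3 test cases, each scored with 3 metrics.
records = 3
metrics = 3
total_scores = count_scores(records, metrics)  # 1 run, 3 records, 9 scores
```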
Features and Capabilities
How does metadata work in Scorecard?
Testset metadata:
- Mark fields as “metadata” in testset schemas
- Stored with test cases but excluded from evaluation logic
- Useful for tracking source, difficulty, categories, etc.
Trace and span metadata:
- Custom attributes on spans (user_id, session_id, etc.)
- Model parameters and configuration data
- Performance metrics and timing information
Run and deployment metadata:
- Git commit SHA, branch information
- Environment details (staging, production)
- Custom tags and labels for organization
How is latency measured and reported?
What is measured:
- End-to-end latency: Total request time from input to output
- Model inference time: Time spent in model API calls
- Processing time: Custom logic execution time
- Network latency: Time spent in HTTP requests
How it is reported:
- Real-time dashboards: Live latency monitoring
- Percentile analysis: P50, P90, P95, P99 latency breakdown
- Trend analysis: Latency over time with alerting
- Trace-level detail: Individual request timing breakdowns
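The percentile breakdown can be reproduced locally from raw latency samples, e.g. with Python's standard library:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute the P50/P90/P95/P99 breakdown from latency samples in ms."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p90": q[89], "p95": q[94], "p99": q[98]}

# Sanity check with 1..1000 ms of evenly spread samples.
stats = latency_percentiles([float(i) for i in range(1, 1001)])
```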
Can Scorecard test agentic workflows and multi-step AI agents?
- Multi-turn conversations: Test agents across realistic conversation flows using Sim Agents
- Tool-calling agents: Evaluate agents that use function calling and API integrations
- Multi-step workflows: Version complete agent configurations including prompts, tools, and routing logic
- Agent APIs: Test deployed agent endpoints without code changes
- Use Systems to version your complete agent configuration (prompts + tools + settings)
- Use Multi-turn Simulation to test conversational agents with automated user personas (Sim Agents)
- Use Custom Endpoints to evaluate agent APIs and HTTP endpoints
- Apply metrics to agent outputs just like any other evaluation
- Use Tracing to observe and debug multi-step agent executions
What types of AI systems can Scorecard evaluate?
- Large Language Models: GPT, Claude, Llama, Gemini, etc.
- Embedding Models: OpenAI, Cohere, custom embeddings
- Multimodal Models: Vision, audio, and text processing
- Fine-tuned Models: Custom models hosted anywhere
- Conversational Agents: Multi-turn chatbots and virtual assistants
- Tool-calling Agents: Function calling and API integrations
- RAG Agents: Retrieval-augmented generation pipelines
- Agentic Workflows: Multi-step reasoning and planning agents
- Custom APIs: Any HTTP endpoint returning AI-generated content
- Cloud APIs: OpenAI, Anthropic, Google, AWS Bedrock
- Self-hosted: Models and agents running on your infrastructure
- Hybrid: Combination of cloud and on-premise systems
Technical Details
How does Scorecard handle sensitive data and privacy?
- Encryption: All data encrypted in transit (TLS 1.3) and at rest (AES-256)
- Access Controls: Role-based permissions and organization isolation
- Audit Logging: Complete audit trail of all data access
- Compliance: SOC 2 Type II, GDPR, and enterprise compliance
- Data Redaction: Automatic PII detection and masking
- Data Retention: Configurable retention policies
- Right to Delete: Complete data deletion capabilities
- Data Residency: Control where your data is stored
Can I run Scorecard evaluations offline or on-premise?
Cloud:
- Fully managed service at app.scorecard.io
- Automatic updates and maintenance
- Global CDN and high availability
On-premise:
- Self-hosted deployment in your infrastructure
- Complete data sovereignty and control
- Custom integrations with internal systems
- Available for Enterprise customers
Hybrid:
- Evaluation logic runs on-premise
- Results optionally synced to cloud dashboard
- Best of both worlds for security-sensitive organizations
How do I migrate from other evaluation tools?
- Import existing test datasets (CSV, JSON, JSONL)
- Convert evaluation metrics to Scorecard format
- Migrate historical evaluation results
- From custom scripts: Convert to Scorecard SDK calls
- From academic benchmarks: Import MMLU, HellaSwag, etc.
- From other platforms: Bulk export/import workflows
- Free migration consultation for Enterprise customers
- Custom scripts for complex data transformations
- Parallel running during transition period
Billing and Plans
How does Scorecard pricing work?
Entry plan:
- Unlimited users
- 100,000 scores per month
- Essential evaluation features for early-stage AI projects
Paid plan:
- Unlimited users
- Includes 1M scores per month, then $1 per 5K additional
- Test set management
- Prompt playground access
- Priority support
Enterprise plan:
- Custom solutions for large-scale AI deployments
- SAML SSO and enterprise authentication
- Dedicated support and customer success
- Custom compliance and security features
What counts as a score?
Examples of score counting:
- 1 test case × 1 metric = 1 score
- 1 test case × 3 metrics = 3 scores
- 100 test cases × 2 metrics = 200 scores
Not counted as scores:
- Viewing existing results
- Creating/editing test cases
- Monitoring and tracing (separate feature)
- API calls for data retrieval