Getting Started
What is an eval/evaluation?
An eval (evaluation) is a structured test of an AI system against a set of test cases, scored by one or more metrics. The core concepts:
- Testset: A collection of test cases with inputs and expected outputs
- System: The AI agent, prompt, API endpoint, or workflow being evaluated
- Metrics: Scoring criteria (accuracy, safety, tone, etc.) with AI, Human, or Code-based evaluation
- Record: A single evaluation result capturing inputs, outputs, and scores
How does Scorecard differ from other AI evaluation tools?
What is simulation and how is it different from traditional evals?
Traditional evals rely on test cases drawn from production traffic, which creates several problems:
- Coverage gaps: You miss edge cases until real users hit them
- Slow feedback: Collecting enough production data takes days or weeks
- Expert bottleneck: SMEs spend time labeling individual cases instead of teaching the system what “good” looks like
- Ceiling on improvement: You’re limited to scenarios that happen to occur in production
How does Scorecard help agents self-improve?
- Expert judgment at scale: Instead of waiting for SMEs to manually label production cases, encode their knowledge into Critic Agent Metrics that act as reward models — applying expert judgment consistently across every scenario automatically.
- Fast feedback loops: Evaluate your agent through 10,000+ scenarios in minutes. Identify weaknesses, iterate, and validate improvements — all in the same day.
- Broad scenario coverage: Test against tool-calling workflows, edge cases and adversarial inputs, multi-turn conversations (coming soon), and enterprise environments (coming soon).
What programming languages does Scorecard support?
- Python: Full-featured SDK with all capabilities
- JavaScript/TypeScript: Complete Node.js and browser support
- REST API: Universal HTTP access for any language
- Framework integrations: Claude Agent SDK, LangChain, LlamaIndex, OpenAI, and more
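Because the REST API is plain HTTP, languages without an official SDK can still integrate. A minimal Python sketch of assembling such a request; the endpoint path and payload fields here are illustrative assumptions, not the documented API:

```python
def build_record_request(api_key: str, run_id: str, payload: dict) -> dict:
    """Assemble an HTTP request for creating an evaluation record.

    The URL path below is a hypothetical example, not the documented route --
    consult the API reference for real endpoints.
    """
    return {
        "method": "POST",
        "url": f"https://api.scorecard.io/v1/runs/{run_id}/records",
        "headers": {
            "Authorization": f"Bearer {api_key}",  # standard bearer-token auth
            "Content-Type": "application/json",
        },
        "json": payload,
    }
```

The resulting dict can be passed to any HTTP client (e.g. `requests.request(**req)`), which is what makes the REST surface language-agnostic.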
Can Scorecard be used for agent improvement and continuous learning?
- Identify failure patterns across test cases to focus improvement efforts
- Track performance metrics over time to measure agent improvements
- Use A/B comparison to validate that changes actually improve performance
- Regression testing ensures new versions don’t break existing capabilities
- Evaluation results serve as high-quality feedback for RLHF workflows
- Human-scored examples create preference pairs for fine-tuning
- Score explanations provide detailed reasoning for model training
- Export scored data for custom training pipelines
- Monitor production performance through tracing and observability
- Create test cases from production failures for regression testing
- Iterate on prompts, tools, and configurations with quantitative feedback
- Multi-turn simulations test conversational improvements
Limits and Constraints
What are the text limits in Scorecard?
Are there rate limits for API usage?
What is the playbook text limit?
- Maximum playbook length: 50KB of text
- Template variables: Up to 100 variables per playbook
- Conversation turns: Maximum 100 turns per simulation (safety limit to prevent infinite loops)
- Stop conditions: Multiple stop conditions can be combined (max turns, time, or content-based)
Example template variables: {{item_to_return}}, {{customer_name}}, etc.
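The limits above are easy to check before uploading. A small Python sketch, assuming the 50KB cap applies to UTF-8 bytes and that variables use `{{name}}` syntax; `check_playbook` is a hypothetical helper, not part of the SDK:

```python
import re

MAX_PLAYBOOK_BYTES = 50 * 1024   # 50KB of text per playbook
MAX_TEMPLATE_VARS = 100          # up to 100 variables per playbook

def check_playbook(text: str) -> list[str]:
    """Return a list of limit violations for a playbook (empty = OK)."""
    problems = []
    if len(text.encode("utf-8")) > MAX_PLAYBOOK_BYTES:
        problems.append("playbook exceeds 50KB")
    # Count distinct {{variable}} placeholders.
    variables = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", text))
    if len(variables) > MAX_TEMPLATE_VARS:
        problems.append(f"{len(variables)} template variables (max {MAX_TEMPLATE_VARS})")
    return problems
```

Running the check in CI keeps oversized playbooks from failing at simulation time.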
How do I create test cases?
- Manually create test cases one at a time in the Testset editor
- Define custom schemas with inputs, outputs, and metadata fields
- Use the visual editor for quick iteration
- Upload CSV, JSON, or JSONL files with test data
- Automatic schema detection from imported data
- Support for large datasets (thousands of test cases)
- Use Python or Node.js SDK to create test cases via API
- Generate synthetic test cases with LLMs
- Import from production logs or existing datasets
- Convert traces from production into test cases
- Create regression tests from production failures
- Sample real user interactions for evaluation
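For file-based import, JSONL is the simplest format to produce programmatically: one JSON object per line, using whatever schema your testset defines. A self-contained sketch (the field names are made up for illustration):

```python
import json
import os
import tempfile

# Hypothetical test cases with an input, expected output, and a metadata field.
test_cases = [
    {"input": "I want a refund for order 123", "expected": "refund flow", "source": "synthetic"},
    {"input": "Does the X200 support USB-C?", "expected": "product inquiry", "source": "production"},
]

path = os.path.join(tempfile.gettempdir(), "testset.jsonl")
with open(path, "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")  # one JSON object per line

# Read it back the same way an importer would.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```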
Can I use custom AI models for evaluation?
- Scorecard-hosted models: GPT-4o, Claude 3.5 Sonnet, and other leading models
- Custom endpoints: Point to your own model API for evaluation
- Fine-tuned models: Use domain-specific evaluator models
- Multiple models: Different metrics can use different evaluation models
- Set model parameters (temperature, max tokens) per metric
- Configure custom prompt templates in advanced mode
- Control evaluation costs by selecting appropriate models
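Per-metric evaluator settings can be kept in a plain config structure in your own code. A hedged sketch; the field names and model identifiers below are illustrative, not Scorecard's schema:

```python
# Hypothetical per-metric evaluator configuration: each metric can use a
# different model with its own parameters.
metrics_config = {
    "accuracy": {
        "model": "gpt-4o",
        "temperature": 0.0,   # deterministic grading for factual checks
        "max_tokens": 512,
    },
    "tone": {
        "model": "claude-3-5-sonnet",
        "temperature": 0.3,   # mild variability is acceptable for style judgments
        "max_tokens": 256,
    },
}

def evaluator_params(metric_name: str) -> dict:
    """Look up the evaluation-model settings for a given metric."""
    return metrics_config[metric_name]
```

Keeping these settings in one place makes it easy to swap in a cheaper model for high-volume metrics.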
What's the difference between a Run and a Record?
A Run is:
- A collection of test executions against a testset
- Contains multiple records (one per test case)
- Has aggregated metrics and statistics
- Represents a snapshot of your system’s performance
A Record is:
- A single test case execution with its results
- Contains inputs, outputs, and scores from all metrics
- Can have multiple scores (one per metric applied)
- Represents one test case within a run
- Run #42: Testing customer support agent v3.1
- Record 1: Test case “refund request” → scored with 3 metrics = 3 scores
- Record 2: Test case “product inquiry” → scored with 3 metrics = 3 scores
- Record 3: Test case “complaint” → scored with 3 metrics = 3 scores
- Total: 1 run, 3 records, 9 scores
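The bookkeeping above reduces to one multiplication: records × metrics = scores. A sketch using the Run #42 numbers:

```python
def count_scores(num_test_cases: int, num_metrics: int) -> int:
    # One record per test case; each record gets one score per metric.
    return num_test_cases * num_metrics

# Run #42 from the example: 3 test cases, each scored with 3 metrics.
records = 3
metrics = 3
total_scores = count_scores(records, metrics)  # 1 run, 3 records, 9 scores
```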
Features and Capabilities
How does metadata work in Scorecard?
Testset metadata:
- Mark fields as “metadata” in testset schemas
- Stored with test cases but excluded from evaluation logic
- Useful for tracking source, difficulty, categories, etc.
Trace and span metadata:
- Custom attributes on spans (user_id, session_id, etc.)
- Model parameters and configuration data
- Performance metrics and timing information
Run and deployment metadata:
- Git commit SHA, branch information
- Environment details (staging, production)
- Custom tags and labels for organization
How is latency measured and reported?
What is measured:
- End-to-end latency: Total request time from input to output
- Model inference time: Time spent in model API calls
- Processing time: Custom logic execution time
- Network latency: Time spent in HTTP requests
How it is reported:
- Real-time dashboards: Live latency monitoring
- Percentile analysis: P50, P90, P95, P99 latency breakdown
- Trend analysis: Latency over time with alerting
- Trace-level detail: Individual request timing breakdowns
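The percentile breakdown can be reproduced locally from raw latency samples, e.g. with Python's standard library:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute the P50/P90/P95/P99 breakdown from latency samples in ms."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    q = statistics.quantiles(samples_ms, n=100)
    return {"p50": q[49], "p90": q[89], "p95": q[94], "p99": q[98]}

# Sanity check with 1..1000 ms of evenly spread samples.
stats = latency_percentiles([float(i) for i in range(1, 1001)])
```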
Can Scorecard test agentic workflows and multi-step AI agents?
- Multi-turn conversations: Test agents across realistic conversation flows using Sim Agents
- Tool-calling agents: Evaluate agents that use function calling and API integrations
- Multi-step workflows: Version complete agent configurations including prompts, tools, and routing logic
- Agent APIs: Test deployed agent endpoints without code changes
- Use Systems to version your complete agent configuration (prompts + tools + settings)
- Use Multi-turn Simulation to test conversational agents with automated user personas (Sim Agents)
- Use Custom Endpoints to evaluate agent APIs and HTTP endpoints
- Apply metrics to agent outputs just like any other evaluation
- Use Tracing to observe and debug multi-step agent executions
What types of AI systems can Scorecard evaluate?
- Large Language Models: GPT, Claude, Llama, Gemini, etc.
- Embedding Models: OpenAI, Cohere, custom embeddings
- Multimodal Models: Vision, audio, and text processing
- Fine-tuned Models: Custom models hosted anywhere
- Conversational Agents: Multi-turn chatbots and virtual assistants
- Tool-calling Agents: Function calling and API integrations
- RAG Agents: Retrieval-augmented generation pipelines
- Agentic Workflows: Multi-step reasoning and planning agents
- Custom APIs: Any HTTP endpoint returning AI-generated content
- Cloud APIs: OpenAI, Anthropic, Google, AWS Bedrock
- Self-hosted: Models and agents running on your infrastructure
- Hybrid: Combination of cloud and on-premise systems
Technical Details
How does Scorecard handle sensitive data and privacy?
- Encryption: All data encrypted in transit (TLS 1.3) and at rest (AES-256)
- Access Controls: Role-based permissions and organization isolation
- Audit Logging: Complete audit trail of all data access
- Compliance: SOC 2 Type II, GDPR, and enterprise compliance
- Data Redaction: Automatic PII detection and masking
- Data Retention: Configurable retention policies
- Right to Delete: Complete data deletion capabilities
- Data Residency: Control where your data is stored
Can I run Scorecard evaluations offline or on-premise?
Cloud:
- Fully managed service at app.scorecard.io
- Automatic updates and maintenance
- Global CDN and high availability
On-premise:
- Self-hosted deployment in your infrastructure
- Complete data sovereignty and control
- Custom integrations with internal systems
- Available for Enterprise customers
Hybrid:
- Evaluation logic runs on-premise
- Results optionally synced to cloud dashboard
- Best of both worlds for security-sensitive organizations
How do I migrate from other evaluation tools?
- Import existing test datasets (CSV, JSON, JSONL)
- Convert evaluation metrics to Scorecard format
- Migrate historical evaluation results
- From custom scripts: Convert to Scorecard SDK calls
- From academic benchmarks: Import MMLU, HellaSwag, etc.
- From other platforms: Bulk export/import workflows
- Free migration consultation for Enterprise customers
- Custom scripts for complex data transformations
- Parallel running during transition period
Billing and Plans
How does Scorecard pricing work?
Entry plan:
- Unlimited users
- 100,000 scores per month
- Essential evaluation features for early-stage AI projects
Paid plan:
- Unlimited users
- Includes 1M scores per month, then $1 per 5K additional
- Test set management
- Prompt playground access
- Priority support
Enterprise plan:
- Custom solutions for large-scale AI deployments
- SAML SSO and enterprise authentication
- Dedicated support and customer success
- Custom compliance and security features
What counts as a score?
Examples of score counting:
- 1 test case × 1 metric = 1 score
- 1 test case × 3 metrics = 3 scores
- 100 test cases × 2 metrics = 200 scores
Not counted as scores:
- Viewing existing results
- Creating/editing test cases
- Monitoring and tracing (separate feature)
- API calls for data retrieval