Getting Started
What is an eval/evaluation?
An evaluation (eval) is a structured test of how well an AI system performs against a defined set of test cases and scoring criteria. Every eval involves four core components (illustrated in the sketch after this list):
- Testset: A collection of input/output pairs to test against
- System: The AI model, prompt, or API endpoint being evaluated
- Metrics: Criteria used to score performance (accuracy, tone, safety, etc.)
- Run: The execution of the evaluation across all test cases
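To make the relationship between these components concrete, here is a minimal, framework-agnostic eval loop in Python. It is a conceptual sketch, not the Scorecard SDK: the testset, system, and metric below are local stand-ins.

```python
# Illustrative eval loop: testset -> system -> metric -> run results.
# Conceptual sketch only; not the Scorecard SDK.

# Testset: input/expected-output pairs
testset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

# System: the AI model, prompt, or endpoint under test (stubbed here)
def system(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"

# Metric: a scoring function (here, simple exact match)
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Run: execute the system on every test case and score the outputs
results = []
for case in testset:
    output = system(case["input"])
    results.append({
        "input": case["input"],
        "output": output,
        "score": exact_match(output, case["expected"]),
    })

print(f"Average score: {sum(r['score'] for r in results) / len(results):.2f}")
```

The same structure applies in Scorecard, except the testset, metrics, and run results live in the platform rather than in local variables.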
How does Scorecard differ from other AI evaluation tools?
What programming languages does Scorecard support?
Scorecard can be used from virtually any language (a generic HTTP sketch follows this list):
- Python: Full-featured SDK with all capabilities
- JavaScript/TypeScript: Complete Node.js and browser support
- REST API: Universal HTTP access for any language
- Framework integrations: LangChain, LlamaIndex, OpenAI, and more
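For the REST path, any language with an HTTP client works. The snippet below only sketches the shape of such a call: the base URL, route, and payload are placeholders, not documented Scorecard endpoints, so check the API reference for the real ones.

```python
import os
import requests

# Placeholder values -- substitute the real base URL, route, and payload
# from the Scorecard API reference; everything here is illustrative only.
API_BASE = os.environ.get("EVAL_API_BASE", "https://api.example.com")
API_KEY = os.environ["EVAL_API_KEY"]

response = requests.post(
    f"{API_BASE}/v1/runs",                                   # hypothetical route
    headers={"Authorization": f"Bearer {API_KEY}"},          # bearer-token auth pattern
    json={"testset_id": "ts_123", "metrics": ["accuracy"]},  # illustrative payload
)
response.raise_for_status()
print(response.json())
```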
Can Scorecard be used for RLHF and agent training?
Limits and Constraints
What are the text limits in Scorecard?
Are there rate limits for API usage?
What is the playbook text limit?
Playbooks used for multi-turn simulations are subject to the following limits (a small pre-flight check sketch follows this list):
- Maximum playbook length: 50KB of text
- Template variables: Up to 100 variables per playbook
- Conversation turns: Maximum 50 turns per simulation (safety limit)
- Stop conditions: Up to 10 custom stop conditions per simulation
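If you generate playbooks programmatically, a local pre-flight check can catch limit violations before an upload fails. This is an illustrative helper, not part of the SDK; the constants simply mirror the limits above.

```python
# Illustrative pre-flight check against the playbook limits listed above.
MAX_PLAYBOOK_BYTES = 50 * 1024   # 50KB of text
MAX_TEMPLATE_VARIABLES = 100     # variables per playbook
MAX_CONVERSATION_TURNS = 50      # turns per simulation (safety limit)
MAX_STOP_CONDITIONS = 10         # custom stop conditions per simulation

def validate_playbook(text: str, variables: list[str],
                      max_turns: int, stop_conditions: list[str]) -> list[str]:
    """Return a list of limit violations (empty list means the playbook is within limits)."""
    problems = []
    if len(text.encode("utf-8")) > MAX_PLAYBOOK_BYTES:
        problems.append("playbook text exceeds 50KB")
    if len(variables) > MAX_TEMPLATE_VARIABLES:
        problems.append("more than 100 template variables")
    if max_turns > MAX_CONVERSATION_TURNS:
        problems.append("more than 50 conversation turns")
    if len(stop_conditions) > MAX_STOP_CONDITIONS:
        problems.append("more than 10 stop conditions")
    return problems

print(validate_playbook("You are a helpful support agent...", ["customer_name"], 20, ["user says goodbye"]))
```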
Features and Capabilities
How does metadata work in Scorecard?
Testset metadata:
- Mark fields as “metadata” in testset schemas
- Stored with test cases but excluded from evaluation logic
- Useful for tracking source, difficulty, categories, etc.
Trace metadata (see the span example after this list):
- Custom attributes on spans (user_id, session_id, etc.)
- Model parameters and configuration data
- Performance metrics and timing information
Run metadata:
- Git commit SHA, branch information
- Environment details (staging, production)
- Custom tags and labels for organization
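As an example of attaching span-level metadata, the snippet below uses the standard OpenTelemetry Python API (the `opentelemetry-api` package, with a tracer provider and exporter assumed to be configured elsewhere). The attribute names are examples, not a required schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-ai-app")

# Attach custom attributes (user_id, session_id, model parameters) to a span.
# An exporter/processor must already be configured for these spans to be sent anywhere.
with tracer.start_as_current_span("llm-call") as span:
    span.set_attribute("user_id", "user_123")
    span.set_attribute("session_id", "sess_456")
    span.set_attribute("model", "gpt-4o")
    span.set_attribute("temperature", 0.2)
    # ... call the model here ...
    span.set_attribute("latency_ms", 842)
```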
How is latency measured and reported?
Latency is captured at several levels:
- End-to-end latency: Total request time from input to output
- Model inference time: Time spent in model API calls
- Processing time: Custom logic execution time
- Network latency: Time spent in HTTP requests
Reporting includes:
- Real-time dashboards: Live latency monitoring
- Percentile analysis: P50, P90, P95, P99 latency breakdown (see the example after this list)
- Trend analysis: Latency over time with alerting
- Trace-level detail: Individual request timing breakdowns
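For intuition on the percentile breakdown, this is the standard calculation applied to raw request latencies; the numbers here are made-up sample data, not Scorecard internals.

```python
# Computing the percentile breakdown (P50/P90/P95/P99) from raw latencies in milliseconds.
import numpy as np

latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 1500]  # example data

p50, p90, p95, p99 = np.percentile(latencies_ms, [50, 90, 95, 99])
print(f"P50={p50:.0f}ms  P90={p90:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```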
Can Scorecard test agentic workflows and multi-step AI agents?
Yes. Scorecard supports:
- Multi-turn conversations: Test agents across realistic conversation flows using Sim Agents
- Tool-calling agents: Evaluate agents that use function calling and API integrations
- Multi-step workflows: Version complete agent configurations including prompts, tools, and routing logic
- Agent APIs: Test deployed agent endpoints without code changes
To test an agent (a conceptual simulation sketch follows this list):
- Use Systems to version your complete agent configuration (prompts + tools + settings)
- Use Multi-turn Simulation to test conversational agents with automated user personas
- Use Custom Endpoints to evaluate agent APIs and HTTP endpoints
- Apply metrics to agent outputs just like any other evaluation
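Conceptually, a multi-turn simulation drives your agent with a simulated user persona until a stop condition or the turn limit is hit. The sketch below is generic and runnable but is not the Sim Agents API; `agent_reply` and `simulated_user_reply` are stand-ins for your agent endpoint and a persona-driven user model.

```python
# Conceptual multi-turn simulation loop (not the Sim Agents API).

MAX_TURNS = 50  # safety limit on conversation turns

def agent_reply(history: list[dict]) -> str:
    """Stand-in for your deployed agent endpoint."""
    return "How else can I help you?"

def simulated_user_reply(history: list[dict]) -> str:
    """Stand-in for a persona-driven simulated user (e.g. an LLM playing a frustrated customer)."""
    return "That answers it, thanks, goodbye."

def stop_condition(history: list[dict]) -> bool:
    """Example stop condition: the simulated user says goodbye."""
    return "goodbye" in history[-1]["content"].lower()

history: list[dict] = [{"role": "user", "content": "My order never arrived."}]
for _ in range(MAX_TURNS):
    history.append({"role": "assistant", "content": agent_reply(history)})
    history.append({"role": "user", "content": simulated_user_reply(history)})
    if stop_condition(history):
        break

# The transcript in `history` can then be scored with your metrics.
print(f"Conversation ended after {len(history)} messages.")
```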
What types of AI systems can Scorecard evaluate?
Models:
- Large Language Models: GPT, Claude, Llama, Gemini, etc.
- Embedding Models: OpenAI, Cohere, custom embeddings
- Multimodal Models: Vision, audio, and text processing
- Fine-tuned Models: Custom models hosted anywhere
Agents and pipelines:
- Conversational Agents: Multi-turn chatbots and virtual assistants
- Tool-calling Agents: Function calling and API integrations
- RAG Agents: Retrieval-augmented generation pipelines
- Agentic Workflows: Multi-step reasoning and planning agents
Deployment options:
- Custom APIs: Any HTTP endpoint returning AI-generated content (wrapped as in the sketch after this list)
- Cloud APIs: OpenAI, Anthropic, Google, AWS Bedrock
- Self-hosted: Models and agents running on your infrastructure
- Hybrid: Combination of cloud and on-premise systems
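Any HTTP endpoint that returns generated content can be treated as the system under test. The wrapper below reuses the `system(prompt) -> str` shape from the eval-loop sketch earlier; the URL and response field are assumptions about your own service, not a Scorecard requirement.

```python
import requests

ENDPOINT = "http://localhost:8000/generate"  # placeholder for your self-hosted service

def system(prompt: str) -> str:
    """Call a custom HTTP endpoint and return its generated text."""
    resp = requests.post(ENDPOINT, json={"prompt": prompt}, timeout=30)
    resp.raise_for_status()
    return resp.json()["text"]  # field name is an assumption about your API's response shape
```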
Technical Details
How does Scorecard handle sensitive data and privacy?
Security:
- Encryption: All data encrypted in transit (TLS 1.3) and at rest (AES-256)
- Access Controls: Role-based permissions and organization isolation
- Audit Logging: Complete audit trail of all data access
- Compliance: SOC 2 Type II, GDPR, and enterprise compliance
Privacy controls:
- Data Redaction: Automatic PII detection and masking (illustrated after this list)
- Data Retention: Configurable retention policies
- Right to Delete: Complete data deletion capabilities
- Data Residency: Control where your data is stored
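To show what PII masking means in practice, here is a toy redaction pass over email addresses and phone numbers. It illustrates the concept only; it is not Scorecard's detection pipeline, which the list above describes as automatic.

```python
import re

# Toy PII masking pass -- illustrative only, not Scorecard's detection pipeline.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact("Contact jane.doe@example.com or +1 (555) 010-2345 for access."))
```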
Can I run Scorecard evaluations offline or on-premise?
Cloud:
- Fully managed service at app.scorecard.io
- Automatic updates and maintenance
- Global CDN and high availability
On-premise:
- Self-hosted deployment in your infrastructure
- Complete data sovereignty and control
- Custom integrations with internal systems
- Available for Enterprise customers
Hybrid:
- Evaluation logic runs on-premise
- Results optionally synced to cloud dashboard
- Best of both worlds for security-sensitive organizations
How do I migrate from other evaluation tools?
Typical migration steps:
- Import existing test datasets (CSV, JSON, JSONL; see the conversion sketch after this list)
- Convert evaluation metrics to Scorecard format
- Migrate historical evaluation results
Common starting points:
- From custom scripts: Convert to Scorecard SDK calls
- From academic benchmarks: Import MMLU, HellaSwag, etc.
- From other platforms: Bulk export/import workflows
Migration support:
- Free migration consultation for Enterprise customers
- Custom scripts for complex data transformations
- Parallel running during transition period
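As a starting point for the import step, here is a small, generic conversion from a CSV of prompts and expected answers into JSONL test cases. The filename, column names (`input`, `expected`), and output keys are assumptions about your data, not a required Scorecard schema.

```python
import csv
import json

# Convert a CSV of test cases into JSONL, one test case per line.
# Filenames, column names, and output keys are placeholders for your own data.
with open("legacy_testcases.csv", newline="") as src, open("testcases.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        case = {
            "input": row["input"],
            "expected": row["expected"],
            "metadata": {"source": "legacy_tool"},  # optional bookkeeping fields
        }
        dst.write(json.dumps(case) + "\n")
```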
Billing and Plans
How does Scorecard pricing work?
Plans are tiered by monthly score volume:
First tier:
- Unlimited users
- 100,000 scores per month
- Essential evaluation features for early-stage AI projects
Second tier:
- Unlimited users
- Includes 1M scores per month, then $1 per 5K additional
- Test set management
- Prompt playground access
- Priority support
Enterprise:
- Custom solutions for large-scale AI deployments
- SAML SSO and enterprise authentication
- Dedicated support and customer success
- Custom compliance and security features
What counts as a score?
A score is one metric result for one test case (a worked usage example follows this list):
- 1 test case × 1 metric = 1 score
- 1 test case × 3 metrics = 3 scores
- 100 test cases × 2 metrics = 200 scores
The following do not consume scores:
- Viewing existing results
- Creating/editing test cases
- Monitoring and tracing (separate feature)
- API calls for data retrieval
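As a worked example of the arithmetic: the 1M-score allowance and the $1 per 5K overage come from the pricing answer above, while the usage figures are invented for illustration.

```python
# Worked example: monthly score usage and overage cost.
test_cases = 2_000
metrics_per_case = 3
runs_per_month = 200

scores_used = test_cases * metrics_per_case * runs_per_month  # 1,200,000 scores
included = 1_000_000                                          # included in the second tier
overage = max(0, scores_used - included)                      # 200,000 extra scores
overage_cost = (overage / 5_000) * 1.00                       # $1 per 5K additional -> $40

print(f"Scores used: {scores_used:,}")
print(f"Overage: {overage:,} scores -> ${overage_cost:,.2f}")
```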