New to Scorecard? Head straight to the Tracing Quickstart or jump into our ready-to-run Google Colab notebook to see traces in under 5 minutes.
Why Tracing matters
- Debug long or failing requests in seconds.
- Audit prompts & completions for compliance and safety.
- Attribute quality and cost back to specific services or users.
- Feed production traffic into evaluations and monitoring.
If you call it something else
- Observability / AI spans / request logs: We capture standard OpenTelemetry traces and spans for LLM calls and related operations.
- Agent runs / tools / function calls: These appear as nested spans in the trace tree, with inputs/outputs when available.
- Prompt/Completion pairs: Extracted from common keys (`openinference.*`, `ai.prompt`/`ai.response`, `gen_ai.*`) so they can be turned into testcases and scored.
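For example, with the OpenTelemetry Python SDK a single LLM call might be recorded like this; the `gen_ai.*` keys follow the conventions above (check the Quickstart for the exact keys Scorecard extracts), and the model call itself is a stand-in:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-llm-service")

def call_my_model(question: str) -> str:
    # Stand-in for your real provider call.
    return "42"

def answer(question: str) -> str:
    # One span per LLM call; nested tool/function calls become child spans.
    with tracer.start_as_current_span("chat.completion") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
        span.set_attribute("gen_ai.prompt", question)
        completion = call_my_model(question)
        span.set_attribute("gen_ai.completion", completion)
        return completion
```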
Instrument once, capture everything
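Any OpenTelemetry-compatible setup works. As a minimal sketch, assuming you export OTLP over HTTP, configuration looks roughly like this; the endpoint and header below are placeholders, so take the real values from the Tracing Quickstart:

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Placeholder endpoint and header names: use the values from the Quickstart.
exporter = OTLPSpanExporter(
    endpoint="https://tracing.scorecard.example/otel/v1/traces",
    headers={"Authorization": f"Bearer {os.environ['SCORECARD_API_KEY']}"},
)

provider = TracerProvider(resource=Resource.create({"service.name": "my-llm-service"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```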
Explore traces in Scorecard

Traces dashboard – search, filters, cost & scores
- Timestamp & duration – when and how long the request ran.
- Service & span tree – navigate nested function/tool calls (see code reference in `trace-table.tsx`).
- Token & cost breakdown – estimate spend per trace via model pricing (see the sketch after this list).
- Scores column – if a trace links to an evaluation run, the results appear inline (`TraceScoresCell`).
- Full-text search & filters – search any span attribute (`searchText`) or limit to a specific project/time range.
- Copyable Trace ID – quickly copy and share trace identifiers.
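The cost column is derived from the trace's token counts and per-model pricing; conceptually it is just the arithmetic below (prices here are illustrative, not Scorecard's pricing table):

```python
# Illustrative per-1M-token prices; real values come from the provider's price sheet.
PRICING = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}  # USD per 1M tokens

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

print(estimate_cost("gpt-4o-mini", 1_200, 350))  # -> 0.00039 USD
```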
Search & filters
- Time ranges: 30m, 24h, 3d, 7d, 30d, All.
- Project scope: toggle between Current project and All projects.
- `searchText`: full-text search across span/resource attributes (including prompt/response fields).
- Match previews: quick context snippets with deep links to traces.
- Cursor pagination: efficient browsing with shareable URLs.
Turn traces into testcases
Live traffic exposes edge cases that synthetic datasets miss. From any span that contains prompt/response attributes, click Create Testcase and Scorecard will:
- Extract the `openinference.*`, `ai.prompt`/`ai.response`, or `gen_ai.*` fields (sketched below).
- Save the pair into a chosen Testset.
- Make it immediately available for offline evaluation runs.
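Conceptually, the extraction is a lookup over the known key families; a rough sketch follows (the key names and helper here are illustrative, not the Scorecard SDK):

```python
# Illustrative only: walk the known key families and return the first match.
PROMPT_KEYS = ["openinference.input.value", "ai.prompt", "gen_ai.prompt"]
RESPONSE_KEYS = ["openinference.output.value", "ai.response", "gen_ai.completion"]

def span_to_testcase(attributes: dict) -> dict | None:
    prompt = next((attributes[k] for k in PROMPT_KEYS if k in attributes), None)
    response = next((attributes[k] for k in RESPONSE_KEYS if k in attributes), None)
    if prompt is None or response is None:
        return None
    return {"input": prompt, "expected": response}  # saved into your chosen Testset
```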
Continuous Monitoring
Tracing is the foundation for production quality tracking. Monitors periodically sample recent LLM spans, score them with your chosen metrics, and surface trends right back in the traces view.
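Conceptually, a monitor is a sample-and-score loop over recent spans; the sketch below uses hypothetical `fetch_recent_llm_spans` and `score` helpers to show the idea, while the real work happens inside Scorecard:

```python
import random

def run_monitor(fetch_recent_llm_spans, score, sample_rate=0.1, search_text=None):
    """Score a random sample of recent LLM spans, as a single monitor tick would."""
    spans = fetch_recent_llm_spans(search_text=search_text)  # e.g. the last 30 minutes
    sampled = [span for span in spans if random.random() < sample_rate]
    # Each result stays linked to its trace, so scores appear inline on the Traces page.
    return [{"trace_id": span["trace_id"], "score": score(span)} for span in sampled]
```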
Monitor results – production traces with scores

Traces search with monitor scores
- Select metrics, frequency, sample rate & filters (including full-text `searchText`).
- Scores appear inline on the Traces page and aggregate in the Runs section.
- Detect drift and regressions without extra instrumentation.
AI-Specific Error Detection
Scorecard’s tracing goes beyond technical failures to detect AI-specific behavioral issues that traditional monitoring misses. The system acts as an always-on watchdog, analyzing every AI interaction to catch both obvious technical errors and subtle behavioral problems that could impact user experience.
Silent Failure Detection
The most dangerous errors in AI systems are “silent failures”, where your AI responds but incorrectly. Scorecard automatically detects behavioral errors including off-topic responses, workflow interruptions, safety violations, hallucinations, and context loss. These silent failures often go unnoticed without specialized AI monitoring but can severely impact user trust and application effectiveness. Technical errors like rate limits, timeouts, and API failures are captured automatically through standard trace error recording. However, AI applications also face unique challenges like semantic drift, safety policy violations, factual accuracy issues, and task completion failures that require intelligent analysis beyond traditional error logging.
Custom Error Detection
Create custom metrics through Scorecard’s UI to detect application-specific behavioral issues. Design AI-powered metrics that analyze trace data for off-topic responses, safety violations, or task completion failures. These custom metrics automatically evaluate your traces and surface problematic interactions that would otherwise go unnoticed in production.
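As a sketch of what such a metric checks, an LLM-as-judge test for off-topic responses might look like this; in Scorecard you would define the equivalent metric in the UI rather than writing this code:

```python
from openai import OpenAI

client = OpenAI()

def is_off_topic(prompt: str, response: str) -> bool:
    """Ask a judge model whether the response actually addresses the prompt."""
    judgement = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Does the RESPONSE address the PROMPT? Answer YES or NO.\n"
                f"PROMPT: {prompt}\nRESPONSE: {response}"
            ),
        }],
    )
    return judgement.choices[0].message.content.strip().upper().startswith("NO")
```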
OpenAI Agents & custom providers
Scorecard works with any provider that adheres to OpenTelemetry semantics. Out-of-the-box integrations:
- OpenAI (ChatCompletion, Assistants/Agents)
- Anthropic Claude
- Google Gemini
- Groq LPU
- AWS Bedrock
For other providers, use `instrument_http` or emit spans manually; see Custom Providers.
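For a provider without a built-in integration, a manual span can wrap the call and record both the payload and any technical failure; in this sketch `call_provider` is a stand-in for your own client:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("custom-provider")

def call_provider(prompt: str) -> str:
    # Stand-in for your own HTTP or SDK call.
    return "..."

def traced_completion(prompt: str) -> str:
    with tracer.start_as_current_span("custom_llm.completion") as span:
        span.set_attribute("ai.prompt", prompt)
        try:
            response = call_provider(prompt)
        except Exception as exc:
            # Rate limits, timeouts, and API failures surface as trace errors.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
        span.set_attribute("ai.response", response)
        return response
```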
Use cases
- Production monitoring of LLM quality and safety
- Debugging slow/failed requests with full span context
- Auditing prompts/completions for compliance
- Attributing token cost and latency to services/cohorts
- Building evaluation datasets from real traffic (Trace to Testcase)
- Closing the loop with auto-scoring Monitors and linked Runs
Next steps
- Follow the Quickstart to send your first trace.
- Open the Colab notebook for an interactive tour.
- Convert live traffic to evaluations with Trace to Testcase.
- Add a Monitor to keep an eye on production quality.