New to Scorecard? Head straight to the Tracing Quickstart or jump into our ready-to-run Google Colab notebook to see traces in under 5 minutes.
LLM observability means knowing exactly what happened in every generation: latency, token usage, prompts, completions, cost, errors, and more. Scorecard’s Tracing collects this data automatically (via SDK wrappers, environment variables, OpenLLMetry, an LLM proxy, or direct OpenTelemetry) and displays rich visualizations in the Scorecard UI.

Why Tracing matters

  • Debug long or failing requests in seconds.
  • Audit prompts & completions for compliance and safety.
  • Attribute quality and cost back to specific services or users.
  • Feed production traffic into evaluations.

If you call it something else

  • Observability / AI spans / request logs: We capture standard OpenTelemetry traces and spans for LLM calls and related operations.
  • Agent runs / tools / function calls: These appear as nested spans in the trace tree, with inputs/outputs when available.
  • Prompt/Completion pairs: Extracted from common keys (openinference.*, ai.prompt / ai.response, gen_ai.*) so they can be turned into testcases and scored; see the sketch below.
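
As a rough sketch of how such attributes might look on a manually emitted span (the gen_ai.* key names below are illustrative; match whatever convention your instrumentation actually emits):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Illustrative only: a span carrying a prompt/completion pair as attributes
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.prompt.0.content", "What is 2 + 2?")
    span.set_attribute("gen_ai.completion.0.content", "4")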

Instrumentation Methods

Choose the method that fits your stack. All methods send traces to Scorecard’s OpenTelemetry endpoint.
For frameworks with built-in OpenTelemetry support (Claude Agent SDK, OpenAI Agents SDK), no code changes are required: just set the environment variables below and run your app.
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-scorecard-api-key>"
export BETA_TRACING_ENDPOINT="https://tracing.scorecard.io/otel"
export ENABLE_BETA_TRACING_DETAILED=1
export OTEL_RESOURCE_ATTRIBUTES="scorecard.project_id=<your-project-id>"
# No code changes required — just set the env vars above and run your app.
import anyio
from claude_agent_sdk import (
    AssistantMessage,
    TextBlock,
    query,
)


async def main():
    async for message in query(prompt="What is 2 + 2?"):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    print(f"Claude: {block.text}")


anyio.run(main)
See the Claude Agent SDK Tracing guide for details.
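
If you would rather configure the exporter in code than through environment variables, here is a minimal sketch using the standard OTLP HTTP exporter. The header and resource values mirror the env vars above, and the endpoint path is an assumption (OTLP/HTTP trace exporters usually append /v1/traces); confirm the exact ingestion path for your account.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Tag every span with your Scorecard project so traces land in the right place
resource = Resource.create({"scorecard.project_id": "<your-project-id>"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            # Assumed path; the env-var setup above uses https://tracing.scorecard.io/otel
            endpoint="https://tracing.scorecard.io/otel/v1/traces",
            headers={"Authorization": "Bearer <your-scorecard-api-key>"},
        )
    )
)
trace.set_tracer_provider(provider)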

View Traces in the Records Page

Ingested traces appear on the Records page alongside records from other sources (API, Playground, Kickoff).
Records page showing a list of traced records
Each row shows the record’s inputs, outputs, status, source, trace ID, and metric scores. You can search by inputs, outputs, and expected values, or filter by status, source, trace.id, run.id, testcase.id, metric, metadata, time range, and more. Use Quick Filters for presets like “My Recent Records”, and customize visible columns via the Edit Table modal. Open a record to see more details, including the span tree, conversation (for chat-based traces), timeline, and scores:
Record detail view showing spans with inputs, outputs, and timing

Trace Grouping

When running batch operations or multi-step workflows, you can group related traces into a single run using the scorecard.tracing_group_id span attribute. This makes it easier to track and analyze workflows that span multiple LLM calls.

How it works

Add the scorecard.tracing_group_id attribute to your spans with a shared identifier. Scorecard automatically groups spans with the same group ID into a single run.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Use the same group_id for all related operations
group_id = "batch-job-123"

with tracer.start_as_current_span("process_document") as span:
    span.set_attribute("scorecard.tracing_group_id", group_id)
    # Your LLM call here
    
with tracer.start_as_current_span("summarize_results") as span:
    span.set_attribute("scorecard.tracing_group_id", group_id)
    # Another LLM call in the same workflow

Use cases

  • Batch processing: Group all documents processed in a single batch job
  • Multi-step agents: Track all LLM calls within an agent’s execution
  • Workflows: Link related operations across different services
  • A/B testing: Group traces by experiment variant for comparison
Traces with the same scorecard.tracing_group_id will appear together in the same run, making it easy to analyze aggregate metrics across related operations.
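
For example, a batch job might generate one group ID per run and attach it to every document it processes (process_document here is a stand-in for your own logic):

import uuid

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# One group ID for the whole batch run
group_id = f"batch-{uuid.uuid4()}"

for doc in ["doc-1.txt", "doc-2.txt", "doc-3.txt"]:
    with tracer.start_as_current_span("process_document") as span:
        span.set_attribute("scorecard.tracing_group_id", group_id)
        span.set_attribute("document.name", doc)
        # your LLM call for this document goes here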

AI-Specific Error Detection

Scorecard’s tracing goes beyond technical failures to detect AI-specific behavioral issues that traditional observability misses. The system acts as an always-on watchdog, analyzing every AI interaction to catch both obvious technical errors and subtle behavioral problems that could impact user experience.

Silent Failure Detection

The most dangerous errors in AI systems are “silent failures”, where your AI returns a response, but the wrong one. Scorecard automatically detects behavioral errors including off-topic responses, workflow interruptions, safety violations, hallucinations, and context loss. These silent failures often go unnoticed without specialized AI observability, yet they can severely impact user trust and application effectiveness.

Technical errors like rate limits, timeouts, and API failures are captured automatically through standard trace error recording. However, AI applications also face unique challenges like semantic drift, safety policy violations, factual accuracy issues, and task completion failures that require intelligent analysis beyond traditional error logging.

Custom Error Detection

Create custom metrics through Scorecard’s UI to detect application-specific behavioral issues. Design AI-powered metrics that analyze trace data for off-topic responses, safety violations, or task completion failures. These custom metrics automatically evaluate your traces and surface problematic interactions that would otherwise go unnoticed in production.

Supported Frameworks & Providers

Scorecard traces LLM applications built with popular open-source frameworks through OpenLLMetry. OpenLLMetry provides automatic instrumentation for:

Application Frameworks

  • CrewAI – Multi-agent collaboration
  • Haystack – Search and question-answering pipelines
  • LangChain – Chains, agents, and tool calls
  • Langflow – Visual workflow builder
  • LangGraph – Multi-step workflows and state machines
  • LiteLLM – Unified interface for 100+ LLMs
  • LlamaIndex – RAG pipelines and document retrieval
  • OpenAI Agents SDK – Assistants API and function calling
  • Vercel AI SDK – Full-stack AI applications
Scorecard is featured as a recommended observability provider in the official Vercel AI SDK documentation and OpenAI Agents Python README.

LLM Providers

  • Aleph Alpha
  • Anthropic
  • AWS Bedrock
  • AWS SageMaker
  • Azure OpenAI
  • Cohere
  • Google Gemini
  • Google Vertex AI
  • Groq
  • HuggingFace
  • IBM Watsonx AI
  • Mistral AI
  • Ollama
  • OpenAI
  • Replicate
  • Together AI
  • and more

Vector Databases

  • Chroma
  • LanceDB
  • Marqo
  • Milvus
  • Pinecone
  • Qdrant
  • Weaviate
For the complete list of supported integrations, see the OpenLLMetry repository. All integrations are built on OpenTelemetry standards and maintained by the community.
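
Whichever framework you use, OpenLLMetry instrumentation is typically enabled by initializing its SDK and pointing the exporter at Scorecard. A minimal sketch, assuming the traceloop-sdk package and the endpoint and API key shown earlier (check the OpenLLMetry docs for the exact options your version supports):

from traceloop.sdk import Traceloop

# Initialize OpenLLMetry and route traces to Scorecard's OTLP endpoint
Traceloop.init(
    app_name="my-llm-app",
    api_endpoint="https://tracing.scorecard.io/otel",
    headers={"Authorization": "Bearer <your-scorecard-api-key>"},
)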

Custom Providers

For frameworks or providers not listed above, you can use:
  • HTTP Instrumentation: OpenLLMetry’s instrument_http() for HTTP-based APIs
  • Manual Spans: Emit custom OpenTelemetry spans for proprietary systems
See the OpenLLMetry documentation for manual instrumentation guides.
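
As a rough sketch of the manual-span route for a proprietary model client (my_model_client and the attribute keys are placeholders; the error-handling pattern uses the standard OpenTelemetry API):

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def call_proprietary_model(prompt: str) -> str:
    with tracer.start_as_current_span("proprietary_model.generate") as span:
        span.set_attribute("gen_ai.prompt.0.content", prompt)  # illustrative key
        try:
            response = my_model_client.generate(prompt)  # placeholder for your client
            span.set_attribute("gen_ai.completion.0.content", response)
            return response
        except Exception as exc:
            # Record the failure so it shows up as an errored span in Scorecard
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise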

Use cases

  • Production observability for LLM quality and safety
  • Debugging slow/failed requests with full span context
  • Auditing prompts/completions for compliance
  • Attributing token cost and latency to services/cohorts
  • Building evaluation datasets from real traffic (Trace to Testcase)

Next steps

  1. Follow the Tracing Quickstart to send your first trace.
  2. Set up Claude Agent SDK Tracing with zero code changes.
  3. Instrument LangChain chains and agents.
  4. Use the Vercel AI SDK wrapper for full-stack AI apps.
  5. Open the Colab notebook for an interactive tour.
Happy tracing! 🚀