Monitor results – traces and score UI.

Monitor results – production traces with scores.

Scorecard Monitors periodically sample recent LLM spans, extract prompts and completions, and automatically score them with your selected metrics. The results appear where you work: on the Traces page (scored traces) and the Runs pages (history and per‑run aggregates).

Why it matters

  • Measure real production quality and safety continuously, not just in staging.
  • Detect drift early and pinpoint regressions to specific topics or services.
  • Close the loop between observability and evaluation with automatic scoring.
  • Quantify improvements after model/prompt updates with linked runs and trends.

If you’re coming from other tools

  • What it is: Very similar to observability dashboards (metrics over time, traces, filtering) — with one key addition: Scorecard runs evaluations/auto‑scoring on sampled traces, so you get quality metrics over time, not just system metrics.
  • Where scores show up: Inline on Traces for each scored span, and in Runs where you can analyze run‑level aggregates and trends.
  • What’s evaluated: Only spans that contain both a prompt and a completion are scored; common attribute keys are supported (openinference.*, ai.prompt/ai.response, gen_ai.*). See the instrumentation sketch below.
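If your tracing setup doesn’t already emit one of these key families, you can set the attributes yourself. Here is a minimal OpenTelemetry sketch using the ai.prompt / ai.response keys mentioned above; call_llm is a placeholder for your own LLM client, and your instrumentation library may emit the openinference.* or gen_ai.* conventions instead:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def generate_answer(prompt: str) -> str:
    # Wrap the LLM call in a span so monitors can sample and score it.
    with tracer.start_as_current_span("llm.generate") as span:
        completion = call_llm(prompt)  # placeholder for your LLM client call
        # Attach the prompt and completion with one of the supported key families.
        span.set_attribute("ai.prompt", prompt)
        span.set_attribute("ai.response", completion)
        return completion
```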
Screenshot of traces page.

Traces search page with scores created by a 'monitor'.

Create a Monitor

  1. In the project sidebar, select Monitors (the clock icon). You’ll land on the monitor overview page.
  2. Click “New Monitor +” to open the Create Monitor modal.
  3. Configure the monitor (options described below), then click Create Monitor. Scoring starts on the next cycle.
Create monitor modal screenshot.

Create monitor modal.

Inside the modal you can configure the following; a conceptual sketch of these settings follows the list.
  • Metrics – choose any evaluation metric you’ve defined (toxicity, factuality, latency…).
  • Frequency – how often Scorecard samples traces (1 min, 5 min, 30 min, 1 h, 1 day).
  • Sample Rate – throttle evaluation cost (1 %–100 %).
  • Filters – home in on specific traffic via spanName, serviceName, or full-text searchText.
  • Active – toggle to pause or resume scoring without losing your configuration.
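Conceptually, a monitor’s settings amount to a small configuration object. The sketch below is illustrative only; the field names and values are assumptions, and monitors are created through the UI rather than this payload:

```python
# Illustrative only: field names are assumptions, not an actual Scorecard API payload.
monitor_config = {
    "metrics": ["toxicity", "factuality"],    # any evaluation metrics you've defined
    "frequency": "5m",                        # 1 min, 5 min, 30 min, 1 h, or 1 day
    "sample_rate": 0.25,                      # evaluate 25% of matching spans
    "filters": {
        "spanName": "llm.generate",           # hypothetical span name
        "serviceName": "checkout-assistant",  # hypothetical service name
        "searchText": "refund policy",
    },
    "active": True,                           # pause/resume without losing config
}
```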
Select metrics UI.

Monitor options – Metrics

Keyword filtering with SearchText

Use SearchText to match any keywords or phrases embedded in your traces. It searches across span and resource attributes (including prompt/response fields), so you can:
  • Track sensitive topics (e.g., “refund policy”, “PCI”, “unsafe”) as dedicated monitors
  • Isolate incident-related traffic and watch the quality recover
  • Run targeted evaluations for specific features, intents, or cohorts
This turns production monitoring into topic-level QA: you’re not just watching everything, you’re watching the parts that matter.
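Conceptually, the match is a substring search across a span’s attributes and its resource attributes. A rough sketch of the idea (not Scorecard’s actual implementation; details such as case handling are assumptions):

```python
def matches_search_text(span_attributes: dict, resource_attributes: dict, search_text: str) -> bool:
    # Rough illustration of searchText matching; the real behavior may differ
    # (e.g., case handling or tokenization).
    needle = search_text.lower()
    for value in list(span_attributes.values()) + list(resource_attributes.values()):
        if isinstance(value, str) and needle in value.lower():
            return True
    return False
```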
Sample and filter UI.

Monitor options – sample & filter.

What happens after it runs

  • Monitors sample recent AI spans using deterministic, hash‑based sampling (stable slices) and create a Run; a sketch of the sampling idea follows this list.
  • Each sampled span is scored and appears inline on the Traces page with score chips; click any row to open the full trace.
  • From a scored trace you can follow the link to the corresponding Run to see run‑level details.
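Deterministic, hash‑based sampling means the keep/skip decision is a pure function of a span’s identity and the sample rate, so repeated cycles over the same traffic select stable slices rather than random ones. A minimal sketch of the idea (not Scorecard’s actual code; hashing the trace ID is an assumption):

```python
import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    # Deterministic: the same trace_id always yields the same decision
    # for a given sample_rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```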

Where to view results

  • Traces: Browse scored spans, filter by keywords with SearchText, and jump into details for debugging.
  • Runs: See run history and performance over time, plus per‑run aggregates and plots on the run details pages.
Monitor results – traces and score UI.

Monitor results – Scores.

Manage monitors

  • Edit a monitor to change metrics, sampling, or filters, or to toggle Active.
  • Delete a monitor to stop processing entirely.
Screenshot of monitor overview list.

Monitor overview list.

Edit monitor modal screenshot.

Edit monitor modal.

Use cases

  • Production monitoring of LLM quality and safety
  • Auto-scoring on real user traffic
  • Tracking model/prompt health over time