

Records page with filtering and history chart.
What is a Record?
A Record is an individual test execution within a run. Each record contains the following, sketched as a data structure after this list:
- Inputs: The data sent to your AI system
- Outputs: The response generated by your system
- Expected (Labels): Ground truth or ideal responses for comparison
- Scores: Evaluation results from each metric
- Status: Whether scoring is pending, completed, or errored
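As a mental model, a record can be pictured as a small structured object. This is a minimal sketch only; the class and field names are assumptions made for illustration, not the platform's documented schema.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Record:
    """Illustrative shape of a record. Field names are assumptions
    for this sketch, not the platform's actual schema."""
    id: str
    inputs: dict[str, Any]           # data sent to your AI system
    outputs: dict[str, Any]          # response generated by your system
    expected: dict[str, Any] | None  # ground truth / labels, if available
    scores: dict[str, float | str]   # one evaluation result per metric
    status: str                      # "pending", "completed", or "errored"
```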
Customizing the Table
Click Edit Table to customize which columns appear and their order. You can add, remove, and reorder columns including:
- Base columns: ID, Created By, Created At
- Data fields: Inputs, Outputs, Expected
- Source: How the record was created (API, Playground, Kickoff, Trace)
- Metrics: Score columns for each metric in your project


Edit Table to customize columns and their order.
History Chart
The interactive histogram shows record distribution over time. Click any bar to filter records to that time period.

Bulk Re-scoring
Select multiple records using the checkboxes, then click Re-score to re-evaluate them with your metrics (a scripted alternative is sketched after this list). This is useful when:
- You’ve updated a metric’s guidelines
- You want to apply new metrics to existing records
- You need to re-evaluate after fixing a configuration issue
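If the platform exposes an API, re-scoring can also be scripted. The sketch below is purely illustrative: the base URL, the /records/rescore route, and the payload shape are assumptions, not a documented interface; check your platform's API reference for the real schema.

```python
import os

import requests

# Hypothetical endpoint and payload: consult the platform's API
# reference before using anything like this.
API_BASE = os.environ.get("EVAL_API_BASE", "https://api.example.com/v1")
API_KEY = os.environ["EVAL_API_KEY"]

def rescore_records(record_ids: list[str]) -> None:
    """Request re-evaluation of the given records with current metrics."""
    resp = requests.post(
        f"{API_BASE}/records/rescore",        # assumed route
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"record_ids": record_ids},      # assumed payload shape
        timeout=30,
    )
    resp.raise_for_status()

rescore_records(["rec_123", "rec_456"])
```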
Record Details
Click any record to view its full details. The details view differs based on how the record was created:

Testcase-Based Records
Records created from testsets show:
- Scores: Pass/fail status, reasoning, and metric properties for each evaluation
- Test Record Details: Input fields, expected outputs, and actual outputs


Testcase-based record showing scores and test details.
Trace-Based Records
Records created from production traces show the following, sketched as data structures below:
- Trace Overview: Duration, estimated cost, total tokens, and span count
- Spans: Individual LLM calls with timing and cost breakdown
- Model Usage: Which models were called and token counts


Trace-based record showing spans and trace overview.
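The trace information attached to such a record might be modeled roughly like this. The field names come from the panels listed above, but the structure itself is an assumption for illustration, not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One LLM call within a trace (illustrative field names)."""
    name: str
    model: str            # which model was called
    duration_ms: float    # timing for this call
    input_tokens: int
    output_tokens: int
    cost_usd: float       # cost breakdown for this call

@dataclass
class TraceOverview:
    """Summary stats shown in the Trace Overview panel (assumed shape)."""
    duration_ms: float
    estimated_cost_usd: float
    total_tokens: int
    span_count: int
```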
Use Cases
- Cross-run analysis: Find patterns across multiple evaluation runs
- Debugging failures: Filter by `metric.status:fail` to investigate failing records (a scripted example follows this list)
- Quality review: Review records from specific time periods or sources
- Metric iteration: Re-score records after updating metric guidelines
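If the same filter syntax the UI accepts is also available through an API query parameter (an assumption; only the UI filter bar is documented here), a scripted failure review might look like this. The route, parameter names, and response shape are all hypothetical.

```python
import os

import requests

API_BASE = os.environ.get("EVAL_API_BASE", "https://api.example.com/v1")
API_KEY = os.environ["EVAL_API_KEY"]

# Hypothetical: pass the same filter string the UI accepts
# (e.g. "metric.status:fail") as a query parameter.
resp = requests.get(
    f"{API_BASE}/records",                     # assumed route
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"filter": "metric.status:fail", "limit": 50},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json().get("records", []):  # assumed response shape
    print(record.get("id"), record.get("status"))
```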