# Metrics

Metrics serve as benchmarks for assessing the quality of LLM responses, while scoring is the process of applying these metrics to generate actionable insights about your LLM application.

When evaluating LLM applications, you start with a "vibe check": manually testing prompts to see if responses make sense. As your application evolves, you need a systematic way to quantify and consistently measure response quality. Metrics provide this standard of measurement for LLM evaluation.

## Assess Your LLM Quality With Scorecard's Metrics

Scorecard's metric system is organized into three main components:

* **Metrics**: Individual evaluation criteria that assess specific aspects of your LLM's performance
* **Metric Groups**: Collections of related metrics for comprehensive evaluation
* **Templates**: Pre-built metric configurations you can copy and customize

Scorecard assists you in several ways in defining and organizing metrics for LLM evaluation.

### Metric Templates

Our Scorecard Core Metrics represent industry-standard benchmarks for LLM performance, validated by our team of evaluation experts. Explore these templates on the Metrics page under "Templates". Copy any template and customize it for your specific use case.

#### MLflow-Inspired Metrics

Scorecard provides support for MLflow-style metrics with additional capabilities like aggregation, A/B comparison, and iteration.

Available metrics include:

* **Relevance**: Evaluates how well the response aligns with the input and context
* **Answer Relevance**: Assesses relevance and applicability to the specific query
* **Faithfulness**: Measures factual consistency with provided context
* **Answer Correctness**: Evaluates accuracy against ground truth
* **Answer Similarity**: Assesses semantic similarity to expected responses

View full prompts on the Metrics page or in the [MLflow GitHub repository](https://github.com/mlflow/mlflow/blob/master/mlflow/metrics/genai/prompts/v1.py).

#### RAGAS-Inspired Metrics for RAG Pipelines

Scorecard provides RAGAS-style metrics for evaluating Retrieval Augmented Generation (RAG) systems:

**Component-Wise:**

* **Faithfulness**: Factual consistency with context
* **Answer Relevancy**: Pertinence to the prompt
* **Context Recall**: Retrieved context alignment with expected response
* **Context Precision**: Ranking of ground-truth relevant items
* **Context Relevancy**: Relevance of retrieved context to query

**End-to-End:**

* **Answer Semantic Similarity**: Semantic resemblance to ground truth
* **Answer Correctness**: Accuracy compared to ground truth

View full prompts on the Metrics page or in the [RAGAS GitHub repository](https://github.com/explodinggradients/ragas/tree/main/src/ragas/metrics).

Templates for common MLflow and RAGAS metrics are available under Templates. Copy a template (e.g., Relevance, Faithfulness) and tailor the guidelines to your domain.

### Define Custom Metrics for Your LLM Use Case

On the Metrics page, click "+ New Metric" to create a custom metric.

Configure:

* **Metric Name**: Human-readable name
* **Metric Guidelines**: Natural language instructions for evaluation
* **Evaluation Type**:
  * **AI**: Uses Metric Guidelines as prompt for an AI model
  * **Human**: Manual evaluation by subject-matter experts
  * **Heuristic (Code)**: Custom Python or TypeScript logic (see [Heuristic Metrics](#heuristic-metrics))
* **Output Type**: Boolean, Integer (1-5), or Float (0.0-1.0)

### Output Types

#### Boolean Output Type

* **Range**: true/false
* **Pass/Fail**: Direct pass/fail representation
* **Use cases**: Format checks, refusals, policy/guardrails, presence/absence
* **Aggregation**: Pass ratios and counts

Example model output for a boolean metric:

```json
{
  "reasoning": "The model correctly refused unsafe content.",
  "binaryScore": true
}
```

#### Integer Output Type

* **Range**: 1–5 (higher is better)
* **Pass/Fail**: Set a `passingThreshold` (e.g., 4). Passes if `intScore ≥ threshold`
* **Use cases**: Rubric-based quality judgments (helpfulness, factuality, completeness)
* **Aggregation**: Means and distributions

Example model output for an integer metric:

```json
{
  "reasoning": "Mostly correct but missing minor details.",
  "intScore": 4
}
```

#### Float Output Type

Normalized score between 0.0 and 1.0 for graded, continuous measurement.

* **Range**: 0.0–1.0 (higher is better)
* **Pass/Fail**: Set a `passingThreshold` (e.g., 0.90). Passes if `floatScore ≥ threshold`
* **Use cases**: Semantic similarity, confidence/uncertainty, coverage, quality scores
* **Aggregation**: Means for trend tracking

Example model output for a float metric:

```json
{
  "reasoning": "The response is mostly correct with minor omissions.",
  "floatScore": 0.87
}
```

## Automated Scoring

As testsets grow, manual evaluation becomes time-consuming. Subject-matter experts (SMEs) can become bottlenecks as the number of testcases increases.

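
Larger testsets also mean reading results in aggregate rather than one score at a time. As a minimal sketch (not Scorecard's implementation), the pass/fail and aggregation rules listed under Output Types could be expressed in Python roughly as follows. The field names `binaryScore`, `intScore`, and `floatScore` and the idea of a passing threshold come from the examples above; the helpers `passes` and `summarize` are hypothetical names introduced only for illustration.

```python
from statistics import mean
from typing import Optional

NUMERIC_FIELDS = ("intScore", "floatScore")


def passes(result: dict, passing_threshold: Optional[float] = None) -> bool:
    """Apply the pass/fail rule to one scored result (hypothetical helper)."""
    if "binaryScore" in result:
        # Boolean metrics map directly to pass/fail.
        return bool(result["binaryScore"])
    for key in NUMERIC_FIELDS:
        if key in result:
            if passing_threshold is None:
                raise ValueError(f"{key} requires a passing threshold")
            # Integer and float metrics pass when score >= threshold.
            return result[key] >= passing_threshold
    raise ValueError("result has no recognized score field")


def summarize(results: list, passing_threshold: Optional[float] = None) -> dict:
    """Aggregate the results of one metric: pass count and ratio, plus mean if numeric."""
    if not results:
        raise ValueError("no results to summarize")
    pass_count = sum(passes(r, passing_threshold) for r in results)
    summary = {"pass_count": pass_count, "pass_ratio": pass_count / len(results)}
    numeric = [r[key] for r in results for key in NUMERIC_FIELDS if key in r]
    if numeric:
        summary["mean"] = mean(numeric)
    return summary


# Illustrative data only: four integer scores judged against a threshold of 4.
int_results = [{"intScore": 4}, {"intScore": 5}, {"intScore": 3}, {"intScore": 4}]
print(summarize(int_results, passing_threshold=4))
```

Boolean results need no threshold, so `summarize([{"binaryScore": True}, {"binaryScore": False}])` reports only the pass count and pass ratio.
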
Scorecard's AI-powered scoring automates evaluation, allowing SMEs to focus on complex edge cases while the system handles well-defined metrics.

### AI-Based Metrics

When creating a metric, select "AI" as the evaluation type for automated scoring.

#### Configuration Modes

* **Basic Mode**: Define metric guidelines using natural language instructions
* **Advanced Mode**: Modify the full prompt template for sophisticated evaluation logic

Be very specific with metric guidelines, because they form the core instruction for AI scoring.

#### Evaluation Settings

Configure AI evaluation parameters:

* **Model Selection**: Choose models like GPT-4o
* **Temperature**: Control randomness (typically 0 for consistency)
* **Additional Parameters**: Model-specific settings

### Automatic Scoring

When you score records (from the testset page, the Records page, or through GitHub Actions), AI and Heuristic metrics are evaluated automatically. You can view scoring progress and results on the Records page.

New to Scorecard? See the [UI Quickstart](/intro/ui-quickstart) to learn how to score your first records.

If needed, you can score again by selecting records and clicking "Score Records".

#### Score Explanations

Scorecard provides explanations for each score, helping you understand and validate the AI's evaluation reasoning.

### Heuristic Metrics (Code-Based Evaluation)

Write custom evaluation logic in Python or TypeScript for deterministic, rule-based scoring:

* Checking for specific keywords or patterns
* Validating JSON structure or format
* Computing text similarity scores
* Measuring response length or complexity
* Custom business logic checks

#### Creating a Heuristic Metric

Select **Heuristic** as the evaluation type to access the code editor.

#### Writing Heuristic Code

Your code receives the record data and must return a score. Here's the structure:

```python
def score(inputs: dict, outputs: dict, expected: dict) -> dict:
    """
    Evaluate the output and return a score.

    Args:
        inputs: The input fields from the testcase
        outputs: The generated output from your system
        expected: The expected output fields from the testcase

    Returns:
        A dict with 'score' (bool, int 1-5, or float 0.0-1.0)
        and optional 'reasoning'
    """
    # Example: Check if output contains expected keyword
    output_text = outputs.get("response", "")
    expected_keyword = expected.get("keyword", "")

    if expected_keyword.lower() in output_text.lower():
        return {"score": True, "reasoning": "Keyword found in response"}
    else:
        return {"score": False, "reasoning": "Keyword not found"}
```

```typescript
function score(
  inputs: Record<string, any>,
  outputs: Record<string, any>,
  expected: Record<string, any>
): { score: boolean | number; reasoning?: string } {
  // Example: Check response length
  const response = outputs.response || "";
  const minLength = expected.minLength || 100;

  if (response.length >= minLength) {
    return {
      score: true,
      reasoning: `Response meets minimum length of ${minLength}`
    };
  } else {
    return {
      score: false,
      reasoning: `Response too short: ${response.length} < ${minLength}`
    };
  }
}
```

#### Secure Sandbox Execution

Code runs in a secure, isolated sandbox ensuring:

* **Safety**: No external resource access
* **Consistency**: Deterministic results
* **Performance**: Optimized execution

The sandbox has limited access to external libraries. Common utilities like string manipulation and JSON parsing are available. Contact support if you need specific libraries for your use case.

### Metric Groups

Group related metrics for consistent, repeatable evaluation. Metric Groups let you apply multiple metrics when scoring without manual selection each time. Create groups for specific use cases like RAG applications or translation tasks.

View and create Metric Groups on the Metrics page under "Metric Groups". Provide:

* **Name**: Human-readable identifier
* **Description**: Short summary of the group's purpose

Click "Select Metrics" to choose which metrics to include in the group.
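
To make the idea of a metric group more concrete, here is a minimal local sketch in Python that applies several heuristic metrics to one record in a single call. It reuses the `score(inputs, outputs, expected)` signature shown under Writing Heuristic Code. The `MetricGroup` class, the two example metrics, and the record fields `response` and `keyword` are hypothetical illustrations, not Scorecard's API, and the code runs locally rather than in the sandbox.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict

# A heuristic metric: (inputs, outputs, expected) -> {"score": ..., "reasoning": ...}
ScoreFn = Callable[[dict, dict, dict], dict]


def contains_keyword(inputs: dict, outputs: dict, expected: dict) -> dict:
    """Boolean metric: does the response mention the expected keyword?"""
    keyword = expected.get("keyword", "")
    found = keyword.lower() in outputs.get("response", "").lower()
    reasoning = "Keyword found in response" if found else "Keyword not found"
    return {"score": found, "reasoning": reasoning}


def is_valid_json(inputs: dict, outputs: dict, expected: dict) -> dict:
    """Boolean metric: can the response be parsed as JSON?"""
    try:
        json.loads(outputs.get("response", ""))
    except json.JSONDecodeError as exc:
        return {"score": False, "reasoning": f"Invalid JSON: {exc.msg}"}
    return {"score": True, "reasoning": "Response is valid JSON"}


@dataclass
class MetricGroup:
    """Hypothetical stand-in for a metric group: a name, a description, and metrics."""
    name: str
    description: str
    metrics: Dict[str, ScoreFn] = field(default_factory=dict)

    def score_record(self, inputs: dict, outputs: dict, expected: dict) -> Dict[str, dict]:
        # Apply every metric in the group to the same record.
        return {
            metric_name: metric(inputs, outputs, expected)
            for metric_name, metric in self.metrics.items()
        }


group = MetricGroup(
    name="Structured output checks",
    description="Keyword presence and JSON validity",
    metrics={"contains_keyword": contains_keyword, "is_valid_json": is_valid_json},
)

results = group.score_record(
    inputs={"question": "What is the capital of France?"},
    outputs={"response": '{"answer": "Paris"}'},
    expected={"keyword": "paris"},
)
for metric_name, result in results.items():
    print(metric_name, result["score"], result["reasoning"])
```

Keeping each metric as a plain function with the same signature means a group is just a named mapping, so the same set of checks can be reapplied to every record without reselecting metrics each time.
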