Scorecard’s multi-turn simulation feature allows you to evaluate the performance of an LLM system in a multi-turn conversation. Define instructions for your simulated user in Scorecard, then run a simulation on your system using the Scorecard SDK.

Sim Agent configuration page.

Create a Sim Agent

Sim Agents are configurable AI personas that interact with your system during testing. Each agent has a prompt template, model settings, and can be versioned for reproducibility.
Go to the Sim Agents page, then click “New Sim Agent”. Fill in the instructions for the Sim Agent, then click “Save Sim Agent”.
We recommend using GPT-4.1 for more realistic simulated behavior.

Modal to create a Sim Agent.

Prompt template

Sim Agent prompts support Jinja2 templating to inject testcase inputs dynamically:
You are an angry customer talking to customer service. Product: {{product_name}}
Issue: {{customer_complaint}}
Previous interactions: {{interaction_count}}
You want to {{customer_goal}} and will not be satisfied until resolved.
Never mention you’re a simulation or help the agent.
Variables like {{product_name}} are replaced with values from each testcase’s input fields.
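For instance, a testcase whose input fields match the template variables might look like the following sketch (the field values are hypothetical):
# Hypothetical testcase input fields matching the template variables above
testcase_inputs = {
    "product_name": "Echo Dot",
    "customer_complaint": "stopped connecting to Wi-Fi after a firmware update",
    "interaction_count": 2,
    "customer_goal": "get a replacement device",
}
# Jinja2 substitutes each {{variable}}, rendering the prompt as e.g.:
# "You are an angry customer talking to customer service. Product: Echo Dot ..."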
See Prompts for information on referencing testcase inputs and using Jinja in your Sim Agent prompts.

Run a simulation

Multi-turn simulations execute conversations between your system and Sim Agents, capturing the full interaction for evaluation. Simulation runs are kicked off by calling multi_turn_simulation() from the Scorecard Python SDK.

System function

The system parameter is your code under test. It must be a callable that handles conversation turns. Function signature:
def your_system(
    # Complete conversation history
    chat_history: list[ChatMessage],
    # Input fields from the current testcase
    testcase_inputs: dict[str, Any],
) -> Iterable[str | ChatMessage]:
    # Returns the new assistant messages as strings (or ChatMessage objects).
    # This will likely be a list containing a single string message.
    ...
Here’s an example of an LLM system under test:
Example system function
from openai import OpenAI
from scorecard_ai.lib import ChatMessage

system_prompt = """
You are a customer support agent for Amazon. Help the customer and remain polite and very concise. Try to figure out what the customer's needs are first, then continue by providing information or links to actions as appropriate following realistic Amazon guidelines.
"""

def customer_service_system(
    chat_history: list[ChatMessage],
    testcase_inputs: dict,
) -> list[str]:
    client = OpenAI()
    system_response = client.chat.completions.create(
        model="gpt-4.1",
        messages=chat_history,
    ).choices[0].message.content
    # Return a list containing the response content
    return [system_response]

    # Alternatively, return ChatMessage objects for more control.
    # This is useful if you want to track tool calls, which will be ignored by the simulated user agent.
    # return [ChatMessage(role="assistant", content=system_response)]

Initial messages

The initial_messages parameter seeds the conversation before simulation begins. It can be:
  • A list of ChatMessage objects (used for all testcases)
  • A function that takes testcase inputs and returns messages (for dynamic initialization; see the sketch after the example below)
  • Omitted (starts with an empty conversation)
# Option 1: Static initial messages for all testcases
initial_messages: list[ChatMessage] = [
    # System messages are ignored by the simulated user agent
    ChatMessage(role="system", content=system_prompt),
    # Pre-seed with an assistant greeting
    ChatMessage(
        role="assistant",
        content="Hello, how can I help you today?",
    ),
]
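If the opening messages need to vary per testcase, pass a function instead. A minimal sketch, assuming a hypothetical customer_name input field:
# Option 2: Build initial messages dynamically from each testcase's inputs
def get_initial_messages(testcase_inputs: dict) -> list[ChatMessage]:
    return [
        ChatMessage(role="system", content=system_prompt),
        ChatMessage(
            role="assistant",
            # "customer_name" is a hypothetical input field
            content=f"Hello {testcase_inputs['customer_name']}, how can I help you today?",
        ),
    ]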

Running the simulation

Use multi_turn_simulation() to run the simulation across all testcases in a testset:
from scorecard_ai import Scorecard
from scorecard_ai.lib import ChatMessage, StopChecks, multi_turn_simulation

scorecard = Scorecard()

simulation_run = multi_turn_simulation(
    client=scorecard,
    project_id=PROJECT_ID, # e.g. "123"
    metric_ids=METRIC_IDS,  # e.g. ["123", "456"]
    testset_id=TESTSET_ID, # e.g. "456"
    sim_agent_id=SIM_AGENT_ID, # e.g. "abcdefgh-1234-5678-90ab-cdefgh01"
    system=customer_service_system,
    initial_messages=initial_messages,  # Or use get_initial_messages for dynamic initialization
    stop_check=StopChecks.max_turns(5),  # Optional: control the conversation stopping condition
    start_with_system=False,  # Optional: explicitly control who starts the conversation
)
The simulation automatically determines who starts the conversation based on initial_messages. If the last message is from the user, the system responds first. Use start_with_system to override this behavior.
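For instance (a minimal sketch with illustrative content), seeding the conversation with a trailing user message makes the system under test respond first by default:
# The last message is from the user, so the system under test responds first by default
initial_messages = [
    ChatMessage(role="system", content=system_prompt),
    ChatMessage(role="user", content="Hi, my package never arrived."),
]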

Stop checks

Stop checks control when conversations end. They are functions that receive a ConversationInfo object and return True to stop the simulation. By default, the simulation stops after 5 turns, where a turn corresponds to one call to your system function.
To prevent accidental infinite loops, simulations have a hard limit of 50 turns. If this limit is reached, the simulation ends automatically regardless of the stop check.

Built-in stop checks

Scorecard provides a few heuristic stop checks:
  • Max turns: stop after n conversation turns. Example: StopChecks.max_turns(10)
  • Content: stop when any of the given phrases appears (case-insensitive). Example: StopChecks.content(["goodbye", "thank you"])
  • Max time: stop after the elapsed time in seconds. Example: StopChecks.max_time(30.0)
Combine stop checks: you can create complex stopping conditions by combining stop checks using StopChecks.any() and StopChecks.all(). For example, StopChecks.any([StopChecks.max_turns(5), StopChecks.max_time(10)]) will end the simulation after at most 5 turns or after at most 10 seconds.
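As a sketch, a combined condition is passed directly as the stop_check argument; the phrases and thresholds below are illustrative:
# Stop when the simulated user says goodbye, or after 10 turns or 60 seconds, whichever comes first
stop_check = StopChecks.any([
    StopChecks.content(["goodbye"]),
    StopChecks.max_turns(10),
    StopChecks.max_time(60.0),
])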

Custom stop checks

For more advanced use cases, you can also define your own stop check. For example, this stop check ends the simulation when the user is satisfied with the conversation.
Custom stop check for user satisfaction
from openai import OpenAI
from scorecard_ai.lib import ConversationInfo

def stop_check_user_is_satisfied(conversation_info: ConversationInfo) -> bool:
    if conversation_info["turn_count"] < 1:
        return False
    last_message = conversation_info["messages"][-1]
    if last_message["role"] != "user":
        return False
    # Evaluate the user's satisfaction with the conversation
    client = OpenAI()
    response = client.responses.create(
        model="gpt-4.1-mini",
        instructions="You are given a conversation between a customer and an Amazon customer service agent. Determine if the customer is satisfied with the conversation. Say 'yes' if the conversation is over and the customer is satisfied, 'no' otherwise.",
        input=last_message["content"],
    )
    return "yes" in response.output_text.lower()
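Pass the custom check as the stop_check argument of multi_turn_simulation(). Since built-in and custom stop checks share the same function signature, it should also be possible to combine them, for example StopChecks.any([stop_check_user_is_satisfied, StopChecks.max_turns(10)]) to keep an upper bound on conversation length.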

Viewing simulation results

Calling multi_turn_simulation() creates a Scorecard Run. To view simulation results, go to the Run’s details page. Then, click on a Record ID to go to the record details page. The conversation history between the Sim Agent (“User”) and the system under test (“Assistant”) will be shown.

Conversation chat history in the record details page.

Common patterns