During the development of LLM applications, it is common practice to iteratively adjust the system to find the optimal setup that produces the best results.
For example, tweaking model parameters, trying different model versions, and developing and improving prompts all affect the quality of the model’s responses. Across multiple iterations and improvements, however, it becomes difficult to accurately quantify and compare the effect of each change. Relying on gut feeling alone is rarely enough; you need a way to gain more confidence in your results.
Scorecard addresses this need by providing an A/B Comparison feature for runs. This feature allows you to easily compare different runs using the same metrics, ensuring a clear understanding of the impact of your changes.
Only runs that use the same set of evaluation metrics, or the same Scoring Config, can be compared with each other and are therefore eligible for the A/B Comparison feature.
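As a rough sketch of this eligibility rule (the `Run` shape, field names, and metric names below are invented for illustration and are not part of the Scorecard API), two runs are only comparable when they share a Scoring Config or an identical metric set:

```python
# Illustrative only: a minimal check that two runs use the same metric set
# or Scoring Config, mirroring the eligibility rule for A/B Comparison.
# The Run dataclass and metric names are hypothetical, not Scorecard's API.
from dataclasses import dataclass, field


@dataclass
class Run:
    id: str
    scoring_config_id: str
    metrics: set[str] = field(default_factory=set)


def comparable(base: Run, other: Run) -> bool:
    """Runs are comparable if they share a Scoring Config or the same metric set."""
    return (
        base.scoring_config_id == other.scoring_config_id
        or base.metrics == other.metrics
    )


run_a = Run("run-123", "config-1", {"accuracy", "relevance"})
run_b = Run("run-456", "config-1", {"accuracy", "relevance"})
print(comparable(run_a, run_b))  # True: eligible for A/B Comparison
```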
To use this feature:
1. Open the overview of results for the run you want to use as the base.
2. Select a comparison run.
After specifying a comparison run, the graphs are updated to show the aggregated results of the base run and the comparison run side by side. Investigate which run, and therefore which LLM setup, delivers better results on each metric.
A/B Comparison: Compare Metric Performance
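Conceptually, the side-by-side graphs aggregate each metric across both runs. The following is a minimal sketch of that aggregation with made-up scores and an assumed record shape; it is not Scorecard's data model or API:

```python
# Hypothetical illustration of the metric comparison: mean score per metric
# for a base run and a comparison run, reported side by side with the delta.
# The score values and dictionary structure are assumed for this sketch.
from statistics import mean

base_scores = {
    "accuracy": [0.82, 0.78, 0.90],
    "relevance": [0.70, 0.75, 0.72],
}
comparison_scores = {
    "accuracy": [0.88, 0.85, 0.91],
    "relevance": [0.68, 0.74, 0.71],
}

for metric in base_scores:
    base_avg = mean(base_scores[metric])
    comp_avg = mean(comparison_scores[metric])
    delta = comp_avg - base_avg
    print(f"{metric:<10} base={base_avg:.2f} comparison={comp_avg:.2f} delta={delta:+.2f}")
```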
The performance graphs are updated in the same way, so you can also see which run, and therefore which LLM setup, performs better on run-level characteristics such as cost or latency.
A/B Comparison: Compare Run Performance
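A run-performance comparison follows the same idea, just aggregated over cost and latency rather than metric scores. The per-test records below are invented for illustration and are not Scorecard's data model:

```python
# Hypothetical sketch of a run-performance comparison: aggregate cost and
# latency per run so the two runs can be compared side by side.
base_run = [
    {"cost_usd": 0.0021, "latency_ms": 850},
    {"cost_usd": 0.0019, "latency_ms": 910},
]
comparison_run = [
    {"cost_usd": 0.0014, "latency_ms": 620},
    {"cost_usd": 0.0016, "latency_ms": 640},
]


def summarize(records: list[dict]) -> dict:
    """Total cost and average latency for one run's test records."""
    return {
        "total_cost_usd": sum(r["cost_usd"] for r in records),
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / len(records),
    }


base, comparison = summarize(base_run), summarize(comparison_run)
print(f"cost:    base=${base['total_cost_usd']:.4f} vs comparison=${comparison['total_cost_usd']:.4f}")
print(f"latency: base={base['avg_latency_ms']:.0f}ms vs comparison={comparison['avg_latency_ms']:.0f}ms")
```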
Using A/B comparisons helps you make data-driven decisions, optimize the performance of your LLM setup, and continually improve its capabilities with confidence. Among other benefits, A/B comparisons enable you to: