PolicyAI provides tools to evaluate your policy against your golden set (or any individual piece of content or CSV). This allows you to test your policy, iterate on it, and get to a place where you feel comfortable going live.
https://www.loom.com/share/6620d9e5d28a48f298ccff9eb6438850
Initiating a Test Run:
In the PolicyAI testing interface, select the policy version you want to test and the content you want to test. Start the test run.

When you upload a CSV file, you’ll see a dataset preview. From here you can choose which content from the CSV you want to test. If you don’t select any, then the entire CSV will run.
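A golden-set CSV typically pairs each piece of content with its human-assigned label. The column names below are only illustrative (the exact schema PolicyAI expects is not shown here); use the dataset preview to confirm how your columns are mapped.

```csv
content,golden_label
"This product changed my life!",Safe
"Click here to claim your free prize",Unsafe
```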

PolicyAI will process your content using the selected policy and model, then compare the LLM's classifications against the human-assigned golden labels (if they’re included in a CSV). Results will include:
Duration Stats: How long the AI model took to make a decision.
Total correct: How many decisions matched the golden label, as a binary Safe/Unsafe (or Safe/Other) comparison.
Overall information, including:
The above stats broken down by expected category (based on the labels in your CSV). Selecting the checkbox next to a category filters your Labeled Examples to show only that category.
Labeled Examples: All examples with their golden-set label, the LLM label, severity, and a reason, so you can review them and understand why the model got some wrong.
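PolicyAI computes these stats for you, but the comparison itself is simple to reproduce offline. A minimal sketch, assuming hypothetical column names ("golden_label", "llm_label") for rows from a results or golden-set CSV:

```python
from collections import defaultdict

def score_results(rows):
    """Compare LLM labels against golden labels: overall and per-category accuracy.

    Column names here are hypothetical; match them to your own CSV.
    """
    total = correct = 0
    by_category = defaultdict(lambda: [0, 0])  # expected category -> [correct, seen]
    for row in rows:
        golden, predicted = row["golden_label"], row["llm_label"]
        match = golden == predicted
        total += 1
        correct += match
        by_category[golden][0] += match
        by_category[golden][1] += 1
    return correct / total, {cat: hit / seen for cat, (hit, seen) in by_category.items()}

rows = [
    {"golden_label": "Safe", "llm_label": "Safe"},
    {"golden_label": "Unsafe", "llm_label": "Safe"},    # a miss
    {"golden_label": "Unsafe", "llm_label": "Unsafe"},
]
overall, per_category = score_results(rows)
print(overall)       # 2 of 3 correct
print(per_category)  # accuracy per expected category
```

Breaking accuracy down by expected category, as the UI does, is what lets you spot whether misses cluster in one label rather than being spread evenly.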

You can sort the results by the “is correct” column to bring the incorrect results to the top.
You can also click the “download as CSV” button to save the results for future reference or review.
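The downloaded CSV can also be reviewed offline. A minimal sketch: the “is correct” column name comes from the UI above, but the other column names and the exported TRUE/FALSE values are assumptions — check your own export.

```python
import csv
import io

# Stand-in for the downloaded results file; columns other than
# "is correct" are hypothetical.
sample = io.StringIO(
    "content,golden_label,llm_label,is correct\n"
    "This product changed my life!,Safe,Safe,TRUE\n"
    "Click here to claim your free prize,Unsafe,Safe,FALSE\n"
)

# Keep only the rows the model got wrong.
incorrect = [row for row in csv.DictReader(sample) if row["is correct"] == "FALSE"]
for row in incorrect:
    print(f'{row["golden_label"]} -> {row["llm_label"]}: {row["content"]}')
```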
We built diagnostic tools into PolicyAI to help you pinpoint issues quickly.
At the bottom of your Test Policies page, after running a test, select the results you’d like insights into, then click “Diagnose” in the bottom left.
