PolicyAI provides tools to evaluate your policy against your golden set, an uploaded CSV, or any individual piece of content. This lets you test your policy, iterate on it, and build confidence before going live.

https://www.loom.com/share/6620d9e5d28a48f298ccff9eb6438850

Initiating a Test Run

In the PolicyAI testing interface, select the policy version and the content you want to test, then start the test run.


When you upload a CSV file, you’ll see a dataset preview. From here, you can choose which content from the CSV you want to test. If you don’t select any, the entire CSV will run.


Viewing Test Results

PolicyAI processes your content using the selected policy and model, then compares the LLM’s classifications against the human-assigned golden labels (if your CSV includes them). Results will include:

To review the incorrect results together, sort the results by the “is correct” column.

You can also click the “download as CSV” button to save the results for future reference or review.
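Once you have downloaded the results, you may want to summarize them outside of PolicyAI. The following is a minimal Python sketch of one way to do that, not an official PolicyAI tool. The column names (`is correct`, `golden label`, `model label`, `content`) and the `true`/`false` cell values are assumptions made for illustration; check the header row of your own export and adjust the constants to match.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; check the header row of your own export.
CORRECT_COLUMN = "is correct"
GOLDEN_COLUMN = "golden label"


def _is_true(value):
    """Interpret common spellings of a true value in a CSV cell."""
    return value.strip().lower() in {"true", "yes", "1"}


def summarize_results(csv_file):
    """Return (accuracy, incorrect_rows, misses_per_golden_label)."""
    rows = list(csv.DictReader(csv_file))
    incorrect = [row for row in rows if not _is_true(row[CORRECT_COLUMN])]
    accuracy = (len(rows) - len(incorrect)) / len(rows) if rows else 0.0
    # Count incorrect rows by their golden label to see where the policy struggles.
    misses = Counter(row[GOLDEN_COLUMN] for row in incorrect)
    return accuracy, incorrect, misses


# Small inline sample standing in for a downloaded results file.
sample = io.StringIO(
    "content,golden label,model label,is correct\n"
    "buy cheap pills now,spam,spam,true\n"
    "see you at lunch,ok,spam,false\n"
    "limited offer click here,spam,ok,false\n"
    "happy birthday,ok,ok,true\n"
)
accuracy, incorrect, misses = summarize_results(sample)
print(f"accuracy: {accuracy:.0%}")
print(misses.most_common())
```

To run this against a real export, pass an open file instead of the inline sample, for example `with open("results.csv", newline="") as f: summarize_results(f)`, where `results.csv` is whatever name you saved the download under.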

Interpreting Initial Results

Diagnosing Results

We built diagnostic tools into PolicyAI to help you pinpoint issues quickly.

At the bottom of your Test Policies page, after running a test, select the results you’d like insights into. From there, click “Diagnose” on the bottom left.
