PolicyAI provides tools to evaluate your policy against your golden set (or any individual piece of content or CSV). This allows you to test your policy, iterate on it, and get to a place where you feel comfortable going live.
https://www.loom.com/share/6620d9e5d28a48f298ccff9eb6438850
Initiating a Test Run:
In the PolicyAI testing interface, select the policy version you want to test and the content you want to test. Start the test run.

When you upload a CSV file, you’ll see a dataset preview. From here you can choose which content from the CSV you want to test. If you don’t select any, then the entire CSV will run.
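A golden-set CSV typically pairs each piece of content with its human-assigned label. The column names below are only illustrative (the exact schema PolicyAI expects is not shown here); use the dataset preview to confirm how your columns are mapped.

```csv
content,golden_label
"This product changed my life!",Safe
"Click here to claim your free prize",Unsafe
```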

PolicyAI will process your content using the selected policy and model, then compare the LLM's classifications against the human-assigned golden labels (if they’re included in a CSV). Results will include:
Duration Stats: How long the AI model took to make a decision.
Total correct: How many decisions matched the golden label, as a binary Safe/Unsafe (or Safe/Other) comparison.
Overall information, including:
The above stats broken down by expected category (based on the labels in your CSV). Selecting the checkbox next to a category filters your Labeled Examples to show only that category.
Labeled Examples: All examples with their golden-set label, the LLM label, severity, and a reason, so you can review them and understand why the model got some wrong.
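PolicyAI computes these stats for you, but the comparison itself is simple to reproduce offline. A minimal sketch, assuming hypothetical column names ("golden_label", "llm_label") for rows from a results or golden-set CSV:

```python
from collections import defaultdict

def score_results(rows):
    """Compare LLM labels against golden labels: overall and per-category accuracy.

    Column names here are hypothetical; match them to your own CSV.
    """
    total = correct = 0
    by_category = defaultdict(lambda: [0, 0])  # expected category -> [correct, seen]
    for row in rows:
        golden, predicted = row["golden_label"], row["llm_label"]
        match = golden == predicted
        total += 1
        correct += match
        by_category[golden][0] += match
        by_category[golden][1] += 1
    return correct / total, {cat: hit / seen for cat, (hit, seen) in by_category.items()}

rows = [
    {"golden_label": "Safe", "llm_label": "Safe"},
    {"golden_label": "Unsafe", "llm_label": "Safe"},    # a miss
    {"golden_label": "Unsafe", "llm_label": "Unsafe"},
]
overall, per_category = score_results(rows)
print(overall)       # 2 of 3 correct
print(per_category)  # accuracy per expected category
```

Breaking accuracy down by expected category, as the UI does, is what lets you spot whether misses cluster in one label rather than being spread evenly.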

You can sort the results by the “is correct” column to bring the incorrect results to the top.
You can also click the “download as CSV” button to save the results for future reference or review.
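The downloaded CSV can also be reviewed offline. A minimal sketch: the “is correct” column name comes from the UI above, but the other column names and the exported TRUE/FALSE values are assumptions — check your own export.

```python
import csv
import io

# Stand-in for the downloaded results file; columns other than
# "is correct" are hypothetical.
sample = io.StringIO(
    "content,golden_label,llm_label,is correct\n"
    "This product changed my life!,Safe,Safe,TRUE\n"
    "Click here to claim your free prize,Unsafe,Safe,FALSE\n"
)

# Keep only the rows the model got wrong.
incorrect = [row for row in csv.DictReader(sample) if row["is correct"] == "FALSE"]
for row in incorrect:
    print(f'{row["golden_label"]} -> {row["llm_label"]}: {row["content"]}')
```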
We built diagnostic tools into PolicyAI to help you pinpoint issues quickly.
At the bottom of your Test Policies page, after running a test, select the results you’d like insights into, then click “Diagnose” in the bottom left.
