Developing an effective moderation policy with LLMs is rarely a one-time task. It's an ongoing cycle of creation, testing, analysis, and refinement. Content trends on your platform evolve, platform rules may change, and you'll gain deeper insights into how your LLM-powered policy performs on real data over time.

Recommended Cycle:

  1. Develop Initial Policy: Draft your policy categories and prompt, then enter them into PolicyAI.
  2. Prepare & Upload Golden Set: Create or update your golden set of human-labeled data (a format sketch follows this list).
  3. Test: Run tests in PolicyAI using your policy and golden set.
  4. Analyze Results: Review accuracy and misclassified examples to identify areas for improvement.
  5. Refine: Adjust your policy text, examples, prompt, or category ordering based on your analysis.
  6. Re-test: Validate the changes by testing the updated policy against your golden set.
  7. Monitor & Adapt: Continuously monitor the performance of your live policy and be prepared to repeat the cycle as needed to address new content types, evolving requirements, or model updates.
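
If you're building a golden set from scratch, the sketch below shows one way it might be laid out: a simple table pairing each piece of content with its human-assigned label. The exact file format and column names PolicyAI expects may differ; treat the fields here as placeholders.

```python
import csv

# Hypothetical golden-set rows: each example pairs a piece of content with the
# human label your policy should reproduce. Field names are illustrative, not
# the exact columns PolicyAI expects.
golden_set = [
    {"content": "I'm going to find you and hurt you.", "expected_label": "violation"},
    {"content": "This boss fight makes me want to scream.", "expected_label": "non_violation"},
    {"content": "Watch your back at school tomorrow.", "expected_label": "violation"},
]

with open("golden_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["content", "expected_label"])
    writer.writeheader()
    writer.writerows(golden_set)
```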

Testing data (false positives, false negatives, confusion patterns) and reviews of specific misclassified examples are your primary signals for where policy refinement is needed.
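
PolicyAI surfaces these metrics for you, but it can help to see what the analysis boils down to: comparing each model decision with its golden label and counting the disagreements. The sketch below is a rough illustration with made-up labels, not a description of PolicyAI's internals.

```python
from collections import Counter

# Hypothetical (human label, model decision) pairs taken from a test run.
results = [
    ("violation", "violation"),
    ("violation", "non_violation"),      # false negative
    ("non_violation", "violation"),      # false positive
    ("non_violation", "non_violation"),
]

confusion = Counter(results)
accuracy = sum(count for (expected, predicted), count in confusion.items()
               if expected == predicted) / len(results)
false_negatives = confusion[("violation", "non_violation")]
false_positives = confusion[("non_violation", "violation")]

print(f"accuracy: {accuracy:.2f}")
print(f"false positives: {false_positives}, false negatives: {false_negatives}")
```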

Looking at variable results

Because LLMs can sometimes produce different outputs for the exact same input when the policy definition is not perfectly clear or the content is borderline, it can be helpful to run the same examples multiple times and see how consistent the results are.
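
Musubi automates the repeated runs for you (see the workflow below), but the underlying idea is easy to sketch: classify the same content several times and measure how often the runs agree. The classify function here is a stand-in that simulates a borderline example, not a real policy call.

```python
import random
from collections import Counter

def classify(content: str) -> str:
    # Stand-in for a call to your LLM-backed policy. Here it simulates a
    # borderline example that flips between two decisions.
    return random.choice(["violation", "non_violation"])

def consistency(content: str, runs: int = 10) -> float:
    """Fraction of runs that agree with the most common decision."""
    decisions = Counter(classify(content) for _ in range(runs))
    return decisions.most_common(1)[0][1] / runs

print(consistency("Watch your back at school tomorrow."))
```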

Musubi has a unique workflow that lets you turn these variable results to your advantage and strengthen your policies.

After you run your policy against a dataset, scroll up to the first “dataset preview & selection” section and choose which examples you want to test further.

From there, under the “preview selected rows” section, you can choose how many times you want to run those examples through the policy. We recommend running them 5-10 times.

[Screenshot: choosing the number of runs under “preview selected rows”]

After you click “Run policy”, you will see how the LLM decided each time. In this example, the response was the same each time, but the severity was different.

[Screenshot: repeated runs of one example showing the same decision but different severities]

In other cases, the decision itself may differ from run to run, which is a strong signal that the policy needs to be defined more clearly or that more examples need to be added.

The “reason” for the decision (found by scrolling to the right of the table above, or by downloading the CSV) can be really helpful here, as slight differences in reasoning can sometimes highlight gaps in the policy.

[Screenshot: the “reason” column showing varying explanations across runs]

Above, you can see that although each decision was correct, the reasoning isn’t consistent, which may indicate a need to define more precisely in the policy what is and is not a credible threat.
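
If you download the CSV, a small script can also surface the examples whose decisions or reasons vary across runs. The file and column names in this sketch (“content”, “decision”, “reason”) are assumptions about the export format rather than guaranteed field names.

```python
import csv
from collections import defaultdict

# Group each example's decisions and reasons from the exported CSV.
# The file name and column names ("content", "decision", "reason") are
# assumptions about the export format, not guaranteed to match it exactly.
runs_by_example = defaultdict(list)
with open("policy_run_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        runs_by_example[row["content"]].append((row["decision"], row["reason"]))

for content, runs in runs_by_example.items():
    decisions = {decision for decision, _ in runs}
    reasons = {reason for _, reason in runs}
    if len(decisions) > 1 or len(reasons) > 1:
        print(f"Inconsistent runs for: {content!r}")
        for decision, reason in runs:
            print(f"  {decision}: {reason}")
```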