By @Alice Hunsberger

Working effectively with LLMs means understanding what they are good at and where their limitations lie. This helps set realistic expectations and diagnose issues.

How LLMs work

LLMs are complex statistical models that predict the most likely output based on patterns learned from training data. They don't "understand" in a human sense. When given your policy and a piece of content, they select the category that is statistically most probable for that input.

Because language is often ambiguous (especially in moderation contexts) and LLM outputs are probabilistic, the exact same input can sometimes produce different results when the policy definition is not perfectly clear or the content is borderline. This variability is a normal characteristic of LLMs, not necessarily a model error; it highlights areas where your policy needs greater precision.
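
One way to surface this variability on borderline content is to run the same input through your classifier several times and compare the labels. The sketch below is illustrative only: the classify(policy, content) helper is a hypothetical stand-in for whatever LLM call you actually use, not part of PolicyAI.

```python
from collections import Counter


def classify(policy: str, content: str) -> str:
    """Hypothetical stand-in for your actual LLM/classifier call.

    Replace the body with a real call; this sketch only shows the shape
    of a consistency check."""
    raise NotImplementedError


def consistency_check(policy: str, content: str, runs: int = 5) -> Counter:
    """Classify the same content several times and tally the labels.

    A split vote usually points at ambiguous policy language or genuinely
    borderline content rather than a model bug."""
    return Counter(classify(policy, content) for _ in range(runs))


# Example (hypothetical): a 3/2 split between "allowed" and "violating"
# suggests the relevant policy definition needs tightening.
# print(consistency_check(policy_text, "some borderline post").most_common())
```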

You can “teach” LLMs by giving them better structure, boundaries, context, and information. This is done through prompting (in our case, through writing a policy, adding examples, and giving “AI moderator context”).
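
In practice, this "teaching" happens when the prompt is assembled. The sketch below is a minimal illustration of combining a policy, labeled examples, and moderator context into one classification prompt; the section names and layout are assumptions for illustration, not PolicyAI's actual format.

```python
def build_moderation_prompt(
    policy: str,
    examples: list[tuple[str, str]],
    moderator_context: str,
    content: str,
) -> str:
    """Assemble one classification prompt from the pieces described above:
    policy text, labeled examples, and extra context for the AI moderator.

    The headings and ordering here are illustrative only."""
    example_lines = "\n".join(
        f'Content: "{text}" -> Label: {label}' for text, label in examples
    )
    return (
        f"POLICY:\n{policy}\n\n"
        f"CONTEXT FOR THE AI MODERATOR:\n{moderator_context}\n\n"
        f"LABELED EXAMPLES:\n{example_lines}\n\n"
        f"CONTENT TO CLASSIFY:\n{content}\n\n"
        "Respond with exactly one label defined in the policy."
    )
```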

The importance of structure

Unlike humans, LLMs need structure to function reliably. How you structure and order policies is as important as the policy text itself: a well-structured, logically formatted policy will perform much better than raw, human-readable prose. Put real time and effort into structure, logic, and formatting when writing policies for LLMs to read.
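
As a rough sketch of what that restructuring can look like (the numbering and ALLOWED/VIOLATING labels are illustrative, not a required PolicyAI format), explicit, ordered rules give the model clear boundaries to match against instead of asking it to infer them from prose:

```python
# A rough sketch of restructuring policy prose into ordered, explicit rules.
# The numbering and ALLOWED/VIOLATING labels are illustrative only.
STRUCTURED_POLICY = """\
POLICY: Harassment

1. VIOLATING - Direct insults or slurs aimed at another user.
2. VIOLATING - Encouraging others to pile onto or target a specific user.
3. ALLOWED - Criticism of ideas, products, or public figures that does not
   attack a private individual.
4. ALLOWED - Profanity that is not directed at any person.
"""
```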

The importance of complete data

LLMs also rely heavily on the explicit instructions and definitions provided in your prompt and policy text. They cannot reliably infer intent, access real-world context outside the provided text, or apply external knowledge like local laws unless that information is clearly included or defined in the prompt.

PolicyAI processes each piece of content independently. LLMs do not remember previous content they've classified for the same user or understand concepts like "repeated violations" unless that history or context is explicitly provided as part of the input text for each piece of content.
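
Because each item is classified in isolation, any signal you want the model to weigh (account history, prior strikes, local legal context) has to travel with that item. A minimal sketch, using hypothetical field names and data:

```python
def build_item_input(
    content: str,
    user_history: list[str],
    jurisdiction_note: str,
) -> str:
    """Attach per-item context explicitly, since the model has no memory of
    earlier classifications. The field names here are assumptions."""
    history_lines = "\n".join(f"- {event}" for event in user_history) or "- none"
    return (
        f"CONTENT:\n{content}\n\n"
        f"ACCOUNT HISTORY (resupplied with every item):\n{history_lines}\n\n"
        f"JURISDICTION NOTE:\n{jurisdiction_note}"
    )


# Example usage with hypothetical data:
# item_text = build_item_input(
#     "you people never learn",
#     user_history=["2 prior harassment violations in the last 30 days"],
#     jurisdiction_note="Posted from a region with strict hate-speech law.",
# )
```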

A bias toward caution

LLMs can be inherently cautious or conservative when evaluating content, particularly in sensitive areas like adult content, violence, hate speech, or self-harm. While this caution is intended to prevent the model from generating harmful responses, it can result in over-flagging (false positives) when applied to moderation tasks, especially if your specific policy allows certain types of content within these sensitive domains.

Language constraints

Large Language Models are trained on vast amounts of text data, but that data is unevenly distributed across languages, which can limit moderation effectiveness in some of them. LLMs generally perform best in high-resource languages with abundant training data (e.g., English, Spanish, French, German, Mandarin).

Languages where LLMs might not moderate as effectively, or may require more careful tuning and testing, include: