By @Alice Hunsberger

Working effectively with LLMs means understanding what they are good at and where their limitations lie. This helps set realistic expectations and diagnose issues.

How LLMs work

LLMs are complex statistical models that predict the most likely output based on patterns learned from training data. They don't "understand" in a human sense. When given your policy and a piece of content, they select the category that is statistically most probable for that input.

Because language is often ambiguous (especially in moderation contexts) and LLM outputs are probabilistic, the exact same input can sometimes produce different results when the policy definition is not perfectly clear or the content is borderline. This variability is a normal characteristic of LLMs, not necessarily a model error; it highlights areas where your policy needs greater precision.
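
One way to surface this variability on borderline content is to run the same input through your classifier several times and compare the labels. The sketch below is illustrative only: the classify(policy, content) helper is a hypothetical stand-in for whatever LLM call you actually use, not part of PolicyAI.

```python
from collections import Counter


def classify(policy: str, content: str) -> str:
    """Hypothetical stand-in for your actual LLM/classifier call.

    Replace the body with a real call; this sketch only shows the shape
    of a consistency check."""
    raise NotImplementedError


def consistency_check(policy: str, content: str, runs: int = 5) -> Counter:
    """Classify the same content several times and tally the labels.

    A split vote usually points at ambiguous policy language or genuinely
    borderline content rather than a model bug."""
    return Counter(classify(policy, content) for _ in range(runs))


# Example (hypothetical): a 3/2 split between "allowed" and "violating"
# suggests the relevant policy definition needs tightening.
# print(consistency_check(policy_text, "some borderline post").most_common())
```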

You can “teach” LLMs by giving them better structure, boundaries, context, and information. This is done through prompting (in our case, through writing a policy, adding examples, and giving “AI moderator context”).
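
In practice, this "teaching" happens when the prompt is assembled. The sketch below is a minimal illustration of combining a policy, labeled examples, and moderator context into one classification prompt; the section names and layout are assumptions for illustration, not PolicyAI's actual format.

```python
def build_moderation_prompt(
    policy: str,
    examples: list[tuple[str, str]],
    moderator_context: str,
    content: str,
) -> str:
    """Assemble one classification prompt from the pieces described above:
    policy text, labeled examples, and extra context for the AI moderator.

    The headings and ordering here are illustrative only."""
    example_lines = "\n".join(
        f'Content: "{text}" -> Label: {label}' for text, label in examples
    )
    return (
        f"POLICY:\n{policy}\n\n"
        f"CONTEXT FOR THE AI MODERATOR:\n{moderator_context}\n\n"
        f"LABELED EXAMPLES:\n{example_lines}\n\n"
        f"CONTENT TO CLASSIFY:\n{content}\n\n"
        "Respond with exactly one label defined in the policy."
    )
```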

The importance of structure

Unlike humans, LLMs need structure to function reliably. How you structure and order policies is as important as the policy text itself: a well-structured, logically formatted policy will perform much better than raw, human-readable prose. Put real time and effort into structure, logic, and formatting when writing policies for LLMs to read.
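
As a rough sketch of what that restructuring can look like (the numbering and ALLOWED/VIOLATING labels are illustrative, not a required PolicyAI format), explicit, ordered rules give the model clear boundaries to match against instead of asking it to infer them from prose:

```python
# A rough sketch of restructuring policy prose into ordered, explicit rules.
# The numbering and ALLOWED/VIOLATING labels are illustrative only.
STRUCTURED_POLICY = """\
POLICY: Harassment

1. VIOLATING - Direct insults or slurs aimed at another user.
2. VIOLATING - Encouraging others to pile onto or target a specific user.
3. ALLOWED - Criticism of ideas, products, or public figures that does not
   attack a private individual.
4. ALLOWED - Profanity that is not directed at any person.
"""
```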

The importance of complete data

LLMs also rely heavily on the explicit instructions and definitions provided in your prompt and policy text. They cannot reliably infer intent, access real-world context outside the provided text, or apply external knowledge like local laws unless that information is clearly included or defined in the prompt.

PolicyAI processes each piece of content independently. LLMs do not remember previous content they've classified for the same user or understand concepts like "repeated violations" unless that history or context is explicitly provided as part of the input text for each piece of content.
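
Because each item is classified in isolation, any signal you want the model to weigh (account history, prior strikes, local legal context) has to travel with that item. A minimal sketch, using hypothetical field names and data:

```python
def build_item_input(
    content: str,
    user_history: list[str],
    jurisdiction_note: str,
) -> str:
    """Attach per-item context explicitly, since the model has no memory of
    earlier classifications. The field names here are assumptions."""
    history_lines = "\n".join(f"- {event}" for event in user_history) or "- none"
    return (
        f"CONTENT:\n{content}\n\n"
        f"ACCOUNT HISTORY (resupplied with every item):\n{history_lines}\n\n"
        f"JURISDICTION NOTE:\n{jurisdiction_note}"
    )


# Example usage with hypothetical data:
# item_text = build_item_input(
#     "you people never learn",
#     user_history=["2 prior harassment violations in the last 30 days"],
#     jurisdiction_note="Posted from a region with strict hate-speech law.",
# )
```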

A bias toward caution

LLMs can be inherently cautious or conservative when evaluating content, particularly in sensitive areas like adult content, violence, hate speech, or self-harm. While this caution is intended to prevent the model from generating harmful responses, it can result in over-flagging (false positives) when applied to moderation tasks, especially if your specific policy allows certain types of content within these sensitive domains.

Language constraints

Large Language Models are trained on vast amounts of text data, but that data is unevenly distributed across languages, which can limit moderation effectiveness in some of them. LLMs generally perform best in high-resource languages with abundant training data (e.g., English, Spanish, French, German, Mandarin).

Languages where LLMs might not moderate as effectively, or may require more careful tuning and testing, include: