Authored: Fil Jankovic (AI), Alice Hunsberger (T&S)
Let’s say you’ve written a Trust & Safety policy and you’ve picked out an LLM to apply the policy to your platform’s content. How do you compare two different LLMs to see which performs better? Or how do you detect if a change you’ve made to the policy improves or degrades accuracy?
The answer is a golden dataset. In AI evaluation, a golden dataset is your closest approximation to “ground truth.” It’s a set of high-quality labeled examples that you craft, and it paints a full picture of exactly how you want your Trust & Safety policy applied.
The process, step-by-step:
1. Get clear on the goals
✅ What a golden dataset is for:
- Benchmarking to compare different approaches.
- Detecting regressions (when a change to the policy or system results in worse performance).
- Ensuring important cases, both rare and common, are handled well.
🚫 What it’s not for:
- Reporting overall production accuracy – A golden dataset is deliberately overweighted with tricky scenarios, so any metrics you compute on it will be biased downward relative to everyday traffic. If you want to calculate real-world accuracy, take a random sample of production volume (not your golden dataset) and compare your team’s labels to the AI model’s labels (a quick sketch of this check follows this list).
- Training the production model – Think of the golden dataset as the answer key to a final exam. If you were the teacher, you wouldn’t teach by handing out the answers; you’d teach the concepts and then use the test to evaluate. Like a test key, the golden dataset should contain real-world examples that are not used for fine-tuning, training or prompt-engineering.
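The production-accuracy check mentioned above can be sketched in a few lines of Python. Everything here is a placeholder: `estimate_production_accuracy` is an invented helper, not part of any particular tool, and the hard-coded label pairs stand in for a random slice of production decisions that your team has re-labeled.

```python
import random

def estimate_production_accuracy(reviewed_items, sample_size=200, seed=0):
    """Estimate real-world accuracy from a random sample of production traffic.

    `reviewed_items` is a list of (human_label, model_label) pairs, produced by
    having your team re-label decisions drawn at random from production volume.
    """
    random.seed(seed)
    sample = random.sample(reviewed_items, k=min(sample_size, len(reviewed_items)))
    agreements = sum(1 for human, model in sample if human == model)
    return agreements / len(sample)

# Illustrative pairs only; in practice these come from your review tooling.
reviewed = [
    ("violating", "violating"),
    ("benign", "benign"),
    ("benign", "violating"),
    ("violating", "violating"),
]
print(f"Estimated production accuracy: {estimate_production_accuracy(reviewed):.0%}")
```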
2. Prepare a representative dataset of raw examples
- Pull from real-world logs.
- Make sure the dataset expresses the full variety of cases you’ll see in production, such as edge cases, borderline examples, and obvious cases.
- Include both positive and negative examples for each area of your policy (scams, self-harm, etc.).
- The examples should be separate from the ones you use to train or fine-tune your model.
- Keep the dataset as small as possible while still covering all important scenarios (see the sketch just after this list for a simple coverage check). This will usually be 30–100 examples for basic policies, but the right size depends on policy complexity and how precisely you want to pin down the decision boundary.
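As a concrete illustration, a golden dataset can be as simple as a list of labeled records tagged by policy area and difficulty, with a quick check that every policy area has both violating and benign examples. The schema, field names, and example texts below are invented for the sketch, not a required format.

```python
from collections import defaultdict

# Each golden example records the content, the expected ("gold") label,
# the policy area it exercises, and roughly how hard the case is.
golden_dataset = [
    {"text": "Win a free phone, just send $50 first!", "label": "violating",
     "policy_area": "scams", "difficulty": "obvious"},
    {"text": "This phone was a great deal at $50.", "label": "benign",
     "policy_area": "scams", "difficulty": "borderline"},
    {"text": "BUY NOW!!! Limited offer, click the link in my profile!", "label": "violating",
     "policy_area": "spam", "difficulty": "obvious"},
    {"text": "I posted the recipe link in my profile, as promised.", "label": "benign",
     "policy_area": "spam", "difficulty": "borderline"},
]

def check_coverage(dataset):
    """Flag any policy area that is missing violating or benign examples."""
    labels_by_area = defaultdict(set)
    for example in dataset:
        labels_by_area[example["policy_area"]].add(example["label"])
    for area, labels in sorted(labels_by_area.items()):
        missing = {"violating", "benign"} - labels
        if missing:
            print(f"{area}: missing {', '.join(sorted(missing))} examples")
        else:
            print(f"{area}: covered with both violating and benign examples")

check_coverage(golden_dataset)
```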
The underlying principle is that your golden dataset should be a benchmark that captures what you care about most. So if one model or policy has higher accuracy on your golden dataset than another, that gives you good confidence about which one to choose.
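In code, that comparison might look like the sketch below, which reuses the `golden_dataset` list from the previous sketch. The two `classify_with_policy_*` functions are toy stand-ins for whatever models or prompt/policy versions you are actually evaluating; the point is the shape of the benchmark, not the classifiers themselves.

```python
def accuracy_on_golden_set(classify, golden_dataset):
    """Score a classifier (any callable mapping text -> label) against the golden labels."""
    correct = sum(1 for ex in golden_dataset if classify(ex["text"]) == ex["label"])
    return correct / len(golden_dataset)

# Toy classifiers standing in for "LLM with policy v1" and "LLM with policy v2".
def classify_with_policy_v1(text: str) -> str:
    return "violating" if "$" in text else "benign"

def classify_with_policy_v2(text: str) -> str:
    return "violating" if "send $" in text.lower() else "benign"

v1_accuracy = accuracy_on_golden_set(classify_with_policy_v1, golden_dataset)
v2_accuracy = accuracy_on_golden_set(classify_with_policy_v2, golden_dataset)
print(f"policy v1: {v1_accuracy:.0%}  policy v2: {v2_accuracy:.0%}")
if v2_accuracy < v1_accuracy:
    print("Regression: the change made golden-set accuracy worse.")
else:
    print("The change holds up (or improves) on the golden set.")
```

Note that both toy classifiers miss the spam example above, which is exactly the kind of gap a well-constructed golden dataset is designed to surface.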
