Authored: Fil Jankovic (AI), Alice Hunsberger (T&S)

Let’s say you’ve written a Trust & Safety policy and you’ve picked out an LLM to apply the policy to your platform’s content. How do you compare two different LLMs to see which performs better? Or how do you detect if a change you’ve made to the policy improves or degrades accuracy?

The answer is a golden dataset. In AI evaluation, a golden dataset is your closest approximation to “ground truth.” It’s a set of high-quality labeled examples that you craft, and it paints a full picture of exactly how you want your Trust & Safety policy applied.
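To make that concrete, here’s a minimal sketch of what golden dataset entries might look like stored as JSONL. The field names (content, label, rationale) and the example records are illustrative assumptions, not a prescribed schema:

```python
import json

# Hypothetical golden dataset entries: each record pairs a piece of content
# with the label your policy says it should receive, plus a short rationale.
# Field names and examples are illustrative, not a required format.
golden_examples = [
    {
        "content": "Buy cheap followers now!!! Click this link",
        "label": "spam",
        "rationale": "Unsolicited promotion with a suspicious link.",
    },
    {
        "content": "I disagree with your review, but thanks for the feedback.",
        "label": "allowed",
        "rationale": "Civil disagreement; no policy violation.",
    },
]

# One JSON object per line (JSONL) keeps the dataset easy to version and diff.
with open("golden_dataset.jsonl", "w") as f:
    for example in golden_examples:
        f.write(json.dumps(example) + "\n")
```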

The process, step-by-step:

1. Get clear on the goals

✅ What a golden dataset is for:

🚫 What it’s not for:

2. Prepare a representative dataset of raw examples

The underlying principle is that your golden dataset should be a benchmark that captures what you care about most. So if one model or policy version scores higher accuracy on your golden dataset than another, you can be reasonably confident it’s the better choice.
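As a rough sketch of how that comparison can work in practice, scoring each model against the golden labels might look like the snippet below. The helper names and the commented-out `classify_with_model` call are assumptions for illustration, standing in for whatever LLM call applies your policy to a piece of content:

```python
import json

def load_golden(path: str) -> list[dict]:
    """Load one labeled example per line from a JSONL golden dataset."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def accuracy(golden: list[dict], predictions: list[str]) -> float:
    """Fraction of examples where the model's label matches the golden label."""
    matches = sum(
        example["label"] == predicted
        for example, predicted in zip(golden, predictions)
    )
    return matches / len(golden)

# Hypothetical usage: `classify_with_model` would wrap your LLM call and
# return a label string for each piece of content.
# golden = load_golden("golden_dataset.jsonl")
# preds_a = [classify_with_model("model-a", ex["content"]) for ex in golden]
# preds_b = [classify_with_model("model-b", ex["content"]) for ex in golden]
# print("Model A accuracy:", accuracy(golden, preds_a))
# print("Model B accuracy:", accuracy(golden, preds_b))
```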
