Agent Validation
You can rely on our Agent to deliver repeatable results. We've done the research.
Reliability Exceeding Human Benchmarks
Validation Study
We published a validation study on 9,977 fan surveys from five MLB teams. GPT-4.1 read each open-text comment and predicted the 0–10 rating the fan had given on the same survey.
We ran the same prompt three independent times on each comment. The prompt was deliberately plain: no examples, no calibration, and no domain-specific instructions.
Agreement across the three runs exceeded human inter-labeler benchmarks on all three measures.
| Measure | Definition | AI (three runs) | Human benchmark |
|---|---|---|---|
| Consensus | All three runs gave the identical rating | 80.9% | 40–55% |
| Pairwise Exact | Any two runs gave the identical rating | 87% | 65–75% |
| Pairwise Tolerance | Any two runs gave ratings within ±1 of each other | 99.9% | 85–95% |
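
The three measures in the table can be computed from repeated runs along the following lines. This is a minimal Python sketch, not the code used in the study: the function name `agreement_metrics` and the sample ratings are hypothetical, and it assumes the pairwise measures are pooled over every pair of runs on every comment.

```python
from itertools import combinations


def agreement_metrics(runs, tolerance=1):
    """Compute run-to-run agreement for repeated labeling of the same items.

    runs: a list of equal-length lists of integer ratings, one list per run.
    tolerance: maximum absolute difference that still counts as a match
               for the pairwise-tolerance measure (assumed to be 1 here).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure agreement")
    n_items = len(runs[0])
    if n_items == 0 or any(len(run) != n_items for run in runs):
        raise ValueError("all runs must rate the same non-empty set of items")

    # Regroup so each tuple holds every run's rating for one item.
    per_item = list(zip(*runs))

    # Consensus: share of items where every run produced the same rating.
    consensus = sum(len(set(ratings)) == 1 for ratings in per_item) / n_items

    # Pool every pair of runs on every item (3 pairs per item for 3 runs).
    pairs = [pair for ratings in per_item for pair in combinations(ratings, 2)]
    pairwise_exact = sum(a == b for a, b in pairs) / len(pairs)
    pairwise_tolerance = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)

    return {
        "consensus": consensus,
        "pairwise_exact": pairwise_exact,
        "pairwise_tolerance": pairwise_tolerance,
    }


# Made-up ratings for five comments, three runs each (not data from the study).
run_a = [9, 7, 3, 10, 5]
run_b = [9, 7, 4, 10, 5]
run_c = [9, 8, 4, 10, 2]

result = agreement_metrics([run_a, run_b, run_c])
for name, value in result.items():
    print(f"{name}: {value:.1%}")
# consensus: 40.0%
# pairwise_exact: 60.0%
# pairwise_tolerance: 86.7%
```

Consensus is the strictest of the three, since one dissenting run breaks it for that comment, which is why it sits below both pairwise figures in the table.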
Reliability and Precision
A reasonable question to ask of any LLM-powered system: how do we know the labels are accurate?
The short answer is that enrichment is a recognition task, not a generation task, and LLMs are far more reliable at recognition than at generation.
| Task type | What it does | Hallucination risk |
|---|---|---|
| Generation | Write something new — an essay, a summary, a piece of code. | Higher. The model is producing content that didn't exist before. |
| Recognition | Identify whether a concept is present in existing text and label it. | Extremely low. The model is pattern-matching against text that's already there. |
When used for enrichment, LLMs deliver on:
- Consistency: none of the coder-to-coder variation that comes with human labeling.
- Contextual understanding: capturing nuance and meaning, not just keywords.
- Reliability: extremely low hallucination rates on recognition tasks.
Validation studies confirm hallucination rates in enrichment tasks are extremely low. Pattern recognition and contextual understanding are exactly what large language models are best at.
