Agent Validation

Reliability Exceeds Human Benchmarks

LLMs assign the same predicted scores across independent runs at rates that exceed human benchmarks for the same type of task. Those benchmarks are based on agreement across several hundred codes; at greater volumes, human coders begin to perform worse. In contrast, LLMs can annotate thousands of records or conversations while maintaining agreement.

Validation: Repeatable

We published a validation study on 9,977 fan surveys from five MLB teams. GPT-4.1 read each open-text comment and predicted the 0–10 rating the fan had given on the same survey.

We ran the prompt three independent times against the same text. The prompt was deliberately plain: no examples, no calibration, and no domain-specific instructions.

The result: inter-labeler reliability scores that exceeded human benchmarks on every measure.

Measure	Definition	AI	Human benchmark
Consensus	All three runs identical	80.9%	40–55%
Pairwise Exact	Exact matches between any two runs	87%	65–75%
Pairwise Tolerance	A match within ±1 for any two runs	99.9%	85–95%
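
The three measures in the table can be computed from the per-run scores alone. The following is a minimal sketch, not the study's actual code: the function name `agreement_metrics`, the toy scores, and the choice to pool every pair of runs across all records are assumptions made for illustration.

```python
from itertools import combinations


def agreement_metrics(runs, tolerance=1):
    """Agreement across independent runs.

    runs: list of equal-length score lists, one list per run.
    Pairwise figures are pooled over every pair of runs and every record
    (an assumed reading of "any two runs").
    """
    n_items = len(runs[0])
    if any(len(r) != n_items for r in runs):
        raise ValueError("all runs must score the same records")

    items = list(zip(*runs))  # one tuple of scores per record
    pairs = list(combinations(range(len(runs)), 2))
    total_pairs = n_items * len(pairs)

    # Consensus: every run gave the identical score for the record.
    consensus = sum(len(set(s)) == 1 for s in items) / n_items
    # Pairwise exact: share of (record, run pair) combinations that match.
    exact = sum(s[i] == s[j] for s in items for i, j in pairs) / total_pairs
    # Pairwise tolerance: same, but a match is any difference <= tolerance.
    within = sum(
        abs(s[i] - s[j]) <= tolerance for s in items for i, j in pairs
    ) / total_pairs

    return {
        "consensus": consensus,
        "pairwise_exact": exact,
        "pairwise_tolerance": within,
    }


# Toy data (hypothetical): three runs scoring the same five comments, 0-10.
run_a = [9, 7, 3, 10, 5]
run_b = [9, 7, 4, 10, 5]
run_c = [9, 8, 4, 10, 2]
metrics = agreement_metrics([run_a, run_b, run_c])
print(metrics)
```

On the toy data, two of five records reach full consensus, while 9 of 15 pairwise comparisons match exactly and 13 of 15 fall within ±1. Consensus is the strictest measure because a single disagreeing run breaks it.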

Validation: Recoverable

We ran a validation study on 3,736 Givebutter support conversations that carried a customer-submitted satisfaction rating. GPT-4.1 read each conversation and predicted the 1–5 rating the customer had given, without ever seeing that rating.

Customer ratings are sparse. Of the ~70,260 support conversations in the window (November 2025–April 2026), only 3,736 carried a usable rating — fewer than 1 in 18. The predicted score is generated for every conversation, which is the point: it recovers a satisfaction signal for the ~95% of conversations that would otherwise have none.
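
The coverage figures follow directly from the two counts above. A quick arithmetic check, using the approximate conversation total quoted in the text (variable names are illustrative):

```python
total_conversations = 70_260  # approximate total quoted in the text
rated = 3_736                 # conversations with a usable customer rating

rated_share = rated / total_conversations
unrated_share = 1 - rated_share
one_in = total_conversations / rated

print(f"rated:   {rated_share:.1%}")    # rated:   5.3%
print(f"unrated: {unrated_share:.1%}")  # unrated: 94.7%
print(f"about 1 in {one_in:.1f}")       # about 1 in 18.8
```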

The result was a predicted score that tracked customers' own ratings far more closely than tone-based sentiment, and reliably separated satisfied conversations from dissatisfied ones.

Measure	Definition	AI	Sentiment baseline
Correlation	Agreement with the customer's own 1–5 rating (Pearson r)	0.47	0.36
Low-rating detection	Accuracy separating low (1–3) from high (4–5) ratings	91%	74%
Exact match	Predicted score equals the customer's rating	66%	—
Within ±1	Predicted score within one point of the customer's rating	93%	—
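
For readers who want to reproduce these measures on their own data, here is a minimal sketch. It is not the study's code: the function names and toy ratings are hypothetical, and low-rating detection is implemented as plain accuracy on the low (1–3) versus high (4–5) split, which is an assumed reading of the table's definition.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns NaN if either series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def rating_metrics(predicted, actual, low_max=3):
    """Compare predicted 1-5 scores with customer-submitted 1-5 ratings."""
    n = len(actual)
    pairs = list(zip(predicted, actual))
    return {
        "correlation": pearson_r(predicted, actual),
        # Both scores land on the same side of the low/high cut.
        "low_high_accuracy": sum(
            (p <= low_max) == (a <= low_max) for p, a in pairs
        ) / n,
        "exact_match": sum(p == a for p, a in pairs) / n,
        "within_one": sum(abs(p - a) <= 1 for p, a in pairs) / n,
    }


# Toy data (hypothetical), not drawn from the study.
actual = [5, 5, 4, 5, 2, 5, 1, 4]
predicted = [5, 4, 4, 5, 3, 5, 2, 5]
m = rating_metrics(predicted, actual)
print(m)
```

In the toy data, only half the predictions match exactly, yet every one falls within ±1 and on the correct side of the low/high cut. The four measures can diverge substantially, which is why they are reported side by side.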

Ratings skew strongly positive: 91% are 4 or 5. Exact-match and within-±1 scores partly reflect that skew. The model's clear advantage over sentiment shows in correlation and low-rating detection, where the base rate offers no free lift.
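
To see how skew alone can lift exact-match and within-±1 scores, consider a predictor that answers 5 for every conversation. The sketch below uses a synthetic distribution in which 91 of 100 ratings are 4 or 5; the split within that 91% is invented for illustration and is not taken from the study.

```python
# Synthetic, hypothetical ratings: 91 of 100 are 4 or 5.
actual = [5] * 60 + [4] * 31 + [3] * 4 + [2] * 3 + [1] * 2
predicted = [5] * len(actual)  # constant "always 5" baseline

n = len(actual)
pairs = list(zip(predicted, actual))

exact = sum(p == a for p, a in pairs) / n
within_one = sum(abs(p - a) <= 1 for p, a in pairs) / n

# A constant prediction has zero variance, so Pearson r is undefined:
# the baseline earns no correlation at all.
has_variance = len(set(predicted)) > 1

# Low ratings (1-3) that the baseline also scores as low.
low_total = sum(a <= 3 for a in actual)
low_caught = sum(p <= 3 and a <= 3 for p, a in pairs)

print(exact)                  # 0.6
print(within_one)             # 0.91
print(has_variance)           # False
print(low_caught, low_total)  # 0 9
```

The constant baseline looks respectable on exact match and within ±1 without reading a single conversation, but it has no correlation with the ratings and flags none of the low-rated conversations. How the study's low-rating detection figure was computed is not specified here; the count of low ratings caught is used only as an illustration.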

Mechanics: Why LLM Annotation Is Reliable and Precise

A reasonable question to ask of any LLM-powered system: how do we know the labels are accurate?

The short answer is that enrichment is a recognition task, not a generation task — and LLMs are dramatically more reliable at recognition than at generation.

Task type	What it does	Hallucination risk
Generation	Write something new — an essay, a summary, a piece of code.	Higher. The model is producing content that didn't exist before.
Recognition	Identify whether a concept is present in existing text and label it.	Extremely low. The model is pattern-matching against text that's already there.

When used for enrichment, LLMs deliver on:

- Consistency: the same text gets the same label, without the coder-to-coder variation of a human team.
- Contextual understanding: capturing nuance and meaning, not just keywords.
- Reliability: extremely low hallucination rates on recognition tasks.

Validation studies confirm hallucination rates in enrichment tasks are extremely low. Pattern recognition and contextual understanding are exactly what large language models are best at.