Predicted Scores
Predicted scoring in Dimension Labs turns unstructured conversational data into structured, quantitative ratings you can filter, trend, and cross-tabulate.
Predictive Scoring
Predictive scores use AI to infer structured ratings from unstructured text. Instead of asking customers how they felt, the model reads the conversation and predicts how they would have answered. The result is a score for every interaction in your dataset — not just the 5–15% where someone filled out a survey.
This guide covers the scoring dimensions available in Dimension Labs — from simplest to most construct-specific — and explains how to build custom predictive scoring prompts that produce reliable, analytically useful output.
Scoring as signal detection
The most basic application of a score is filtering signal from noise. Before you measure satisfaction, you often need to answer a simpler question: is this record worth analyzing at all?
Relevance scoring assigns a 0–10 rating based on whether a record contains meaningful signal for your analysis. It's the entry point to predictive scoring — not measuring experience, just separating the useful from the irrelevant.
Example — Diabetes relevance scoring
Score 0–10 for relevance to the author's personal lived experience with diabetes.
High scores: firsthand experiences including diagnosis, symptoms, blood sugar,
insulin, CGMs, pumps, medications, food management, doctor visits, complications,
or emotional impact. Abbreviations may be used: T1D = type 1 diabetes,
T2D = type 2 diabetes.
Lower scores: generic discussion, news, advocacy, fundraising, jokes, awareness
posts, or secondhand references.
10 = clearly firsthand and highly relevant
0 = not relevant at all
Example — Disney experience relevance
Score 0–10 for relevance to a personal experience visiting a Disney location
or personally using a Disney product.
High scores: firsthand, recent experience only.
Lower scores: links, booking posts, generic recommendations, products purchasable
outside the parks, visitor-count commentary, or trips older than 2–3 years.
Override: if the post contains "RT", "QT", or @Poshmarkspp, assign 1.
10 = clearly firsthand and highly relevant
0 = not relevant at all
Both prompts share a structure: define what a high score means, specify what gets lower scores, and anchor the endpoints. The Disney prompt adds hard override rules for known spam patterns. The diabetes prompt defines high-score triggers in detail because the relevant vocabulary is domain-specific and the model benefits from explicit cues.
When to use it: Social listening, review analysis, survey open-ends, or any data source where a significant portion of records may not contain analyzable signal. Relevance scoring reduces noise before other dimensions run, improving the quality of everything downstream.
How to customize: Define what "relevant" means for your analysis. List the noise patterns in your data source. Add hard overrides for known spam or repost patterns. Anchor the endpoints explicitly.
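As a concrete illustration, here is a minimal sketch of relevance scoring used as a pre-filter, assuming scored records arrive in a pandas DataFrame. The column name relevance_score and the threshold value are assumptions for illustration, not Dimension Labs conventions.

import pandas as pd

# Hypothetical scored export; column names are illustrative only.
records = pd.DataFrame({
    "text": ["My CGM woke me at 3am again", "RT Diabetes Awareness Month!", "Buy strips here"],
    "relevance_score": [9, 1, 0],
})

# Keep only records with meaningful signal before running other dimensions.
THRESHOLD = 6  # assumed cutoff — tune against a hand-labeled sample
analyzable = records[records["relevance_score"] >= THRESHOLD]
print(f"{len(analyzable)} of {len(records)} records pass the relevance filter")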
Predicted Rating
Predicted Rating is the universal starting point for experience scoring — a standard dimension, available out of the box with no configuration, that assigns a 1–10 rating reflecting how well a customer's needs were handled.
An estimation of the success in handling the user's requests,
rated on a scale of 1 to 10 (with 10 being the best).
This prompt is deliberately simple. It doesn't specify a CX construct (satisfaction, effort, advocacy), doesn't require behavioral anchors, and doesn't assume anything about the data source. That generality is the point: Predicted Rating works across support conversations, bot interactions, survey responses, and reviews without any tuning.
It captures a blended, intuitive read of "how did this interaction go?" — a composite signal mixing resolution, tone, effort, and communication quality into a single number. What it doesn't tell you is why something went well or poorly. A Predicted Rating of 4 flags a problem; the decomposed scores (below) explain what caused it.
Cross-tab value: Rating by engagement category shows which topics produce the worst experiences. Rating by agent shows performance variation. Rating by channel shows where the experience breaks down. The score itself is the filter; the other dimensions provide the explanation.
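Here is a rough sketch of what those cross-tabs look like in practice, assuming a pandas DataFrame of scored interactions. The column names (engagement_category, channel, predicted_rating) are illustrative assumptions, not a prescribed export format.

import pandas as pd

# Illustrative interaction-level export; all column names are assumptions.
df = pd.DataFrame({
    "engagement_category": ["billing", "billing", "shipping", "returns", "shipping"],
    "channel": ["chat", "email", "chat", "chat", "email"],
    "predicted_rating": [3, 4, 8, 7, 9],
})

# Mean Predicted Rating by topic: which categories produce the worst experiences?
print(df.groupby("engagement_category")["predicted_rating"].mean().sort_values())

# Rating by topic and channel: where does the experience break down?
print(df.pivot_table(index="engagement_category", columns="channel",
                     values="predicted_rating", aggfunc="mean"))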
Predicted CSAT
Predicted CSAT narrows the measurement from "how did this go?" to a specific construct: customer satisfaction. It maps to the standard CSAT survey instrument — a 5-point scale from Very Dissatisfied to Very Satisfied — which means it can be validated against actual survey data and compared to industry benchmarks.
The simple form
An estimation of the customer's satisfaction with this interaction.
1 = Very Dissatisfied
2 = Somewhat Dissatisfied
3 = Neither satisfied nor dissatisfied
4 = Somewhat Satisfied
5 = Very Satisfied
Output null if insufficient signal.
This works for dashboards and trend monitoring. Its limitation is the same as the survey it mirrors: a score of 2 doesn't tell you whether the customer was dissatisfied with the agent, the outcome, the product, or the process.
Confidence scoring
Any predictive score is an inference — some inferences are stronger than others. A confidence companion field lets you filter for high-certainty scores in reporting and flag ambiguous records for review.
Confidence in the assigned csat_rating.
Assess based on the strength and clarity of sentiment signals in the conversation.
High — Clear, unambiguous satisfaction or dissatisfaction through explicit language,
tone, or repeated sentiment signals.
Medium — Sentiment is inferable but not explicitly stated, or the transcript
contains mixed signals.
Low — Best guess based on limited or ambiguous language. The conversation may be
short, transactional, or lack emotional cues.
If csat_rating is null, output null.
Add a confidence companion whenever transcript length and signal density vary significantly across your dataset. Filter to High confidence for reporting; review Low confidence records to understand where the model is guessing. Confidence scoring applies to any predictive score, not just CSAT.
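A minimal sketch of that workflow, assuming csat_rating and csat_confidence arrive as columns in a pandas DataFrame — the field names mirror the prompts above, but the export format is an assumption:

import pandas as pd

# Hypothetical scored export; nulls become NaN and are skipped by mean().
df = pd.DataFrame({
    "csat_rating": [2, 5, 2, None, 4],
    "csat_confidence": ["High", "High", "Low", None, "Medium"],
})

# Report on high-certainty scores only.
reporting = df[df["csat_confidence"] == "High"]
print("Reporting CSAT:", reporting["csat_rating"].mean())

# Queue ambiguous records for human review to see where the model is guessing.
review_queue = df[df["csat_confidence"] == "Low"]
print(f"{len(review_queue)} records flagged for review")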
Decomposed scoring for conversational data
The simple form works across data sources. Decomposed scoring is designed specifically for conversational data — support transcripts, chat logs, call transcripts — where the interaction involves an agent, a process, and a product.
Satisfaction in these contexts is not one thing. An agent can be excellent while the outcome is poor (policy limitation). The product can be terrible while the agent is empathetic and the outcome is a successful workaround. A single score collapses these signals. Decomposition preserves them — and routes each signal to the team that can act on it.
Agent satisfaction → coaching and training
Rate the customer's satisfaction with the agent only: clarity, empathy,
knowledge, responsiveness, and follow-through.
Ignore: product issues, policies, system limits, wait times, and final
resolution — unless clearly attributed to the agent.
Base the score on the customer's expressed or strongly implied view of the agent.
1 = Very Dissatisfied
2 = Dissatisfied
3 = Neutral
4 = Satisfied
5 = Very Satisfied
Null if no human agent appears or evidence is insufficient.
Outcome satisfaction → process and policy
Rate the customer's satisfaction with the outcome only: whether the issue
was solved, request fulfilled, or question answered.
Ignore: agent tone, empathy, and effort — unless they directly changed
the result.
Base the score on the actual or clearly expected result.
1 = Very Dissatisfied
2 = Dissatisfied
3 = Neutral / Pending / Mixed
4 = Satisfied
5 = Very Satisfied
Null if no outcome stage is reached or evidence is insufficient.
Product satisfaction → product development
Rate the customer's satisfaction with the product or service only.
Ignore: support quality, agent behavior, policy, and process.
Base the score on the customer's expressed or strongly implied view
of the product/service itself.
1 = Very Dissatisfied
2 = Dissatisfied
3 = Neutral / Mixed
4 = Satisfied
5 = Very Satisfied
Null if no evaluable product/service sentiment is expressed
or evidence is insufficient.
Why the prompts are structured this way
Explicit isolation. Each prompt names what to ignore, not just what to evaluate. Without exclusion lists, the model bleeds signals across dimensions — penalizing agents for product failures, or inflating outcome scores because the agent was warm.
Customer's view, not yours. "Base the score on the customer's expressed or strongly implied view" asks the model to read the customer's experience rather than evaluate against an external rubric. This framing can be validated against survey data. A QA-style rubric cannot.
Concrete null conditions. Each prompt's null rules match the situations where that specific dimension has no signal — no agent for agent satisfaction, no resolution stage for outcome, no product sentiment for product satisfaction.
When to use which form
Use the simple form for dashboards, benchmarks, or trend monitoring where diagnosing drivers isn't the goal.
Use the decomposed form when analyzing conversational data and the objective requires understanding why satisfaction is high or low, or when different teams act on different components.
In most conversational data implementations, decomposition is the better default. It costs three fields instead of one, and the analytical value is substantially higher.
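One way to see the extra value, sketched with assumed field names (agent_csat, outcome_csat, product_csat): decomposed scores let you isolate "great agent, poor outcome" interactions, which typically point at policy or process rather than coaching.

import pandas as pd

# Illustrative decomposed scores; field names are assumptions.
df = pd.DataFrame({
    "agent_csat":   [5, 4, 2, 5],
    "outcome_csat": [2, 5, 2, 1],
    "product_csat": [3, 4, 1, 4],
})

# High agent score + low outcome score usually means a policy or
# process limitation — route these to operations, not QA.
policy_limited = df[(df["agent_csat"] >= 4) & (df["outcome_csat"] <= 2)]
print(f"{len(policy_limited)} interactions look policy- or process-limited")

# Per-dimension averages route to the teams that own each signal.
print(df.mean().rename({"agent_csat": "coaching",
                        "outcome_csat": "process/policy",
                        "product_csat": "product dev"}))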
Transactional NPS
NPS is traditionally a relationship metric — how a customer feels about your brand overall. Transactional NPS narrows that to the individual interaction: did this specific experience create or erode brand advocacy?
This is a different construct from satisfaction. A customer can have a satisfactory interaction that wouldn't move them to recommend you. A single bad experience can turn a long-time promoter into a detractor. Transactional NPS asks whether this interaction changed the customer's willingness to put their reputation behind yours.
Research-anchored version
Predict the Net Promoter Score the customer would give based on this interaction.
This measures brand advocacy impact, not just satisfaction.
0–6 = Detractor
Concrete complaints naming specific failures, broken promises or unmet
commitments, repeat contact about unresolved issues, explicit unwillingness
to recommend, language indicating active consideration of alternatives.
7–8 = Passive
Functional and transactional tone, issue resolved but no emotional engagement,
hedging or conditional language ("it's fine", "not bad"), absence of both
frustration and enthusiasm.
9–10 = Promoter
Emotional and relational language, references to specific people or features
that exceeded expectations, expressed enthusiasm or gratitude, language
indicating willingness to recommend or refer others.
Null if the interaction is too short or transactional to carry meaningful
advocacy signal.
Streamlined version
Predict the Net Promoter Score the customer would give based on this interaction.
Focus on whether this experience would make the customer more or less likely
to recommend the brand — not just whether they were satisfied.
0–6 = Detractor (frustrated, unresolved issues, broken commitments)
7–8 = Passive (satisfied but indifferent, functional tone, no strong emotion)
9–10 = Promoter (delighted, exceeded expectations, would actively recommend)
Null if insufficient signal.
The research-anchored version produces more consistent scoring because the behavioral descriptions are detailed enough to reduce model interpretation. The streamlined version is faster to iterate on and sufficient when precision within NPS buckets matters less than the overall distribution.
Where it gets actionable
On its own, transactional NPS tells you the what. The value comes from cross-tabulation. Layering in issue type, product, channel, agent, or resolution status moves you from "we have a detractor problem" to "we have a detractor problem driven by billing disputes in chat that go unresolved after multiple contacts."
Predicted NPS also provides coverage that survey-based NPS can't. Survey response rates run as low as 7%, and promoters are significantly more likely to respond than detractors — meaning the customers you most need to hear from are the least likely to tell you.
A note on directionality: Predicted transactional NPS is a directional indicator, not a reportable NPS number. Treat it as a signal for where to focus — identifying detractor-range interactions, tracking distribution shifts — rather than a number for an earnings report.
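A sketch of that kind of analysis, applying the standard three-bucket methodology to predicted scores. The DataFrame and column names are assumptions for illustration.

import pandas as pd

# Illustrative predicted transactional NPS scores; column names are assumptions.
df = pd.DataFrame({
    "predicted_nps": [2, 9, 7, 10, 4, 8, 3],
    "issue_type": ["billing", "shipping", "billing", "returns",
                   "billing", "returns", "billing"],
})

# Standard three-bucket methodology: 0–6 detractor, 7–8 passive, 9–10 promoter.
df["bucket"] = pd.cut(df["predicted_nps"], bins=[-1, 6, 8, 10],
                      labels=["detractor", "passive", "promoter"])

# Detractor share by issue type: a directional signal for where to focus,
# not a reportable NPS number.
shares = pd.crosstab(df["issue_type"], df["bucket"], normalize="index")
print(shares["detractor"].sort_values(ascending=False))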
Writing custom scoring prompts
The templates above cover the most common scoring needs. When your objective requires a different construct — specific to your business, data source, or research question — these principles produce reliable output.
Predict experience, not quality
Frame the task as "rate the customer's satisfaction with X" rather than "rate the quality of X." Experience prediction reads tone, language, and reactions to infer how the customer felt — and can be validated against survey data. Quality assessment requires a separate QA framework.
Weak prompt:
Rate the quality of the support provided by the agent based on clarity
of communication, accuracy of troubleshooting, empathy shown, and
follow-through on resolution.
1 = Very Poor, 5 = Excellent.
Strong prompt:
Rate the customer's satisfaction with the agent only: clarity, empathy,
knowledge, responsiveness, and follow-through.
Ignore: product issues, policies, system limits, wait times, and final
resolution unless clearly attributed to the agent.
1 = Very Dissatisfied
5 = Very Satisfied
Null if no human agent appears or evidence is insufficient.
The weak version bundles criteria with no isolation and no null conditions. The strong version names what to evaluate, what to ignore, and when to output null.
Name what to ignore
Isolation clauses are as important as scoring criteria. "Rate satisfaction with the agent" is ambiguous — the model may factor in the outcome, the wait, the product. Explicitly listing what to ignore draws the boundary.
Add confidence when signal varies
Pair scores with a confidence field when your data includes transcripts of different lengths and densities. A CSAT of 2 with High confidence is a clear signal. A CSAT of 2 with Low confidence may just be a short exchange with no real sentiment.
Write concrete null conditions
"Null if insufficient signal" is a necessary fallback but shouldn't be the only rule. Name the specific situations: no human agent, transcript too short, no resolution stage, no product sentiment expressed.
Use null, not zero
Zero distorts averages, distributions, and percentile rankings. Always use null for insufficient information.
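A quick illustration of the distortion — five scorable interactions plus two with insufficient signal:

import pandas as pd

scores_with_null = pd.Series([4, 5, 3, 4, 5, None, None])
scores_with_zero = pd.Series([4, 5, 3, 4, 5, 0, 0])

print(scores_with_null.mean())  # 4.2 — nulls are excluded automatically
print(scores_with_zero.mean())  # 3.0 — zeros drag the average down

# The same distortion hits distributions and percentile rankings:
print(scores_with_zero.quantile(0.25))  # pulled toward 0 by non-answers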
Match scale to construct
1–5 for satisfaction (maps to CSAT surveys). 0–10 for NPS (required by three-bucket methodology). 1–10 for general ratings. 0–10 for relevance. Always integers.
Type and enum must agree
// Wrong — type is integer but enum values are strings
"type": "integer",
"enum": ["1", "2", "3", "4", "5"]

// Correct
"type": "integer",
"enum": [1, 2, 3, 4, 5]

Interpretation principles
Scores are directional. Use them to identify patterns, prioritize action, and surface risk — not to replace formal survey reporting. Cross-tabulation against other dimensions produces more reliable insight than any single score in isolation.
Confidence controls quality. Filter to High for reporting. Review Low to understand where the model lacks signal.
Nulls are intentional. Exclude them in analysis — don't treat them as zeros.
Distribution over individual scores. A single prediction is a rough estimate. The distribution across hundreds of interactions is a reliable signal. Track shifts over time.
Decomposed scores route action. Agent scores → CX leadership. Outcome scores → operations. Product scores → product development.
Surveys still matter. Predicted scores extend measurement to every interaction, but ongoing survey collection serves as a calibration anchor. Divergence between predicted and survey distributions is a signal to investigate.
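As an illustration of that calibration check, here is a sketch comparing predicted and surveyed CSAT distributions side by side. The values are made up for demonstration; in practice, compare predicted scores against the subset of interactions that also have a survey response.

import pandas as pd

predicted = pd.Series([5, 4, 4, 2, 5, 3, 4, 5, 1, 4])
surveyed = pd.Series([5, 4, 2, 5, 4])

# Compare normalized distributions rather than raw counts —
# sustained divergence is a signal to revisit the scoring prompt.
comparison = pd.DataFrame({
    "predicted": predicted.value_counts(normalize=True),
    "survey": surveyed.value_counts(normalize=True),
}).fillna(0).sort_index()
print(comparison)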
