Using AI to Interpret and Summarise Machine Learning Results

A trained model is only half the work. Translating PR-AUC, confusion matrices, and SHAP plots into a story stakeholders can act on is the other half — and the half most often skipped. AI is excellent at this translation when you brief it well. This topic shows you how to turn model artefacts into decision-ready summaries.

1. Introduction

Most model reviews fail in the room. The data scientist walks through the architecture, the cross-validation strategy, the precision-recall curve, and a SHAP summary plot — and the business audience nods politely while quietly deciding not to deploy. The story is technically correct but commercially mute. AI can transform raw metrics and explanation plots into language that supports decisions: who the model helps, where it fails, what the cost of errors is, and whether the trade-offs are acceptable. This tutorial gives you the prompts to do that translation reliably.

2. The Concept Explained

Interpreting a model has two layers. The first is global interpretation: how does the model perform overall, what features matter, where does it fail? The second is local interpretation: for this individual prediction, why did the model say what it said? The two require different prompts and different audiences. Executives want the global story; customers and analysts often need the local one.

Think of it like reviewing a footballer's season. The global view is the stats line — goals, assists, minutes played, conversion rate. The local view is the highlight reel — what happened on this particular goal, who passed, what the defender did wrong. Both views are needed, and both can be summarised by AI if you provide the underlying numbers.

The "result card" template

For every model review, prepare a result card the AI can summarise. It contains seven blocks: (1) problem framing, (2) data sizes, (3) primary metric on test, (4) operational metrics (precision at top-k, calibration), (5) top feature importances, (6) failure-mode segments, (7) cost asymmetry. Paste the card into the prompt and the AI will write a tight narrative tailored to your audience.
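One way to keep the card consistent between reviews is to hold it in a small data structure and render it to text. The sketch below is a minimal illustration, not a standard: the class name ResultCard, its field names, and the to_prompt_text method are all assumptions made for this example. The sample values come from the churn example used later in this topic, and the failure-segment entry is left as a placeholder because that example does not supply one.

```python
from dataclasses import dataclass, asdict


@dataclass
class ResultCard:
    """Seven-block result card. Field names are illustrative, not a standard."""
    problem_framing: str
    data_sizes: dict
    primary_metric: dict
    operational_metrics: dict
    top_features: list
    failure_segments: list
    cost_asymmetry: dict

    def to_prompt_text(self) -> str:
        """Render each block as a numbered, labelled line for pasting into a prompt."""
        lines = []
        for i, (name, value) in enumerate(asdict(self).items(), start=1):
            label = name.replace("_", " ").capitalize()
            lines.append(f"({i}) {label}: {value}")
        return "\n".join(lines)


card = ResultCard(
    problem_framing="Churn prediction (binary classification)",
    data_sizes={"customers_df_rows": 480_000, "test_rows": 70_000},
    primary_metric={"pr_auc": 0.41, "baseline_prevalence": 0.07},
    operational_metrics={
        "recall_at_top_8pct": "0.39",
        "calibration": "over-confident above 0.85",
    },
    top_features=["support_tickets_30d (+)", "feature_usage_score (-)"],
    # Placeholder: the worked example does not list failure-mode segments.
    failure_segments=["<fill in: segments where the model underperforms>"],
    cost_asymmetry={
        "outreach_cost_gbp": 18.0,
        "revenue_saved_per_retained_gbp": 840.0,
    },
)

print(card.to_prompt_text())
```

Because every block is a required field, a card with a missing block fails at construction time rather than producing a thinner prompt without anyone noticing.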

3. The Problem Without This Technique

Weak prompt

Summarise this confusion matrix for my boss.

No audience definition, no business context, no cost asymmetry. The AI will produce a generic textbook description ("the model has X true positives and Y false negatives") that won't help anyone make a decision. The boss will read it, blink, and ask the same question again.

Stronger prompt

Act as a senior ML lead writing for the VP of Marketing.

Model context: churn prediction (binary classification),
trained on customers_df (~480k rows), test set ~70k.
Threshold chosen to flag top 8% highest-risk customers
for proactive outreach (≈5,600 customers/month).

Results (test set, threshold = 0.62):
- PR-AUC: 0.41 (baseline prevalence: 0.07)
- Precision at top-8%: 0.35
- Recall at top-8%: 0.39
- Calibration: well-calibrated for scores < 0.6,
  over-confident above 0.85.
- Confusion matrix at threshold:
    TP: 1,940  FN: 3,020
    FP: 3,660  TN: 61,380
- Top SHAP features: support_tickets_30d (+),
  feature_usage_score (-), upgrade_events (-),
  days_since_last_login (+).

Business levers:
- Outreach cost per flagged customer: £18.
- Average annual revenue saved per retained
  high-risk customer: £840.

Audience: VP of Marketing (low tolerance for
technical detail). Length: ~120 words.
Output: 1) headline, 2) what this means
in revenue terms, 3) recommended next action,
4) one caveat.

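You get a four-paragraph summary that opens with "the model flags 5,600 customers per month, of whom ~1,940 are real churners". It works through the outreach economics (5,600 × £18 ≈ £100k spend against 1,940 × £840 ≈ £1.6M in saved annual revenue, an upper bound that assumes every correctly flagged churner is retained) and recommends a controlled rollout. The VP can act on it.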
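It is worth checking the arithmetic yourself before trusting any AI-written summary of it. The sketch below recomputes the headline figures from the confusion matrix and the two business levers in the prompt. The function name outreach_economics and the retention_rate parameter are assumptions for illustration; a retention rate of 1.0 means every correctly flagged churner is retained, which is the optimistic case.

```python
def outreach_economics(tp, fp, fn, tn, cost_per_flag, revenue_per_save,
                       retention_rate=1.0):
    """Derive rates and outreach economics from a confusion matrix.

    retention_rate is a hypothetical lever: the share of correctly
    flagged churners that outreach actually retains.
    """
    flagged = tp + fp            # everyone the model sends to outreach
    positives = tp + fn          # everyone who really churns
    total = tp + fp + fn + tn
    return {
        "flagged": flagged,
        "prevalence": positives / total,
        "precision": tp / flagged,
        "recall": tp / positives,
        "spend": flagged * cost_per_flag,
        "revenue_saved": tp * retention_rate * revenue_per_save,
    }


econ = outreach_economics(
    tp=1940, fp=3660, fn=3020, tn=61380,
    cost_per_flag=18, revenue_per_save=840,
)

print(f"Flagged: {econ['flagged']:,}")                     # Flagged: 5,600
print(f"Precision: {econ['precision']:.3f}")               # Precision: 0.346
print(f"Spend: £{econ['spend']:,.0f}")                     # Spend: £100,800
print(f"Revenue saved: £{econ['revenue_saved']:,.0f}")     # Revenue saved: £1,629,600
```

Re-running the function with a lower retention_rate shows how quickly the saved-revenue figure shrinks, which is a natural candidate for the "one caveat" the prompt asks for.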

4. The Solution

The pattern is: model context → result card → business levers → audience → output format. The result card is what makes a specific summary possible; the cost asymmetry is what gives it a point of view. Once the AI knows the monetary value of a true positive versus the cost of a false positive, the summary stops being a metrics tour and starts being a recommendation.
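The five-part pattern can be enforced with a small helper that refuses to build a prompt when a part is missing. This is a minimal sketch: the function name build_summary_prompt and the section labels are assumptions chosen to mirror the stronger prompt shown earlier, not a fixed convention.

```python
# Section labels, in the order the pattern prescribes.
SECTION_ORDER = ("Model context", "Results", "Business levers", "Audience", "Output")


def build_summary_prompt(role, sections):
    """Assemble the prompt in pattern order; fail loudly if a part is missing."""
    missing = [name for name in SECTION_ORDER
               if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"Missing prompt sections: {', '.join(missing)}")
    parts = [f"Act as {role}."]
    for name in SECTION_ORDER:
        parts.append(f"{name}:\n{sections[name].strip()}")
    return "\n\n".join(parts)


sections = {
    "Model context": "Churn prediction (binary classification), test set ~70k.",
    "Results": "PR-AUC: 0.41 (baseline prevalence: 0.07)",
    "Business levers": "Outreach cost per flagged customer: £18.",
    "Audience": "VP of Marketing (low tolerance for technical detail). Length: ~120 words.",
    "Output": "1) headline, 2) what this means in revenue terms, "
              "3) recommended next action, 4) one caveat.",
}

prompt = build_summary_prompt(
    "a senior ML lead writing for the VP of Marketing", sections
)
print(prompt)
```

The check on missing sections is the useful part: the business levers are the block most likely to be left out, and without them the summary falls back to describing metrics.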

For SHAP interpretation specifically, paste the top-N feature importances and the direction of effect, plus one or two example customers with their SHAP values. The AI can write paragraphs like "customers flagged as high-risk are typically those with rising support ticket counts and falling product usage — exactly the behavioural pattern customer success teams already recognise". That is the kind of sentence that earns model adoption.
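To turn a table of SHAP values into the "top-N features with direction of effect" text that the prompt needs, something like the following can work. It is a sketch under stated assumptions: the SHAP values are already computed elsewhere (for example with the shap library) and passed in as plain rows, the function name shap_summary_lines is hypothetical, and the direction is a simple heuristic (the sign of the covariance between a feature's value and its SHAP value), not something SHAP reports itself. The three rows of numbers are made up purely to show the output format.

```python
def shap_summary_lines(shap_rows, feature_rows, feature_names, top_n=4):
    """Rank features by mean |SHAP| and attach a heuristic direction sign."""
    n = len(shap_rows)
    stats = []
    for j, name in enumerate(feature_names):
        s = [row[j] for row in shap_rows]      # SHAP values for feature j
        x = [row[j] for row in feature_rows]   # raw feature values for feature j
        importance = sum(abs(v) for v in s) / n
        mean_s, mean_x = sum(s) / n, sum(x) / n
        # Heuristic: positive covariance means higher values push the score up.
        cov = sum((a - mean_x) * (b - mean_s) for a, b in zip(x, s))
        sign = "+" if cov > 0 else "-" if cov < 0 else "?"
        stats.append((importance, name, sign))
    ranked = sorted(stats, key=lambda item: item[0], reverse=True)
    return [f"{name} ({sign}), mean |SHAP| = {imp:.3f}"
            for imp, name, sign in ranked[:top_n]]


# Made-up values for three customers and two features (illustration only).
feature_names = ["support_tickets_30d", "feature_usage_score"]
feature_rows = [[1, 0.9], [5, 0.2], [3, 0.5]]
shap_rows = [[-0.10, -0.05], [0.30, 0.12], [0.05, 0.02]]

for line in shap_summary_lines(shap_rows, feature_rows, feature_names):
    print(line)
# support_tickets_30d (+), mean |SHAP| = 0.150
# feature_usage_score (-), mean |SHAP| = 0.063
```

For features with non-monotonic effects the sign heuristic can mislead, so it is safer to check the direction against a dependence plot before pasting it into a prompt.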

5. Step-by-Step Breakdown

Build the result card. Seven blocks: framing, sizes, primary metric, operational metrics, top features, failure-mode segments, cost asymmetry.
State the cost of each error type. £ per false negative, £ per false positive. This single fact transforms the summary.
Name the audience. Tolerance for technical detail, the decision they need to make, the format they prefer (deck, email, memo).
Demand the output structure. Headline, business impact, recommendation, caveat — the same four-piece template that works for any analytical communication.
For SHAP / local interpretation, supply examples. Two or three individual predictions with their SHAP values turn abstract feature importance into concrete narrative.
Critique pass. "Rewrite removing all hedging adjectives and replacing them with concrete numbers." Always.
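The step about stating the cost of each error type can be made concrete with a break-even calculation: how precise must the model be before flagging a customer pays for itself? The sketch below uses the £18 outreach cost and £840 retained revenue from the churn example. The function names are hypothetical, and the calculation assumes that every correctly flagged churner is retained, so the result is a best case.

```python
def break_even_precision(cost_per_flag, value_per_true_positive):
    """Precision at which expected revenue from a flag equals its cost."""
    return cost_per_flag / value_per_true_positive


def net_value_per_flag(precision, cost_per_flag, value_per_true_positive):
    """Expected net value of flagging one customer at a given precision."""
    return precision * value_per_true_positive - cost_per_flag


threshold = break_even_precision(cost_per_flag=18, value_per_true_positive=840)
observed_precision = 1940 / 5600  # TP / (TP + FP) from the confusion matrix
net = net_value_per_flag(observed_precision, 18, 840)

print(f"Break-even precision: {threshold:.3f}")   # Break-even precision: 0.021
print(f"Net value per flag: £{net:.0f}")          # Net value per flag: £273
```

With a miss costing far more than a wasted contact, the break-even precision sits well below the observed precision, which is the kind of asymmetry the summary should state explicitly.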

Tip: Keep your "result card" template version-controlled alongside the model. Every retrain produces a fresh card; every card produces a fresh summary. Over a year, you build an audit trail of model performance and storytelling.
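A simple way to make the card friendly to version control is to write it as JSON with sorted keys, so that diffs between retrains show only the numbers that changed. This is a minimal sketch: the function name save_result_card, the file naming scheme, and the "v1" version label are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_result_card(card, directory, model_version):
    """Write the card as stable, diff-friendly JSON named after the model version."""
    path = Path(directory) / f"result_card_{model_version}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys and a fixed indent keep the file layout identical across runs.
    text = json.dumps(card, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return path


# Demo in a temporary directory; in practice this would be the model's repository.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_result_card({"primary_metric": {"pr_auc": 0.41}}, tmp, "v1")
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
    print(saved.name, reloaded)
```

Committing one file per model version alongside the training code gives the audit trail described in the tip without any extra tooling.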

6. Practice Exercises

Exercise 1

For your most recent model, fill in the seven-block result card from scratch. Paste it into an AI assistant three times, each with a different audience definition (CEO, product manager, peer data scientist). Compare the three summaries and note how the level of detail rises and falls with the audience.

Exercise 2

Prompt: "Given these top 10 SHAP feature importances and three example predictions (with per-feature contributions), write a paragraph that a customer success manager could use to understand why a specific customer was flagged. Include actionable suggestions for outreach."

Exercise 3

Ask AI to write a "model card" template (the kind popularised by researchers at Google) tailored to your team's domain. It should include sections for intended use, performance per segment, known failure modes, ethical considerations, and a sign-off checklist for production deployment.

7. Key Takeaways

Model interpretation has two layers — global and local — and they require different prompts and audiences.
Build a seven-block result card per model: framing, sizes, primary metric, operational metrics, features, failure segments, cost asymmetry.
State the cost of each error type in business units; this transforms the AI's summary from a metrics tour into a recommendation.
For SHAP narratives, paste both the global feature ranking and two or three local examples.
Use the same four-piece template — headline, impact, recommendation, caveat — that works for any analytical communication.
